
[Cloud Computing and Big Data] Hadoop MapReduce in Practice: Word Count, Average Word Length, and Grep (with source code)


This article walks through several examples — WordCount, WordMean, and Grep — to illustrate practical applications of MapReduce. The programming environment for all of them is based on Hadoop MapReduce.

1. WordCount

WordCount counts the number of occurrences of each word in a set of files, a task that fits MapReduce naturally. The idea is simple: in the Map phase, process the data in each input split and emit <word, 1> key-value pairs; in the Reduce phase, sum the values that share the same key, producing the final count for every word.
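The two phases described above can be sketched as a small local simulation in plain Java (this is illustrative code, not the actual Hadoop or ODPS API; the class and method names are made up for the sketch):

```java
import java.util.*;

// Local simulation of the WordCount map and reduce phases.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in the split.
    static List<Map.Entry<String, Long>> map(String split) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1L));
        }
        return pairs;
    }

    // Reduce phase: sum the values that share the same key.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = reduce(map("the quick fox the fox"));
        System.out.println(counts);  // {fox=2, quick=1, the=2}
    }
}
```

In the real job the shuffle between the two phases groups pairs by key across machines; the single `reduce` call here stands in for that grouping.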

The execution flow is illustrated below.

The output looks like this:

2. WordMean

With a small modification to the code above, we can instead compute the average length of the words across all files, where a word's length is its number of characters. Given a large number of files in the HDFS cluster, the job computes the average length over every word that appears in them.
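The modification boils down to emitting two running sums instead of per-word counts: the mapper emits the total character count and the word count, and the reducer divides one by the other. A minimal local sketch (illustrative names, not the real job's API):

```java
import java.util.*;

// Local sketch of the WordMean logic: the mapper would emit, for each
// word, a ("length", len) pair and a ("count", 1) pair; the reducer
// sums both and divides to get the mean word length.
public class WordMeanSketch {

    static double meanWordLength(List<String> lines) {
        long totalChars = 0;   // sum of all ("length", len) values
        long totalWords = 0;   // sum of all ("count", 1) values
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                totalChars += word.length();
                totalWords += 1;
            }
        }
        // The final division happens once, after all sums are merged.
        return (double) totalChars / totalWords;
    }

    public static void main(String[] args) {
        System.out.println(meanWordLength(List.of("ab abcd")));  // 3.0
    }
}
```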

3. Grep

This example again operates on words in large-scale text. The goal is to provide functionality similar to the Linux grep command: find all files that match a target string, and count how many times the target string appears in each file.

In the Map phase, using the file split information, emit a <filename, 1> key-value pair for each occurrence of the target string found in the split.

In the Reduce phase, merge the Map output by filename, summing the counts per file.
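The two phases can be sketched locally as follows; the file contents are passed in as a map so the sketch stays self-contained, and all names here are illustrative rather than the real job's API:

```java
import java.util.*;

// Local sketch of the Grep job: for each occurrence of the target
// string in a file's split, the mapper emits (filename, 1); the
// reducer then sums the pairs per filename.
public class GrepSketch {

    static Map<String, Long> grep(Map<String, String> files, String target) {
        Map<String, Long> hits = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            long count = 0;
            int idx = file.getValue().indexOf(target);
            while (idx >= 0) {
                count++;  // each match stands for one (filename, 1) pair
                idx = file.getValue().indexOf(target, idx + 1);
            }
            // Reducer side: only files that matched appear in the output.
            if (count > 0) hits.put(file.getKey(), count);
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> files =
                Map.of("a.txt", "foo bar foo", "b.txt", "bar");
        System.out.println(grep(files, "foo"));  // {a.txt=2}
    }
}
```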

The result looks like this:

4. Code

The WordCount implementation is shown below. Note that this sample is written against the Aliyun ODPS MapReduce SDK, whose programming model mirrors Hadoop MapReduce: input and output are ODPS tables rather than HDFS files.

```java
package alibook.odps;

import java.io.IOException;
import java.util.Iterator;

import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.TableInfo;
import com.aliyun.odps.mapred.JobClient;
import com.aliyun.odps.mapred.MapperBase;
import com.aliyun.odps.mapred.ReducerBase;
import com.aliyun.odps.mapred.conf.JobConf;
import com.aliyun.odps.mapred.utils.InputUtils;
import com.aliyun.odps.mapred.utils.OutputUtils;
import com.aliyun.odps.mapred.utils.SchemaUtils;

public class WordCount {

    public static class TokenizerMapper extends MapperBase {
        private Record word;
        private Record one;

        @Override
        public void setup(TaskContext context) throws IOException {
            word = context.createMapOutputKeyRecord();
            one = context.createMapOutputValueRecord();
            one.set(new Object[] { 1L });
            System.out.println("TaskID:" + context.getTaskID().toString());
        }

        @Override
        public void map(long recordNum, Record record, TaskContext context)
                throws IOException {
            // Emit a <word, 1> pair for every column of the input record.
            for (int i = 0; i < record.getColumnCount(); i++) {
                word.set(new Object[] { record.get(i).toString() });
                context.write(word, one);
            }
        }
    }

    /**
     * A combiner class that combines map output by summing the counts.
     */
    public static class SumCombiner extends ReducerBase {
        private Record count;

        @Override
        public void setup(TaskContext context) throws IOException {
            count = context.createMapOutputValueRecord();
        }

        @Override
        public void reduce(Record key, Iterator<Record> values, TaskContext context)
                throws IOException {
            long c = 0;
            while (values.hasNext()) {
                Record val = values.next();
                c += (Long) val.get(0);
            }
            count.set(0, c);
            context.write(key, count);
        }
    }

    /**
     * A reducer class that just emits the sum of the input values.
     */
    public static class SumReducer extends ReducerBase {
        private Record result = null;

        @Override
        public void setup(TaskContext context) throws IOException {
            result = context.createOutputRecord();
        }

        @Override
        public void reduce(Record key, Iterator<Record> values, TaskContext context)
                throws IOException {
            long count = 0;
            while (values.hasNext()) {
                Record val = values.next();
                count += (Long) val.get(0);
            }
            result.set(0, key.get(0));
            result.set(1, count);
            context.write(result);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <in_table> <out_table>");
            System.exit(2);
        }
        JobConf job = new JobConf();
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeySchema(SchemaUtils.fromString("word:string"));
        job.setMapOutputValueSchema(SchemaUtils.fromString("count:bigint"));
        InputUtils.addTable(TableInfo.builder().tableName(args[0]).build(), job);
        OutputUtils.addTable(TableInfo.builder().tableName(args[1]).build(), job);
        JobClient.runJob(job);
    }
}
```

The pom.xml is as follows:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>alibook</groupId>
  <artifactId>odps</artifactId>
  <version>0.0.1</version>
  <packaging>jar</packaging>
  <name>odps</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.aliyun.odps</groupId>
      <artifactId>odps-sdk-core</artifactId>
      <version>0.23.3-public</version>
    </dependency>
    <dependency>
      <groupId>com.aliyun.odps</groupId>
      <artifactId>odps-sdk-commons</artifactId>
      <version>0.23.3-public</version>
    </dependency>
    <dependency>
      <groupId>com.aliyun.odps</groupId>
      <artifactId>odps-sdk-mapred</artifactId>
      <version>0.23.3-public</version>
    </dependency>
  </dependencies>
</project>
```


Source: https://www.wpsshop.cn/w/小丑西瓜9/article/detail/337648 (contributed by a community user; copyright remains with the original author).