Hadoop Starter Example (Part 1): WordCount
1. Requirements
In big data work you often need to count how many times each word occurs, for example to compute a top-N list. Below is an introductory Hadoop example that counts the words in one or more text files.
1.1 Input
One or more text files. The test input used in this example is:
aa bb cc aa aa aa dd dd ee ee ee ee
ff aa bb zks
ee kks
ee zz zks
1.2 Output
Count every word that appears in the text and write each word with its count to the specified output path.
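To make the expected result concrete, here is a minimal single-machine sketch of the same counting logic using only the JDK. The class name LocalWordCount and the hard-coded sample lines are just for illustration; this is not part of the MapReduce job shown below.

import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        String[] lines = {
            "aa bb cc aa aa aa dd dd ee ee ee ee",
            "ff aa bb zks",
            "ee kks",
            "ee zz zks"
        };
        // Count every whitespace-separated word; TreeMap keeps the words sorted like the job output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        // Prints "aa 5", "bb 2", ... matching the expected output in section 3.
        counts.forEach((word, count) -> System.out.println(word + " " + count));
    }
}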
2. Code
<span class="hljs-keyword">package</span> com.myhadoop.mapreduce.test;
<span class="hljs-keyword">import</span> java.io.IOException;
<span class="hljs-keyword">import</span> java.util.*;
<span class="hljs-keyword">import</span> org.apache.hadoop.fs.Path;
<span class="hljs-keyword">import</span> org.apache.hadoop.conf.*;
<span class="hljs-keyword">import</span> org.apache.hadoop.io.*;
<span class="hljs-keyword">import</span> org.apache.hadoop.mapreduce.lib.input.*;
<span class="hljs-keyword">import</span> org.apache.hadoop.mapreduce.lib.output.*;
<span class="hljs-keyword">import</span> org.apache.hadoop.mapreduce.*;
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WordCount</span>{</span>
<span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Map</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Mapper</span><<span class="hljs-title">LongWritable</span>, <span class="hljs-title">Text</span>, <span class="hljs-title">Text</span>, <span class="hljs-title">IntWritable</span>>
{</span>
<span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">static</span> IntWritable one = <span class="hljs-keyword">new</span> IntWritable(<span class="hljs-number">1</span>);
<span class="hljs-keyword">private</span> Text word = <span class="hljs-keyword">new</span> Text();
<span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">map</span>(LongWritable key,Text value,Context context) <span class="hljs-keyword">throws</span> IOException,InterruptedException
{
String lines = value.toString();
StringTokenizer tokenizer = <span class="hljs-keyword">new</span> StringTokenizer(lines,<span class="hljs-string">" "</span>);
<span class="hljs-keyword">while</span>(tokenizer.hasMoreElements())
{
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
<span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Reduce</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">Reducer</span><<span class="hljs-title">Text</span>, <span class="hljs-title">IntWritable</span>, <span class="hljs-title">Text</span>, <span class="hljs-title">IntWritable</span>>
{</span>
<span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">reduce</span>(Text key,Iterable<IntWritable> values,Context context) <span class="hljs-keyword">throws</span> IOException,InterruptedException
{
<span class="hljs-keyword">int</span> sum = <span class="hljs-number">0</span>;
<span class="hljs-keyword">for</span> (IntWritable val : values) {
sum += val.get();
}
context.write(key, <span class="hljs-keyword">new</span> IntWritable(sum));
}
}
<span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span>(String[] args) <span class="hljs-keyword">throws</span> Exception {
Configuration conf = <span class="hljs-keyword">new</span> Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordCount.class);
job.setJobName(<span class="hljs-string">"WordCount"</span>);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, <span class="hljs-keyword">new</span> Path(args[<span class="hljs-number">0</span>]));
FileOutputFormat.setOutputPath(job, <span class="hljs-keyword">new</span> Path(args[<span class="hljs-number">1</span>]));
System.exit(job.waitForCompletion(<span class="hljs-keyword">true</span>) ? <span class="hljs-number">0</span> : <span class="hljs-number">1</span>);
}
}
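A common refinement of this example, not included in the code above, is to reuse the reducer as a combiner so that each mapper pre-aggregates its own (word, 1) pairs before the shuffle. That is safe here because the summation is associative and commutative. A sketch of the extra driver line, placed right after job.setReducerClass(Reduce.class):

// Optional: pre-aggregate (word, 1) pairs on the map side to reduce shuffle traffic.
job.setCombinerClass(Reduce.class);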
3. Output
aa 5
bb 2
cc 1
dd 2
ee 6
ff 1
kks 1
zks 2
zz 1
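With TextOutputFormat, this result typically appears in the output directory as a file such as part-r-00000, containing one tab-separated word/count pair per line.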
4. How the code works
Principle: each line of input is split into words. The map phase turns every word into a key/value pair, with the word as the key and 1 as the value. During shuffle, all values that share the same key are grouped into a list, and the reduce phase adds up the values in that list. The final output uses the word as the key and its total count as the value.
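For example, the input line "ff aa bb zks" produces the map output (ff,1), (aa,1), (bb,1), (zks,1). After shuffle, the five 1s emitted for "aa" across all lines are grouped into (aa, [1,1,1,1,1]), and reduce sums them to (aa, 5), which matches the output in section 3.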
Map class: turns each input line into (word, 1) pairs, e.g. (aa,1), (bb,1), and so on.
Input: LongWritable (the byte offset of the line), Text (the line itself)
Output: Text, IntWritable
Reduce class: aggregates the shuffled map output by summing the values for each key, e.g. it receives (ee, [1,1,1,1,1,1]) and writes (ee, 6).
Input: Text, Iterable<IntWritable>
Output: Text, IntWritable