Hadoop Getting-Started Example (1): WordCount

1. Requirements

In big data work you often need to count how many times each word occurs, for example to answer top-n queries. Below is an introductory Hadoop example that counts the words in one or more text files.

1.1 Input

The input is one or more text files. The test content is as follows:

aa bb cc aa aa aa dd dd ee ee ee ee 
ff aa bb zks
ee kks
ee  zz zks

1.2 Output

Count the words in the text and write each word's count to the specified output path.

2. The code

package com.myhadoop.mapreduce.test;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.mapreduce.*;

public class WordCount {

    // Mapper: splits each input line into words and emits (word, 1) for every token.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s grouped under each word and emits (word, total).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("WordCount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] = input path, args[1] = output path (the output path must not already exist)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


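One typical way to run the packaged job looks like the following. The jar name and HDFS paths here are assumptions for illustration, not from the original post; adjust them to your environment.

```shell
# Assumed jar name and HDFS paths -- adjust to your cluster.
hadoop jar wordcount.jar com.myhadoop.mapreduce.test.WordCount \
    /input/words /output/wordcount

# Inspect the result written by the single reducer.
hdfs dfs -cat /output/wordcount/part-r-00000
```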

3. Output

aa  5
bb  2
cc  1
dd  2
ee  6
ff  1
kks 1
zks 2
zz  1

4. How the code works

Principle: each line of text is split into words. For every word the mapper emits a (key, value) pair with the word as the key and 1 as the value. The shuffle phase gathers all values that share a key into one list, and the reduce phase sums each list, so the final output has the word as the key and its total count as the value.
Map class: turns each line into (word, 1) pairs, e.g. (aa,1), (bb,1), ...
Input: LongWritable, Text
Output: Text, IntWritable
Reduce class: aggregates the mapper's output after the shuffle.
Input: Text, Iterable<IntWritable>
Output: Text, IntWritable
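The map -> shuffle -> reduce flow described above can be sketched locally without a Hadoop cluster. This is an illustrative simulation only (the class and method names are my own, not from the post); it reproduces the same three phases on the sample input and yields the counts shown in section 3.

```java
import java.util.*;

// Local, Hadoop-free sketch of the map -> shuffle -> reduce flow.
public class WordCountLocal {

    public static Map<String, Integer> count(String[] lines) {
        // "Map" phase: emit a (word, 1) pair for every token.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            while (tokenizer.hasMoreTokens()) {
                pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
            }
        }
        // "Shuffle" phase: group the 1s under their key, sorted by key
        // (TreeMap mirrors the sorted order of the job's output).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // "Reduce" phase: sum each key's list of values.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {
            "aa bb cc aa aa aa dd dd ee ee ee ee",
            "ff aa bb zks",
            "ee kks",
            "ee  zz zks"
        };
        for (Map.Entry<String, Integer> e : count(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running it prints the same table as section 3, which is a quick way to sanity-check the counting logic before submitting the real job.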


This article is copyright Bruce's Blog. When reposting or quoting, please include the following information in full:
Author: Bruce
Source: Hadoop Getting-Started Example (1): WordCount | Bruce's Blog
