| Approach | Pros | Cons | Typical cases |
| --- | --- | --- | --- |
| Hand-written MapReduce jobs | Somewhat better performance than Hive and Pig | High development effort | 1) Search-engine page processing, PageRank computation (Google); 2) classic ETL (full-table scans); 3) machine learning: clustering, classification, recommendation (Baidu Ecomm) |
| SQL-based analysis with Hive | SQL is already second nature to data analysts | Slower than hand-written MR in some scenarios | 1) Access-log processing / online advertising (Yahoo, Facebook, Hulu, Amazon); 2) e-commerce (Taobao's Yunti) |
| Data analysis with Pig | Dataflow-style scripting | Pig Latin is not widely known; slower than hand-written MR in some scenarios | Statistics and machine learning (Yahoo, Twitter) |
| Systems built on HBase | Can deliver near-real-time statistics and analysis | No open-source implementation at the time; high development cost | Mostly in-house systems, e.g. Google's Percolator, Taobao's Prom |
Hive environment variables (appended to the shell profile):

HIVE_HOME=/usr/local/hive
PATH=$PATH:$PIG_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
Plan A: write a MapReduce program to analyze the logs (hits per day, hits per hour, hits per IP, hits per domain). Sample records from the input log:

196.13.32.71 - - [09/Aug/1995:03:21:17 -0400] "GET /icons/image.xbm HTTP/1.0" 200 509
pc-128-78.ntc.nokia.com - - [09/Aug/1995:03:22:01 -0400] "GET /shuttle/technology/sts-newsref/ HTTP/1.0" 200 16376
gate.germany.eu.net - - [09/Aug/1995:03:22:02 -0400] "GET /ksc.html HTTP/1.0" 200 7131
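The LogParser helper used by the job below is not shown in the post. As a rough sketch of what such a parser might do, a Common Log Format record like the lines above can be split with a regular expression (the class name ClfParseSketch, the regex, and the returned field order are illustrative assumptions, not the original LogParser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClfParseSketch {
    // Common Log Format: host ident authuser [dd/Mon/yyyy:HH:mm:ss zone] "method url proto" status size
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[(\\d{2})/(\\w{3})/(\\d{4}):(\\d{2}):\\d{2}:\\d{2} [^\\]]+\\] \"\\S+ (\\S+) [^\"]*\" (\\d{3}) (\\S+)$");

    // Returns {ip, day, month, year, hour, url, status, size}, or null if the line does not match
    public static String[] parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4),
                              m.group(5), m.group(6), m.group(7), m.group(8) };
    }

    public static void main(String[] args) {
        String[] f = parse("gate.germany.eu.net - - [09/Aug/1995:03:22:02 -0400] \"GET /ksc.html HTTP/1.0\" 200 7131");
        // Prints the tab-separated form the job below writes for Hive:
        // gate.germany.eu.net	09	Aug	1995	03	/ksc.html	200	7131
        System.out.println(String.join("\t", f));
    }
}
```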
Note: this is a map-only job; no reducer is needed. The map key is NullWritable, so only the value is emitted.

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogHive extends Configured implements Tool {
    private static final String INPUT_PATH = "hdfs://192.168.56.103:9000/feixu/logInput1";
    private static final String OUTPUT_PATH = "hdfs://192.168.56.103:9000/feixu/logOutput2";

    public static class LogHiveMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Re-emit each log record as tab-separated fields, ready for Hive to load
            LogParser parser = LogParser.parser(value.toString());
            Text out = new Text(parser.getIp() + "\t" + parser.getYear() + "\t" + parser.getMonth() + "\t"
                    + parser.getDay() + "\t" + parser.getHour() + "\t" + parser.getUrl() + "\t"
                    + parser.getStatus() + "\t" + parser.getSize());
            context.write(NullWritable.get(), out);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(LogHive.class);   // was HitOfDay.class, which does not match this driver
        job.setJobName("Log Hive Job");
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        job.setMapperClass(LogHiveMapper.class);
        job.setNumReduceTasks(0);           // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // waitForCompletion() submits the job itself; calling submit() first would submit it twice
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new LogHive(), args);
        System.exit(exitCode);
    }
}
To analyze the number of hits on each day of the month (1–31), create the table:
CREATE TABLE log_history(ip STRING, year INT, month STRING, day INT, hour INT, url STRING, status INT, size INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load the data:
LOAD DATA INPATH '/usr/local/log_history' OVERWRITE INTO TABLE log_history;
hive> select day, count(day) from log_history group by day;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201312030304_0006, Tracking URL = http://feixu-master:50030/jobdetails.jsp?jobid=job_201312030304_0006
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -kill job_201312030304_0006
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2013-12-04 10:43:13,586 Stage-1 map = 0%, reduce = 0%
2013-12-04 10:43:26,703 Stage-1 map = 12%, reduce = 0%
2013-12-04 10:43:28,728 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:29,743 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:30,759 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:31,777 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:32,797 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:33,816 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:34,831 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 2.53 sec
2013-12-04 10:43:35,864 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:36,879 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:37,890 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:38,904 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:39,919 Stage-1 map = 88%, reduce = 0%, Cumulative CPU 9.47 sec
2013-12-04 10:43:40,943 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:41,959 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:42,975 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.49 sec
2013-12-04 10:43:43,989 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 12.49 sec
2013-12-04 10:43:45,002 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:46,019 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:47,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
2013-12-04 10:43:48,046 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.7 sec
MapReduce Total cumulative CPU time: 13 seconds 700 msec
Ended Job = job_201312030304_0006
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 13.7 sec HDFS Read: 303841163 HDFS Write: 288 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 700 msec
OK
1 98710
2 60265
3 130972
4 130009
5 126468
6 133380
7 144595
8 99024
9 95730
10 134108
11 141653
12 130607
13 170683
14 143981
15 104379
16 104507
17 133969
18 120528
19 104832
20 99556
21 120169
22 93029
23 97296
24 116811
25 120020
26 90457
27 94503
28 82617
29 67988
30 80641
31 90125
Time taken: 42.899 seconds, Fetched: 31 row(s)
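The other counts listed in Plan A fall out of the same table with one GROUP BY each. For example (straightforward variations on the query above, assuming the log_history schema; not run in the original post):

```sql
-- Hits per hour of day
SELECT hour, COUNT(*) FROM log_history GROUP BY hour;

-- Top 10 client IPs by hit count
SELECT ip, COUNT(*) AS hits FROM log_history GROUP BY ip ORDER BY hits DESC LIMIT 10;
```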