User history data, as of 2013-12-15: 221 MB compressed, 878 MB uncompressed, 1206 small files in total, all in JSON format.
[{
"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387165034","commentCount":"6","content":"Raresmileyportrait(1977)","createTime":"1387130972","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww2.sinaimg.cn/thumbnail/69d3e27djw1ebkxp7rtczj20mo0mogmy.jpg"],"praiseCount":"5","reportCount":"70","source":"","userId":"1775493757","videourl":[],"weiboId":"3655954636173507","weiboUrl":"http://weibo.com/1775493757/AntDppU0H"}]
[{
"beCommentWeiboId":"","beForwardWeiboId":"3655954636173507","catchTime":"1387165034","commentCount":"29","content":"玲笑容!","createTime":"1387139090","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"72","reportCount":"61","source":"新浪微博","userId":"1719481457","videourl":[],"weiboId":"3655988685551869","weiboUrl":"http://weibo.com/1719481457/Anuwkniih"}]
[{
"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387165034","commentCount":"4","content":"lifeisbeautifulandallisaboutconfident&trust&friends&LOVE,thanksto@黄伟文,youmakemefeellikehongkongismagic&happiness.","createTime":"1387053188","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"8","reportCount":"8","source":"","userId":"1733190683","videourl":[],"weiboId":"3655628385727081","weiboUrl":"http://weibo.com/1733190683/Anl9co1Sh"}]
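Each record has 19 fields:
beCommentWeiboId: whether this is a comment
beForwardWeiboId: whether this is a repost (in the samples above it holds the weiboId of the reposted weibo)
catchTime: crawl time
commentCount: number of comments
content: content
createTime: creation time
info1: info field 1
info2: info field 2
info3: info field 3
mlevel: not sure
musicurl: music links
pic_list: picture list (may contain several)
praiseCount: number of likes
reportCount: number of reposts
source: data source
userId: user ID
videourl: video links
weiboId: weibo ID
weiboUrl: weibo URL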
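To make the field list concrete, here is a minimal Python sketch that parses the first sample record shown above and checks it against the 19 field names. It assumes that catchTime and createTime are Unix timestamps in seconds and that the three count fields always hold integer strings; the source does not state either point, and the helper name parse_record is only illustrative.

```python
import json
from datetime import datetime, timezone

# First sample record from the data above: a JSON array holding one object.
RAW = (
    '[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387165034",'
    '"commentCount":"6","content":"Raresmileyportrait(1977)",'
    '"createTime":"1387130972","info1":"","info2":"","info3":"","mlevel":"",'
    '"musicurl":[],"pic_list":["http://ww2.sinaimg.cn/thumbnail/'
    '69d3e27djw1ebkxp7rtczj20mo0mogmy.jpg"],"praiseCount":"5",'
    '"reportCount":"70","source":"","userId":"1775493757","videourl":[],'
    '"weiboId":"3655954636173507",'
    '"weiboUrl":"http://weibo.com/1775493757/AntDppU0H"}]'
)

EXPECTED_FIELDS = [
    "beCommentWeiboId", "beForwardWeiboId", "catchTime", "commentCount",
    "content", "createTime", "info1", "info2", "info3", "mlevel",
    "musicurl", "pic_list", "praiseCount", "reportCount", "source",
    "userId", "videourl", "weiboId", "weiboUrl",
]
COUNT_FIELDS = ("commentCount", "praiseCount", "reportCount")
TIME_FIELDS = ("catchTime", "createTime")


def parse_record(raw):
    """Parse one '[{...}]' string and return a dict with typed values."""
    record = json.loads(raw)[0]
    missing = set(EXPECTED_FIELDS) - set(record)
    if missing:
        raise ValueError("missing fields: %s" % sorted(missing))
    parsed = dict(record)
    # Counts are stored as strings in the raw data; convert them to int.
    for name in COUNT_FIELDS:
        parsed[name] = int(record[name])
    # Assumption: timestamps are Unix seconds; interpret them as UTC.
    for name in TIME_FIELDS:
        parsed[name] = datetime.fromtimestamp(int(record[name]), tz=timezone.utc)
    return parsed


if __name__ == "__main__":
    for key, value in parse_record(RAW).items():
        print(key, "=", value)
```

Every value in the raw data is a string except the three list fields (musicurl, pic_list, videourl), so any numeric or date handling has to convert explicitly, whether it is done in Python as here or later in Hive.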
HDFS location of the data: hdfs://hdp01:9000/data/weibo
When creating the table, create it as an external table.
[hdp01@hdp01 weibo]$ hdfs dfs -ls /data/weibo
Found 2 items
-rw-r--r-- 2 hdp01 supergroup 1004992 2020-01-11 16:17 /data/weibo/1387159770_1087770692_20100101000000_VCSvoMgPvrSTKhCkkIA7uMV9Hn10877706927159770ouss.json
-rw-r--r-- 2 hdp01 supergroup 680641 2020-01-11 16:17 /data/weibo/1387159770_1180721740_20100101000000_tBx94gQvEoOWTiB4n3gORSmS11807217407159771ouss.json
hive> set hive.exec.mode.local.auto=true;
--hive> set hive.cli.print.header=true;
hive> create database weibo;
hive> use weibo;
There are too many data files and they need to be merged. Propose a solution.
Answer: merge them with MapReduce.
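The idea behind the merge can be illustrated without a cluster. The following is a local Python sketch, not the MapReduce job named above: it concatenates whole files into a few larger parts and never splits a file across parts. The function name merge_small_files, the part-NNNNN.json naming, and the 128 MB default limit are assumptions chosen for illustration.

```python
import os


def merge_small_files(src_dir, dst_dir, max_bytes=128 * 1024 * 1024):
    """Concatenate the files in src_dir into parts of at most about max_bytes.

    Files are kept whole, so a single file larger than max_bytes
    still ends up in one part of its own.
    """
    os.makedirs(dst_dir, exist_ok=True)
    outputs = []
    out = None
    written = 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as src:
            data = src.read()
        # Open a new part when the current one would exceed the limit.
        if out is None or (written > 0 and written + len(data) > max_bytes):
            if out is not None:
                out.close()
            part = os.path.join(dst_dir, "part-%05d.json" % len(outputs))
            out = open(part, "wb")
            outputs.append(part)
            written = 0
        out.write(data)
        written += len(data)
        # Keep a line break between the contents of consecutive files.
        if not data.endswith(b"\n"):
            out.write(b"\n")
            written += 1
    if out is not None:
        out.close()
    return outputs
```

The merged parts could then be uploaded with hdfs dfs -put in place of the 1206 originals. On a cluster, a MapReduce job would apply the same grouping in a distributed way.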
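(Create the Hive table weibo_json(json string), which has a single column, load all the data into it, and verify by querying the first 5 rows.)
(Parse the JSON data in weibo_json into the weibo table, which has 19 columns, and write the necessary SQL statements.)
Create the weibo_json table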
hive> create external table if not exists weibo_json(
> json string)
> location "/data/weibo";
-- Because this is an external table whose location points to /data/weibo, the data can be queried as soon as the table is created
hive> select * from weibo_json limit 2;
OK
[{
"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"1419","content":"分享图片","createTime":"1386981067","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"],"praiseCount":"5265","reportCount":"1285","source":"iPad客户端","userId":