
The MapReduce Execution Flow, and How Hive Runs the Data Under the Hood for insert overwrite


Contents

MR Overview

MR Execution Flow

Input Stage

Mapper Stage

Reducer Stage

Example

insert overwrite table

Log Walkthrough

 


MR Overview

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a fully parallel manner. The MR framework sorts the outputs of the maps, which are then fed into the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to schedule tasks effectively on the nodes where the data already resides, resulting in very high aggregate bandwidth across the cluster.

MR Execution Flow

Input Stage

  • The JobClient specifies the storage location of the input files
  • The JobClient specifies the split logic through the InputFormat interface; by default the input is split along HDFS blocks, i.e., there are as many map tasks as there are data blocks
  • Hadoop then parses the files into <key, value> records
  • The JobTracker assigns each data block to a mapper, while a RecordReader reads the KV pairs
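The steps above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Hadoop code: `make_splits` cuts on raw byte boundaries (a real InputFormat also handles records that straddle split boundaries), and `record_reader` mimics TextInputFormat, where the key is the byte offset and the value is the line of text.

```python
def make_splits(data: bytes, block_size: int):
    """Cut the raw input into block-sized chunks, one per future map task."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def record_reader(split: bytes, base_offset: int):
    """Yield (byte offset, line) pairs from one split, like a RecordReader."""
    offset = base_offset
    for line in split.splitlines(keepends=True):
        yield offset, line.rstrip(b"\n").decode()
        offset += len(line)

data = b"hello world\nhello hadoop\n"
splits = make_splits(data, block_size=len(data))  # a single split here
records = list(record_reader(splits[0], 0))
print(records)  # [(0, 'hello world'), (12, 'hello hadoop')]
```

Each (offset, line) pair produced here is exactly the <key, value> input a mapper would see.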

Mapper Stage

  • The JobTracker dispatches map tasks to TaskTrackers for execution
  • After a mapper receives its data, it sorts and partitions records by key in Hadoop's in-memory buffer (100 MB by default); once the buffer fills up, the contents are spilled to disk
  • Records with different keys are partitioned, so records with the same key land in the same partition
  • Within each partition, records are sorted by key
  • Combiner stage (optional): on each map host, key/value pairs with the same key are combined locally, reducing the load on the reducers
  • The map output is written to the file system
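The map-side path above (partition by key, sort within each partition, optionally combine) can be sketched in plain Python. This is purely illustrative; the function names are made up, not Hadoop APIs, and a single in-memory pass stands in for the spill-to-disk mechanics.

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Hash partitioner: records with the same key go to the same partition.
    return hash(key) % num_reducers

def spill(map_output, num_reducers, combiner=None):
    """Partition, sort, and optionally combine one mapper's output."""
    partitions = defaultdict(list)
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    result = {}
    for p, records in partitions.items():
        records.sort(key=lambda kv: kv[0])      # sort within the partition
        if combiner is not None:                # optional local reduce
            grouped = defaultdict(list)
            for k, v in records:
                grouped[k].append(v)
            records = sorted((k, combiner(vs)) for k, vs in grouped.items())
        result[p] = records
    return result

# Word-count-style output of a single mapper:
out = spill([("hello", 1), ("hadoop", 1), ("hello", 1)],
            num_reducers=1, combiner=sum)
print(out[0])  # [('hadoop', 1), ('hello', 2)]
```

With the combiner enabled, the two ("hello", 1) records are merged on the map side, so the reducer receives one record instead of two.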

Reducer Stage

A Reducer has three main phases: shuffle, sort, and reduce

  • Shuffle: the reducer fetches the relevant partition of every mapper's output over HTTP
  • Grouping: the reducer groups records with the same key coming from different map outputs, since different mappers may emit the same key
  • Sort: in the reduce stage the framework merge-sorts the mapper outputs by key; this step is called sort

Note: shuffle and sort happen simultaneously; map outputs are merged as they are fetched.

  • Secondary Sort: if Job.setSortComparatorClass(Class) is used, the intermediate data is sorted with a custom comparator. Since Job.setGroupingComparatorClass(Class) can be used to control how intermediate keys are grouped, the two can be combined with composite keys to simulate a secondary sort on the values
  • Reduce: the reduce(WritableComparable, Iterable<Writable>, Context) method is called once per <key, (list of values)> group, and the final output is written to the file system
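The secondary-sort trick can be shown in plain Python. This is a hedged sketch of the idea only: sort on a composite (key, value) pair — the role of setSortComparatorClass — then group on the primary key alone — the role of setGroupingComparatorClass — so each reduce group sees its values already in order.

```python
from itertools import groupby
from operator import itemgetter

pairs = [("2019-10-10", 30), ("2019-10-09", 12), ("2019-10-10", 5)]

# "Sort comparator": order by (key, value) so values arrive sorted per key.
pairs.sort(key=lambda kv: (kv[0], kv[1]))

# "Grouping comparator": group by the key alone; values inside each group
# are already in ascending order thanks to the composite sort above.
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=itemgetter(0))}
print(grouped)  # {'2019-10-09': [12], '2019-10-10': [5, 30]}
```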

The number of reducers can be set with Job.setNumReduceTasks(int). // every partition index returned by the partitioner must be smaller than this number, otherwise the job fails
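The reduce side described above can be sketched as follows. This is an illustration, not Hadoop's implementation: each mapper contributes an already key-sorted partition, the streams are merge-sorted (which is why shuffle and sort overlap), and consecutive records with equal keys form one reduce call.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side(mapper_outputs, reducer):
    """mapper_outputs: one key-sorted list of (key, value) per mapper."""
    # Shuffle + sort: merge the already-sorted streams as they arrive.
    merged = heapq.merge(*mapper_outputs, key=itemgetter(0))
    # Grouping + reduce: each run of equal keys becomes one reduce call.
    return [(key, reducer([v for _, v in group]))
            for key, group in groupby(merged, key=itemgetter(0))]

# Two mappers emitted the key "hello"; the reducer sums the counts.
result = reduce_side(
    [[("hadoop", 1), ("hello", 1)], [("hello", 2)]],
    reducer=sum)
print(result)  # [('hadoop', 1), ('hello', 3)]
```

heapq.merge never materializes all records at once, which mirrors how a real reducer merges spilled map outputs streamingly.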

Example

Under the hood, a Hive query is executed as MapReduce jobs.

insert overwrite table

  1. 0: jdbc:hive2://hiveserver2.bigdata.chinatele> insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
  2. . . . . . . . . . . . . . . . . . . . . . . .> select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007;
  3. INFO : Compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
  4. select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
  5. INFO : Semantic Analysis Completed
  6. INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a.mdn, type:string, comment:null), FieldSchema(name:a.r_trmnl_brand, type:string, comment:null), FieldSchema(name:a.r_trmnl_model, type:string, comment:null), FieldSchema(name:a.r_use_day, type:string, comment:null), FieldSchema(name:a.d_trmnl_brand, type:string, comment:null), FieldSchema(name:a.d_trmnl_model, type:string, comment:null), FieldSchema(name:a.d_use_day, type:string, comment:null), FieldSchema(name:a.data_day, type:string, comment:null)], properties:null)
  7. INFO : Completed compiling command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 0.412 seconds
  8. INFO : Concurrency mode is disabled, not creating a lock manager
  9. INFO : Executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45): insert OVERWRITE table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d PARTITION (data_day)
  10. select * from sc_share_db.app_mbl_user_trmnl_trail_info_d a where a.data_day BETWEEN 20190619 AND 20191007
  11. INFO : Query ID = hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45
  12. INFO : Total jobs = 3
  13. INFO : Launching Job 1 out of 3
  14. INFO : Starting task [Stage-1:MAPRED] in serial mode
  15. INFO : Number of reduce tasks is set to 0 since there's no reduce operator
  16. INFO : number of splits:237
  17. INFO : Submitting tokens for job: job_1569295562481_2677748
  18. INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns4, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407251, maxDate=1571283207251, sequenceNumber=99889585, masterKeyId=889)
  19. INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns3, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407264, maxDate=1571283207264, sequenceNumber=100362646, masterKeyId=873)
  20. INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678406757, maxDate=1571283206757, sequenceNumber=381027444, masterKeyId=1165)
  21. INFO : Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2ImpersonationToken, Ident: 00 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 16 6a 74 5f 6a 74 73 6a 7a 78 73 6a 79 79 63 5f 73 63 5f 66 77 66 7a 3f 68 69 76 65 2f 68 69 76 65 73 65 72 76 65 72 32 2e 62 69 67 64 61 74 61 2e 63 68 69 6e 61 74 65 6c 65 63 6f 6d 2e 63 6e 40 48 41 44 4f 4f 50 2e 43 48 49 4e 41 54 45 4c 45 43 4f 4d 2e 43 4e 8a 01 6d b3 a7 a2 5e 8a 01 6d d7 b4 26 5e 8e 77 aa 8e 19 30
  22. INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ns2, Ident: (token for jt_jtsjzxsjyyc_sc_fwfz: HDFS_DELEGATION_TOKEN owner=jt_jtsjzxsjyyc_sc_fwfz, renewer=yarn, realUser=hive/hiveserver2.bigdata.chinatelecom.cn@HADOOP.CHINATELECOM.CN, issueDate=1570678407250, maxDate=1571283207250, sequenceNumber=110691977, masterKeyId=871)
  23. INFO : The url to track the job: http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
  24. INFO : Starting Job = job_1569295562481_2677748, Tracking URL = http://NM-304-RH5885V3-BIGDATA-008:8088/proxy/application_1569295562481_2677748/
  25. INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1569295562481_2677748
  26. INFO : Hadoop job information for Stage-1: number of mappers: 237; number of reducers: 0
  27. INFO : 2019-10-10 11:35:34,427 Stage-1 map = 0%, reduce = 0%
  28. INFO : 2019-10-10 11:36:18,032 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 10.65 sec
  29. INFO : 2019-10-10 11:36:19,080 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 96.18 sec
  30. INFO : 2019-10-10 11:36:20,131 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 224.8 sec
  31. INFO : 2019-10-10 11:36:21,181 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 342.52 sec
  32. INFO : 2019-10-10 11:36:22,243 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 444.37 sec
  33. INFO : 2019-10-10 11:36:23,304 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 1014.68 sec
  34. INFO : 2019-10-10 11:36:24,582 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 1638.8 sec
  35. ================================ (some lines omitted)
  36. INFO : 2019-10-10 11:36:37,280 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 2181.11 sec
  37. INFO : 2019-10-10 11:36:48,674 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 2209.67 sec
  38. INFO : 2019-10-10 11:37:01,292 Stage-1 map = 90%, reduce = 0%, Cumulative CPU 2299.86 sec
  39. INFO : 2019-10-10 11:37:07,494 Stage-1 map = 92%, reduce = 0%, Cumulative CPU 2322.2 sec
  40. INFO : 2019-10-10 11:37:12,849 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 2335.47 sec
  41. INFO : 2019-10-10 11:37:13,886 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 2363.0 sec
  42. INFO : 2019-10-10 11:37:14,922 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 2372.74 sec
  43. INFO : 2019-10-10 11:39:16,852 Stage-1 map = 99%, reduce = 0%, Cumulative CPU 2386.92 sec
  44. INFO : 2019-10-10 11:46:52,457 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2398.68 sec
  45. INFO : MapReduce Total cumulative CPU time: 39 minutes 58 seconds 680 msec
  46. INFO : Ended Job = job_1569295562481_2677748
  47. INFO : Starting task [Stage-7:CONDITIONAL] in serial mode
  48. INFO : Stage-4 is selected by condition resolver.
  49. INFO : Stage-3 is filtered out by condition resolver.
  50. INFO : Stage-5 is filtered out by condition resolver.
  51. INFO : Starting task [Stage-4:MOVE] in serial mode
  52. INFO : Moving data to: viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000 from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10002
  53. INFO : Starting task [Stage-0:MOVE] in serial mode
  54. INFO : Loading data to table jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d partition (data_day=null) from viewfs://ctccfs/user/hive_tmp/.hive-staging_hive_2019-10-10_11-32-41_413_3936369249909689263-4162/-ext-10000
  55. INFO : Time taken for load dynamic partitions : 37085
  56. INFO : Loading partition {data_day=20190915}
  57. INFO : Loading partition {data_day=20190624}
  58. INFO : Loading partition {data_day=20190906}
  59. INFO : Loading partition {data_day=20190902}
  60. INFO : Loading partition {data_day=20190909}
  61. ================================ (some lines omitted)
  62. INFO : Loading partition {data_day=20190901}
  63. INFO : Loading partition {data_day=20190916}
  64. INFO : Loading partition {data_day=20190908}
  65. INFO : Loading partition {data_day=20190723}
  66. INFO : Time taken for adding to write entity : 12
  67. INFO : Starting task [Stage-2:STATS] in serial mode
  68. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191001} stats: [numFiles=1, numRows=106817, totalSize=5047510, rawDataSize=4940693]
  69. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191002} stats: [numFiles=1, numRows=142186, totalSize=7349564, rawDataSize=7207378]
  70. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191003} stats: [numFiles=1, numRows=146760, totalSize=7585261, rawDataSize=7438501]
  71. ================================ (some lines omitted)
  72. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191004} stats: [numFiles=1, numRows=115010, totalSize=5880787, rawDataSize=5765777]
  73. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191005} stats: [numFiles=1, numRows=128308, totalSize=6711669, rawDataSize=6583361]
  74. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191006} stats: [numFiles=1, numRows=104644, totalSize=5418150, rawDataSize=5313506]
  75. INFO : Partition jt_jtsjzxsjyyc_sc_fwfz.app_mbl_user_trmnl_trail_info_d{data_day=20191007} stats: [numFiles=1, numRows=89627, totalSize=4577004, rawDataSize=4487377]
  76. INFO : MapReduce Jobs Launched:
  77. INFO : Stage-Stage-1: Map: 237 Cumulative CPU: 2398.68 sec HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
  78. INFO : Total MapReduce CPU Time Spent: 39 minutes 58 seconds 680 msec
  79. INFO : Completed executing command(queryId=hive_20191010113232_c82e18e3-5853-43c5-9856-d1d1c55dde45); Time taken: 1008.946 seconds
  80. INFO : OK
  81. No rows affected (1009.374 seconds)

Log Walkthrough

  1. Compilation of the insert command begins, producing a queryId
  2. The HiveQL is analyzed; semantic analysis completes
  3. The Hive schema is returned, containing FieldSchema entries (field name, type, comment) and properties
  4. Compilation completes; the queryId (same as in step 1) and the compile time in seconds are reported
  5. Info: concurrency mode is disabled, so no lock manager is created
  6. The insert command starts executing
  7. Info: the queryId
  8. 3 jobs in total
  9. Job 1 is launched
  10. Task [Stage-1:MAPRED] starts in serial mode
  11. The number of reduce tasks is set to 0 because there is no reduce operator
  12. Number of splits: 237
  13. Delegation tokens are submitted for job_1569295562481_2677748
  14. ...map
  15. The job's tracking URL on YARN: http://host:port/proxy/application_1569295562481_2677748/
  16. The job starts: Job = job_jobName, Tracking URL = http://host:port/proxy/application_1569295562481_2677748/
  17. The kill command: /usr/lib/hadoop/bin/hadoop job -kill job_jobName
  18. Hadoop job information for Stage-1: 237 mappers, 0 reducers
  19. Map progress reports, each containing (map=X%, reduce=X%, cumulative CPU xx sec)
  20. Total cumulative MapReduce CPU time: 39 minutes 58 seconds 680 msec
  21. Info: Job = job_Name has ended
  22. Task [Stage-7:CONDITIONAL] starts in serial mode
  23. Stage-4 is selected by the condition resolver
  24. Stage-3 is filtered out by the condition resolver
  25. Stage-5 is filtered out by the condition resolver
  26. Task [Stage-4:MOVE] starts in serial mode
  27. Data is moved from the staging directory to the result directory
  28. Task [Stage-0:MOVE] starts in serial mode
  29. The data in HDFS is loaded into the Hive table
  30. Time taken to load the dynamic partitions: 37085
  31. Loading......
  32. Time taken for adding to write entity: 12
  33. Info: partition statistics are reported for each partition, e.g. (db.table{data_day=20190620}, stats: [numFiles=1, numRows=222592, totalSize=11316154, rawDataSize=11093562]) — preceded by task [Stage-2:STATS] starting in serial mode
  34. Info: partition statistics, as above
  35. The MapReduce jobs have finished
  36. Stage-Stage-1: Map: 237   Cumulative CPU: 2398.68 sec   HDFS Read: 21622480200 HDFS Write: 459476088 SUCCESS
  37. Total MapReduce CPU time spent: 39 minutes 58 seconds 680 msec
  38. Command execution completes; time taken: 1008.946 seconds
  39. No rows affected (total elapsed: 1009.374 seconds)