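Fetch conversion means that for certain queries Hive does not need to run a MapReduce job. For example, for a simple select * from XX, Hive only has to read the files under the table's storage directory and return them.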
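In hive-default.xml.template, hive.fetch.task.conversion defaults to more; older versions defaulted to minimal.
When the property is set to more, full-table selects, selects on specific columns, and limit queries all skip MapReduce.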
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>
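The three values in the description above can be read as a small decision rule. The following Python sketch is a toy model of that description text only, not Hive's actual optimizer: the QueryShape class, its field names, and can_use_fetch_task are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryShape:
    """Hypothetical, coarse description of a single SELECT statement."""
    select_star: bool = True                       # SELECT * rather than a column list
    has_filter: bool = False                       # query has a WHERE clause
    filter_on_partition_columns_only: bool = True  # only meaningful when has_filter is True
    has_aggregation_or_distinct: bool = False      # these incur a reduce sink (RS)
    has_join: bool = False
    has_subquery: bool = False
    has_lateral_view: bool = False


def can_use_fetch_task(mode: str, q: QueryShape) -> bool:
    """Toy reading of the hive.fetch.task.conversion description (assumption-based)."""
    if mode not in ("none", "minimal", "more"):
        raise ValueError("expects one of [none, minimal, more]")
    if mode == "none":
        return False
    # Per the description: single-sourced, no subquery, aggregation, distinct,
    # lateral view, or join.
    if (q.has_aggregation_or_distinct or q.has_join
            or q.has_subquery or q.has_lateral_view):
        return False
    if mode == "minimal":
        # SELECT STAR, FILTER on partition columns, LIMIT only.
        filter_ok = (not q.has_filter) or q.filter_on_partition_columns_only
        return q.select_star and filter_ok
    # mode == "more": SELECT, FILTER, LIMIT only.
    return True


star = QueryShape()  # models: select * from emp
cols = QueryShape(select_star=False, has_filter=True,
                  filter_on_partition_columns_only=False)
for mode in ("none", "minimal", "more"):
    print(mode, can_use_fetch_task(mode, star), can_use_fetch_task(mode, cols))
```

Under this model, a plain select * qualifies for a fetch task in both minimal and more, while a column select with a filter on a non-partition column qualifies only in more.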
Let's compare the two settings:
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710194508_8561a541-492c-48f4-a381-95f041e276c8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1594379941865_0001, Tracking URL = http://master:18088/proxy/application_1594379941865_0001/
Kill Command = /usr/hdk/hadoop/bin/hadoop job -kill job_1594379941865_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-10 19:45:29,124 Stage-1 map = 0%, reduce = 0%
2020-07-10 19:45:44,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.05 sec
MapReduce Total cumulative CPU time: 3 seconds 50 msec
Ended Job = job_1594379941865_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.05 sec HDFS Read: 5489 HDFS Write: 916 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 37.739 seconds, Fetched: 14 row(s)
This is the result when the query goes through MapReduce: it took about 37.7 seconds.
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp ;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.289 seconds, Fetched: 14 row(s)
With fetch conversion set to more, the same query takes only about 0.29 seconds.
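When working with small datasets, we can use local mode to run every task on a single machine, which noticeably shortens execution time.
Local mode requires the input to be smaller than 128 MB (one standard block size) and to consist of fewer than 4 files; both limits are configurable.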
hive (default)> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> set hive.exec.mode.local.auto.inputbytes.max;
hive.exec.mode.local.auto.inputbytes.max=134217728
hive (default)> set hive.exec.mode.local.auto.input.files.max;
hive.exec.mode.local.auto.input.files.max=4
hive (default)> select * from emp cluster by deptno;
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710210745_10f9202c-678f-4ddb-aa2b-555551d11e1d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-10 21:07:46,989 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local897334446_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 10848 HDFS Write: 6412 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 1.559 seconds, Fetched: 14 row(s)
hive (default)> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive (default)> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more
hive (default)> select * from emp cluster by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710210943_088e87fe-968d-4c11-868f-267a89bb5fef
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1594379941865_0003, Tracking URL = http://master:18088/proxy/application_1594379941865_0003/
Kill Command = /usr/hdk/hadoop/bin/hadoop job -kill job_1594379941865_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-10 21:09:53,706 Stage-1 map = 0%, reduce = 0%
2020-07-10 21:10:03,663 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.74 sec
2020-07-10 21:10:13,125 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.11 sec
MapReduce Total cumulative CPU time: 5 seconds 110 msec
Ended Job = job_1594379941865_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.11 sec HDFS Read: 10306 HDFS Write: 916 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 110 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 30.785 seconds, Fetched: 14 row(s)
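Without local mode, the same query takes about 30.8 seconds, compared with about 1.6 seconds when local mode is selected automatically.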
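Automatic local mode works together with hive.exec.mode.local.auto.inputbytes.max and hive.exec.mode.local.auto.input.files.max, which set the limits on total input size and on the number of input files.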
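How the switch and the two thresholds combine can be sketched as a simple eligibility check. This is a minimal Python model built from the values shown in the session above, not Hive's implementation: LocalModeSettings and would_pick_local_mode are hypothetical names, and the strict "less than" comparisons follow this article's wording, so Hive's exact behavior at the boundary may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalModeSettings:
    """Defaults copied from the `set` output shown in the session above."""
    auto: bool = False                # hive.exec.mode.local.auto
    inputbytes_max: int = 134217728   # hive.exec.mode.local.auto.inputbytes.max (128 MB)
    input_files_max: int = 4          # hive.exec.mode.local.auto.input.files.max


def would_pick_local_mode(settings: LocalModeSettings,
                          total_input_bytes: int,
                          input_file_count: int) -> bool:
    """Return True when this toy model would run the query in local mode."""
    if not settings.auto:
        # Automatic local mode is switched off, so the job goes to the cluster.
        return False
    # Assumption: both limits are treated as strict upper bounds.
    return (total_input_bytes < settings.inputbytes_max
            and input_file_count < settings.input_files_max)


enabled = LocalModeSettings(auto=True)
# Hypothetical inputs: a small single-file table and a 200 MB table.
print(would_pick_local_mode(enabled, 10_000, 1))
print(would_pick_local_mode(enabled, 200 * 1024 * 1024, 1))
print(would_pick_local_mode(LocalModeSettings(), 10_000, 1))
```

In this model a small single-file table is eligible once hive.exec.mode.local.auto is true, while a 200 MB input, or any input with the switch left at false, is sent to the cluster.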