赞
踩
在SQL中我们可以使用in、exists语句来判断对象是否存在某个子查询中,在pg中除了支持这两种语法,还支持= any的语法,例如下面三个SQL的意思就是一样的:
select * from tbl where id in (select id from t);
select * from tbl where exists (select 1 from t where t.id=tbl.id);
select * from tbl where id = any (array( select id from t ));
但是不同的写法,性能还是有一定的差距。可能对in、exists这两种写法比较熟悉的人都知道这么一个结论:IN适合于外表大而内表小的情况;EXISTS适合于外表小而内表大的情况。
那么在pg中是不是这样呢?这两种写法和= any的写法又有什么区别呢?
例子:
1、创建测试表
bill@bill=> create table t(id int);
CREATE TABLE
bill@bill=> insert into t select generate_series(1,100*10000);
INSERT 0 1000000
2、三者执行计划对比
可以发现,= any是InitPlan,而另外两种写法是SubPlan。
bill@bill=>explain select n = any(array(select id from test)) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------- Function Scan on generate_series n (cost=14425.00..14650.00 rows=10000 width=1) InitPlan 1 (returns $0) -> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (3 rows) bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------- Function Scan on generate_series n (cost=0.00..129160125.00 rows=10000 width=1) SubPlan 1 -> Materialize (cost=0.00..23332.00 rows=1000000 width=4) -> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (4 rows) bill@bill=>explain select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------- Function Scan on generate_series n (cost=0.00..169250100.00 rows=10000 width=1) SubPlan 1 -> Seq Scan on test (cost=0.00..16925.00 rows=1 width=0) Filter: (id = n.n) SubPlan 2 -> Seq Scan on test test_1 (cost=0.00..14425.00 rows=1000000 width=4) (6 rows)
3、对于IN的写法,work_mem参数会直接影响性能,work_mem的大小决定了subquery是否要装载到hash table。
bill@bill=>set work_mem ='1MB'; SET bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------- Function Scan on generate_series n (cost=0.00..129160125.00 rows=10000 width=1) SubPlan 1 -> Materialize (cost=0.00..23332.00 rows=1000000 width=4) -> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (4 rows) bill@bill=>set work_mem ='100MB'; SET bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------- Function Scan on generate_series n (cost=16925.00..17050.00 rows=10000 width=1) SubPlan 1 -> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4) (3 rows)
对于三种写法优化器实现的差异,在代码里面有相关介绍,详见:backend/optimizer/plan/subselect.c
190 * For an EXISTS subplan, tell lower-level planner to expect that only the 191 * first tuple will be retrieved. For ALL and ANY subplans, we will be 192 * able to stop evaluating if the test condition fails or matches, so very 193 * often not all the tuples will be retrieved; for lack of a better idea, 194 * specify 50% retrieval. For EXPR, MULTIEXPR, and ROWCOMPARE subplans, 195 * use default behavior (we're only expecting one row out, anyway). 196 * 197 * NOTE: if you change these numbers, also change cost_subplan() in 198 * path/costsize.c. 199 * 200 * XXX If an ANY subplan is uncorrelated, build_subplan may decide to hash 201 * its output. In that case it would've been better to specify full 202 * retrieval. At present, however, we can only check hashability after 203 * we've made the subplan :-(. (Determining whether it'll fit in work_mem 204 * is the really hard part.) Therefore, we don't want to be too 205 * optimistic about the percentage of tuples retrieved, for fear of 206 * selecting a plan that's bad for the materialization case. 207 */
in、exists、= any性能对比
1、in, work_mem装不下subquery
bill@bill=>set work_mem ='1MB';
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;
QUERY PLAN
----------------------------------------------------------------------------------
Function Scan on generate_series n (cost=0.00..129160125.00 rows=10000 width=1)
SubPlan 1
-> Materialize (cost=0.00..23332.00 rows=1000000 width=4)
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4)
(4 rows)
2、in, work_mem装的下subquery
bill@bill=>set work_mem ='100MB';
SET
bill@bill=>explain select n in (select id from test) from generate_series(1,10000) as n;
QUERY PLAN
----------------------------------------------------------------------------------
Function Scan on generate_series n (cost=16925.00..17050.00 rows=10000 width=1)
SubPlan 1
-> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4)
(3 rows)
3、= any
这种写法work_mem很小也无所谓,因为不涉及hashtable
bill@bill=>set work_mem ='64kB'; SET bill@bill=>explain (analyze,verbose,timing,costs,buffers) select n = any(array(select id from test)) from generate_series(1,10000) as n; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_catalog.generate_series n (cost=14425.00..14650.00 rows=10000 width=1) (actual time=188.203..529.799 rows=10000 loops=1) Output: (n.n = ANY ($0)) Function Call: generate_series(1, 10000) Buffers: shared hit=4425, temp read=18 written=18 InitPlan 1 (returns $0) -> Seq Scan on bill.test (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..97.379 rows=1000000 loops=1) Output: test.id Buffers: shared hit=4425 Planning Time: 0.068 ms Execution Time: 531.720 ms (10 rows)
4、exists
work_mem需求量较少(exists由于优化器在匹配到1条后即刻返回,所以会选择使用索引,性能就非常好。)
bill@bill=>create index idx_test1 on test(id); CREATE INDEX bill@bill=>set work_mem ='64KB'; bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_catalog.generate_series n (cost=0.00..30525.00 rows=10000 width=1) (actual time=2.840..35.937 rows=10000 loops=1) Output: (SubPlan 1) Function Call: generate_series(1, 10000) Buffers: shared hit=39970 read=30, temp read=18 written=18 SubPlan 1 -> Index Only Scan using idx_test1 on bill.test (cost=0.42..3.04 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=10000) Index Cond: (test.id = n.n) Heap Fetches: 10000 Buffers: shared hit=39970 read=30 Planning Time: 0.170 ms Execution Time: 37.032 ms (11 rows) bill@bill=>set work_mem ='64MB'; SET bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_catalog.generate_series n (cost=0.00..30525.00 rows=10000 width=1) (actual time=430.974..434.543 rows=10000 loops=1) Output: (alternatives: SubPlan 1 or hashed SubPlan 2) Function Call: generate_series(1, 10000) Buffers: shared hit=4425 SubPlan 1 -> Index Only Scan using idx_test1 on bill.test (cost=0.42..3.04 rows=1 width=0) (never executed) Index Cond: (test.id = n.n) Heap Fetches: 0 SubPlan 2 -> Seq Scan on bill.test test_1 (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.023..99.501 rows=1000000 loops=1) Output: test_1.id Buffers: shared hit=4425 Planning Time: 0.115 ms Execution Time: 436.932 ms (14 rows)
5、exists没有索引
如果删掉索引,exists性能就会下降了,同时性能也和是否使用哈希表有关。
bill@bill=>drop index idx_test1 ; DROP INDEX bill@bill=>set work_mem ='64kB'; SET bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_catalog.generate_series n (cost=0.00..169250100.00 rows=10000 width=1) (actual time=1.140..3917.927 rows=10000 loops=1) Output: (SubPlan 1) Function Call: generate_series(1, 10000) Buffers: shared hit=226260, temp read=18 written=18 SubPlan 1 -> Seq Scan on bill.test (cost=0.00..16925.00 rows=1 width=0) (actual time=0.391..0.391 rows=1 loops=10000) Filter: (test.id = n.n) Rows Removed by Filter: 5000 Buffers: shared hit=226260 Planning Time: 0.127 ms Execution Time: 3919.068 ms (11 rows) bill@bill=>set work_mem ='64MB'; SET bill@bill=>explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from test where test.id=n.n) from generate_series(1,10000) as n; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_catalog.generate_series n (cost=0.00..169250100.00 rows=10000 width=1) (actual time=430.988..434.781 rows=10000 loops=1) Output: (alternatives: SubPlan 1 or hashed SubPlan 2) Function Call: generate_series(1, 10000) Buffers: shared hit=4425 SubPlan 1 -> Seq Scan on bill.test (cost=0.00..16925.00 rows=1 width=0) (never executed) Filter: (test.id = n.n) SubPlan 2 -> Seq Scan on bill.test test_1 (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.022..100.982 rows=1000000 loops=1) Output: test_1.id Buffers: shared hit=4425 Planning Time: 0.086 ms Execution Time: 437.278 ms (13 rows)
总结:
1、= any这种写法,不会走subplan,因此不涉及hash table的问题。和work_mem设置大小无关。因此在pg中十分推荐使用这种写法。
2、exists,由于优化器会默认它只需要搜索到1条命中目标就不搜了,所以优化器评估是否使用hash table时,需要的内存相对较少,即使较小的work_mem也可能使用hashtable。
3、in ,当出现在subquery中时,优化器评估这个subquery是否要构建哈希TABLE,直接和subquery的大小相关,所以需要较大的work_mem才会选择使用hashtable。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。