导航：首页 > 数据库 >

PostgreSQL JOIN limit 优化器成本计算改进 - mergejoin startup cost 优化

发表于：2025-11-10 作者：千家信息网编辑

千家信息网最后更新 2025年11月10日，背景PostgreSQL limit N的成本估算，是通过计算总成本A，以及估算得到的总记录数B得到：(N/B)*A 大概意思就是占比的方法计算对于单表查询，这种方法通常来说比较适用，但是如果数

千家信息网最后更新 2025年11月10日PostgreSQL JOIN limit 优化器成本计算改进 - mergejoin startup cost 优化

背景

PostgreSQL limit N的成本估算，是通过计算总成本A，以及估算得到的总记录数B得到：

(N/B)*A    大概意思就是占比的方法计算

对于单表查询，这种方法通常来说比较适用，但是如果数据分布有倾斜，实际上也并不一定适用，例如以下两种情况：

1、符合条件的数据占总记录数的50%，但是全部分布在表的末尾，那么limit 10000 条到底是走索引快还是走全表扫描快呢？

2、符合条件的数据占总记录数的50%，全部分布在表的头部，那么LIMIT 10000 条，肯定是全表扫描快了。

对于JOIN的情况，同样有类似的问题：

比如JOIN并且带条件时，LIMIT N，是走嵌套循环快，还是走MERGE 或 HASH JOIN快？

1、嵌套循环+where+LIMIT的成本计算方法，可以使用LIMIT占总估算记录数占比的方法得到，还算是比较合理。

2、MERGE JOIN+where+LIMIT的成本计算方法，必须考虑启动成本，例如WHERE条件在A表上（可以走索引直接从条件位置开始扫描），B表则需要从索引的开头开始扫描（到与A表的索引匹配时，也许需要扫描很多的索引ENTRY，这个启动成本可能会很高），启动成本，加上LIMIT条数在剩余的所有成本中的一个占比，得到的成本是一个比较合理的成本。

3、hash join+where+limit的成本计算方法，使用启动成本+LIMIT占总估算记录数占比的方法得到，优化器目前就是这么做的，比较合理。

然而，对于MERGE JOIN，目前在使用LIMIT时，PG没有加上这个启动成本，使得最后得到的执行计划可能不准确。

改进方法建议可以加入merge join启动成本。

PostgreSQL 例子

1、建表如下：

postgres=# create table test1(a int, b text, primary key(a));  CREATE TABLE  postgres=# create table test2(a int, b text, primary key(a));  CREATE TABLE  postgres=# alter table test1 add constraint testcheck foreign key(a) references test2(a);                                                                    ALTER TABLE  postgres=# insert into test2 select generate_series(1,1000000),'abcdefg';  INSERT 0 1000000  postgres=# insert into test1 select generate_series(1,1000000,2),'abcdefg';  INSERT 0 500000    analyze test1;  analyze test2;

2、查询SQL如下：

explain (analyze,verbose,timing,costs,buffers) select * from test2 left join test1 on test2.a = test1.a where test2.a > 500000 limit 10;

该语句中表结构比较特殊，两个关联字段都是主键，并且存在外键约束关系，查询计划如下：

                                                                     QUERY PLAN                                                                       ----------------------------------------------------------------------------------------------------------------------------------------------------   Limit  (cost=0.73..0.89 rows=10 width=24) (actual time=54.729..54.739 rows=10 loops=1)     Output: test2.a, test2.b, test1.a, test1.b     Buffers: shared hit=2042     ->  Merge Left Join  (cost=0.73..7929.35 rows=498340 width=24) (actual time=54.728..54.735 rows=10 loops=1)           Output: test2.a, test2.b, test1.a, test1.b           Inner Unique: true           Merge Cond: (test2.a = test1.a)           Buffers: shared hit=2042           ->  Index Scan using test2_pkey on public.test2  (cost=0.37..3395.42 rows=498340 width=12) (actual time=0.017..0.020 rows=10 loops=1)                 Output: test2.a, test2.b                 Index Cond: (test2.a > 500000)                 Buffers: shared hit=4           ->  Index Scan using test1_pkey on public.test1  (cost=0.37..2322.99 rows=500000 width=12) (actual time=0.006..34.120 rows=250006 loops=1)                 Output: test1.a, test1.b                 Buffers: shared hit=2038   Planning Time: 0.216 ms   Execution Time: 54.765 ms  (17 rows)

从执行计划上可以看出，根据test2表先查询出满足条件的10条记录，然后和test1表采用mergejoin关联，由于在估算的时候没有考虑到limit的影响，估算的行数非常大，是498340行，

实际采用nestloop效果会更好（关闭掉seqscan和megejoin）

postgres=# set enable_seqscan =off;   SET  postgres=# set enable_mergejoin =off;   SET  postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test2 left join test1 on test2.a = test1.a where test2.a > 500000 limit 10;                                                                    QUERY PLAN                                                                     -----------------------------------------------------------------------------------------------------------------------------------------------   Limit  (cost=0.73..4.53 rows=10 width=24) (actual time=0.040..0.060 rows=10 loops=1)     Output: test2.a, test2.b, test1.a, test1.b     Buffers: shared hit=39     ->  Nested Loop Left Join  (cost=0.73..189339.64 rows=498340 width=24) (actual time=0.039..0.057 rows=10 loops=1)           Output: test2.a, test2.b, test1.a, test1.b           Inner Unique: true           Buffers: shared hit=39           ->  Index Scan using test2_pkey on public.test2  (cost=0.37..3395.42 rows=498340 width=12) (actual time=0.025..0.027 rows=10 loops=1)                 Output: test2.a, test2.b                 Index Cond: (test2.a > 500000)                 Buffers: shared hit=4           ->  Index Scan using test1_pkey on public.test1  (cost=0.37..0.37 rows=1 width=12) (actual time=0.002..0.002 rows=0 loops=10)                 Output: test1.a, test1.b                 Index Cond: (test2.a = test1.a)                 Buffers: shared hit=35   Planning Time: 0.112 ms   Execution Time: 0.078 ms  (17 rows)

但是从评估的成本来看，merge join+limit 比 nestloop+limit更低，原因是nestloop的总成本更高(189339 比 7929)。所以优化器根据比例算法（未参照merge join的启动成本），认为在LIMIT的情况下，也是merge join成本更低。

实际情况是，MERGE JOIN的没带查询条件的B表，需要从索引的头部开始扫，而不是从指定位置开始扫。因此实际情况是merge join是更慢的。

目前优化器使用hash join时，已经算上了startup cost，例子

postgres=# set enable_mergejoin =off;SETpostgres=# set enable_seqscan =off;SETpostgres=# set enable_nestloop =off;SET 启动成本=3536.51postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test2 left join test1 on test2.a = test1.a where test2.a > 500000 limit 10;                                                                        QUERY PLAN                                                                        ---------------------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=3536.51..3536.61 rows=10 width=24) (actual time=158.148..158.219 rows=10 loops=1)   Output: test2.a, test2.b, test1.a, test1.b   Buffers: shared hit=4079, temp written=1464   ->  Hash Left Join  (cost=3536.51..8135.83 rows=494590 width=24) (actual time=158.147..158.215 rows=10 loops=1)         Output: test2.a, test2.b, test1.a, test1.b         Inner Unique: true         Hash Cond: (test2.a = test1.a)         Buffers: shared hit=4079, temp written=1464         ->  Index Scan using test2_pkey on public.test2  (cost=0.37..3369.86 rows=494590 width=12) (actual time=0.023..0.027 rows=26 loops=1)               Output: test2.a, test2.b               Index Cond: (test2.a > 500000)               Buffers: shared hit=4         ->  Hash  (cost=2322.99..2322.99 rows=500000 width=12) (actual time=156.848..156.849 rows=500000 loops=1)               Output: test1.a, test1.b               Buckets: 262144  Batches: 4  Memory Usage: 7418kB               Buffers: shared hit=4072, temp written=1464               ->  Index Scan using test1_pkey on public.test1  (cost=0.37..2322.99 rows=500000 width=12) (actual time=0.011..72.506 rows=500000 loops=1)                     Output: test1.a, test1.b                     Buffers: shared hit=4072 Planning Time: 0.141 ms Execution Time: 162.086 ms(21 rows)

改进建议

针对test1表，需要估算a<500000有多少行，作为索引扫描的startup成本。

postgres=# explain select * from test1 where a<500000;                                     QUERY PLAN                                      ---------------------------------------------------------------------------------   Index Scan using test1_pkey on test1  (cost=0.37..1702.83 rows=249893 width=12)     Index Cond: (a < 500000)  (2 rows)      postgres=# explain select * from test1;                           QUERY PLAN                            -------------------------------------------------------------   Seq Scan on test1  (cost=0.00..133.15 rows=500000 width=12)  (1 row)

所以，索引扫描test1（where a > 500000）的merge join启动成本应该有 1702，加上这个成本后，成本远大于NEST LOOP JOIN的成本，所以不会选择merge join。

Oracle 例子

create table test1(a int, b varchar2(4000), primary key(a));    create table test2(a int, b varchar2(4000), primary key(a));    alter table test1 add constraint testcheck foreign key(a) references test2(a);                                                                        insert into test2 select rownum, 'abcdefg' from dual connect by level <=1000000;      insert into test1 select * from (select rownum as rn, 'abcdefg' from dual connect by level <=1000000) t where mod(rn,2)=1;

exec DBMS_STATS.GATHER_TABLE_STATS('DIGOAL', 'TEST1', method_opt => 'FOR COLUMNS (a, b)');   exec DBMS_STATS.GATHER_TABLE_STATS('DIGOAL', 'TEST2', method_opt => 'FOR COLUMNS (a, b)');

查询SQL如下：

set autotrace on  set linesize 120  set pagesize 200  set wrap off    select * from test2 left join test1 on test2.a = test1.a where test2.a > 500000 and rownum<=10;               A B  ---------- -------------------------------------------------------------------------------------------------------------      500001 abcdefg      500002 abcdefg      500003 abcdefg      500004 abcdefg      500005 abcdefg      500006 abcdefg      500007 abcdefg      500008 abcdefg      500009 abcdefg      500010 abcdefg    10 rows selected.      Execution Plan  ----------------------------------------------------------  Plan hash value: 3391785554    -----------------------------------------------------------------------------------------------  | Id  | Operation                     | Name          | Rows  | Bytes | Cost (%CPU)| Time     |  -----------------------------------------------------------------------------------------------  |   0 | SELECT STATEMENT              |               |    10 |   500 |    15   (0)| 00:00:01 |  |*  1 |  COUNT STOPKEY                |               |       |       |            |          |  |   2 |   NESTED LOOPS OUTER          |               |    10 |   500 |    15   (0)| 00:00:01 |  |   3 |    TABLE ACCESS BY INDEX ROWID| TEST2         |    10 |   250 |     4   (0)| 00:00:01 |  |*  4 |     INDEX RANGE SCAN          | SYS_C00151146 |  9000 |       |     3   (0)| 00:00:01 |  |   5 |    TABLE ACCESS BY INDEX ROWID| TEST1         |     1 |    25 |     2   (0)| 00:00:01 |  |*  6 |     INDEX UNIQUE SCAN         | SYS_C00151145 |     1 |       |     1   (0)| 00:00:01 |  -----------------------------------------------------------------------------------------------    Predicate Information (identified by operation id):  ---------------------------------------------------       1 - filter(ROWNUM<=10)     4 - access("TEST2"."A">500000)     6 - access("TEST2"."A"="TEST1"."A"(+))         filter("TEST1"."A"(+)>500000)      Statistics  ----------------------------------------------------------            0  recursive calls            0  db block gets           25  consistent gets            0  physical reads            0  redo size          937  bytes sent via SQL*Net to client          500  bytes received via SQL*Net from client            2  SQL*Net roundtrips to/from client            0  sorts (memory)            0  sorts (disk)           10  rows processed

Oracle 选择了nestloop join。

使用HINT，让Oracle使用merge join，看看成本是多少，是否与修正PostgreSQL merge join启动成本接近。

select /*+ USE_MERGE(test2,test1) */ * from test2 left join test1 on test2.a = test1.a where test2.a > 500000 and rownum<=10;  Execution Plan----------------------------------------------------------Plan hash value: 492577188------------------------------------------------------------------------------------------------| Id  | Operation                      | Name          | Rows  | Bytes | Cost (%CPU)| Time     |------------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT               |               |    10 |   750 |    29   (7)| 00:00:01 ||*  1 |  COUNT STOPKEY                 |               |       |       |            |          ||   2 |   MERGE JOIN OUTER             |               |    10 |   750 |    29   (7)| 00:00:01 ||   3 |    TABLE ACCESS BY INDEX ROWID | TEST2         |    10 |   250 |     4   (0)| 00:00:01 ||*  4 |     INDEX RANGE SCAN           | SYS_C00151146 |  9000 |       |     3   (0)| 00:00:01 ||*  5 |    SORT JOIN                   |               | 25000 |   610K|    25   (8)| 00:00:01 ||   6 |     TABLE ACCESS BY INDEX ROWID| TEST1         | 25000 |   610K|    23   (0)| 00:00:01 ||*  7 |      INDEX RANGE SCAN          | SYS_C00151145 |  4500 |       |    11   (0)| 00:00:01 |------------------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter(ROWNUM<=10)   4 - access("TEST2"."A">500000)   5 - access("TEST2"."A"="TEST1"."A"(+))       filter("TEST2"."A"="TEST1"."A"(+))   7 - access("TEST1"."A"(+)>500000)Statistics----------------------------------------------------------          1  recursive calls          0  db block gets       1099  consistent gets          0  physical reads          0  redo size        937  bytes sent via SQL*Net to client        500  bytes received via SQL*Net from client          2  SQL*Net roundtrips to/from client          1  sorts (memory)          0  sorts (disk)         10  rows processed

小结

1、PostgreSQL 在计算merge join+limit的成本时，优化器有优化的空间，可以考虑把启动成本算进来，提高优化器选择带limit输出的SQL的JOIN方法的正确性。

2、如果是inner join，通过query rewrite可以对merge join进行优化，跳过不符合条件的头部INDEX SCAN。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test2 join test1 on test2.a = test1.a where test2.a > 500000 limit 10;                                                                     QUERY PLAN                                                                     ---------------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=0.77..1.09 rows=10 width=24) (actual time=54.626..54.638 rows=10 loops=1)   Output: test2.a, test2.b, test1.a, test1.b   Buffers: shared hit=2042   ->  Merge Join  (cost=0.77..7895.19 rows=247295 width=24) (actual time=54.625..54.635 rows=10 loops=1)         Output: test2.a, test2.b, test1.a, test1.b         Inner Unique: true         Merge Cond: (test2.a = test1.a)         Buffers: shared hit=2042         ->  Index Scan using test2_pkey on public.test2  (cost=0.37..3369.86 rows=494590 width=12) (actual time=0.017..0.020 rows=19 loops=1)               Output: test2.a, test2.b               Index Cond: (test2.a > 500000)               Buffers: shared hit=4         ->  Index Scan using test1_pkey on public.test1  (cost=0.37..2322.99 rows=500000 width=12) (actual time=0.008..34.009 rows=250010 loops=1)               Output: test1.a, test1.b               Buffers: shared hit=2038 Planning Time: 0.244 ms Execution Time: 54.669 ms(17 rows)  sql rewrite:  可以做到内核里面，这样就不需要改SQL了。效果如下，超好。postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test2 join test1 on test2.a = test1.a where test2.a > 500000 and test1.a > 500000limit 10;                                                                  QUERY PLAN                                                                   ----------------------------------------------------------------------------------------------------------------------------------------------- Limit  (cost=0.75..1.30 rows=10 width=24) (actual time=0.035..0.048 rows=10 loops=1)   Output: test2.a, test2.b, test1.a, test1.b   Buffers: shared hit=8   ->  Merge Join  (cost=0.75..6711.51 rows=123700 width=24) (actual time=0.034..0.044 rows=10 loops=1)         Output: test2.a, test2.b, test1.a, test1.b         Inner Unique: true         Merge Cond: (test2.a = test1.a)         Buffers: shared hit=8         ->  Index Scan using test2_pkey on public.test2  (cost=0.37..3369.86 rows=494590 width=12) (actual time=0.015..0.019 rows=19 loops=1)               Output: test2.a, test2.b               Index Cond: (test2.a > 500000)               Buffers: shared hit=4         ->  Index Scan using test1_pkey on public.test1  (cost=0.37..1704.30 rows=250106 width=12) (actual time=0.015..0.017 rows=10 loops=1)               Output: test1.a, test1.b               Index Cond: (test1.a > 500000)               Buffers: shared hit=4 Planning Time: 0.276 ms Execution Time: 0.074 ms(18 rows)