PostgreSQL の実行計画で Hash のコストは Hash Join に含まれる

PostgreSQL の実行計画で Hash ノードのコストは 0 と計算されています。

PostgreSQL 13.3 での実行計画例を示します。

testdb=# CREATE TABLE tab1(c1 integer);
testdb=# CREATE TABLE tab2(c1 integer);

testdb=# INSERT INTO tab1(c1) SELECT generate_series(1, 1000000);
testdb=# INSERT INTO tab2(c1) SELECT generate_series(1, 1000000);

testdb=# VACUUM ANALYZE;

testdb=# EXPLAIN ANALYZE SELECT * FROM tab1, tab2 WHERE tab1.c1 = tab2.c1;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=30832.00..70728.00 rows=1000000 width=8) (actual time=179.725..660.696 rows=1000000 loops=1)
   Hash Cond: (tab1.c1 = tab2.c1)
   ->  Seq Scan on tab1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.006..53.789 rows=1000000 loops=1)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4) (actual time=179.167..179.167 rows=1000000 loops=1)
         Buckets: 131072  Batches: 16  Memory Usage: 3227kB
         ->  Seq Scan on tab2  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.007..55.604 rows=1000000 loops=1)
 Planning Time: 0.192 ms
 Execution Time: 687.420 ms

Hash の初期推定コスト及び全体推定コストが子ノードの Seq Scan の全体推定コストと等しくなっています。上位ノードのコストは全ての子ノードのコストを含めることを踏まえると、Hash ノードのコストは加算されていないことが分かります。

Hash の処理にも時間がかかっていることは Seq Scan と Hash の actual time を比較して増えていることからも明らかです。このコストを実行計画に組み込む必要は無いのでしょうか？

実はハッシュ処理のコストは Hash Join のコストとして組み込まれています。

PostgreSQL 13.3 のソースコードから該当部分を抜粋します。

src/backend/optimizer/path/costsize.c
initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace, JoinType jointype, List *hashclauses, Path *outer_path, Path *inner_path, JoinPathExtraData *extra, bool parallel_hash)

/*
 * Cost of computing hash function: must do it once per input tuple. We
 * charge one cpu_operator_cost for each column's hash function.  Also,
 * tack on one cpu_tuple_cost per inner row, to model the costs of
 * inserting the row into the hashtable.
 *
 * XXX when a hashclause is more complex than a single operator, we really
 * should charge the extra eval costs of the left or right side, as
 * appropriate, here.  This seems more work than it's worth at the moment.
 */
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
	* inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

Hash のコストは Hash Join に含まれており、全体としてはハッシュ処理のコストも踏まえた実行計画になっていることが分かりました。

ぱと隊長日誌

ブログ運用もエンジニアとしての生き方も模索中

PostgreSQL の実行計画で Hash のコストは Hash Join に含まれる