{"id":465875,"date":"2025-07-03T09:39:07","date_gmt":"2025-07-03T09:39:07","guid":{"rendered":"http:\/\/savepearlharbor.com\/?p=465875"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=465875","title":{"rendered":"<span>Automated management of extended statistics in PostgreSQL<\/span>"},"content":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<h4>Extended statistics \u2014 what is it?<\/h4>\n<p>The\u00a0extended statistics\u00a0feature lets you tell PostgreSQL to collect additional statistics on a set of table columns. Why would you need that? Let me quickly explain using the <a href=\"https:\/\/datasets.wri.org\/dataset\/globalpowerplantdatabase\" rel=\"noopener noreferrer nofollow\">public power plant dataset<\/a>. For instance, the\u00a0<code>primary_fuel<\/code>\u00a0used by a power plant is implicitly tied to the\u00a0<code>country<\/code>\u00a0column. So, if you run a simple query like:<\/p>\n<pre><code class=\"sql\">SELECT count(*) FROM power_plants WHERE country = '&lt;XXX&gt;' AND primary_fuel = 'Solar';<\/code><\/pre>\n<p>You\u2019ll see that Norway returns 0 rows, while Spain returns 243. That\u2019s obvious to us \u2014 latitude, climate, etc. \u2014 but the database doesn\u2019t know that. During query planning, PostgreSQL estimates the row count incorrectly: 93 for Norway and 253 for Spain. In a more complex query, where this estimate feeds into a JOIN or some other operator, the consequences can be unfortunate. Extended statistics calculate joint value distributions across column sets and help detect these dependencies.<\/p>\n<p>ORMs often make things worse. 
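<\/p>
<p>The Norway misestimate above comes from the planner\u2019s independence assumption: without extended statistics, per-column selectivities are simply multiplied. A minimal sketch of that arithmetic \u2014 the frequencies below are hypothetical, chosen only to illustrate the mechanism (the real ones live in\u00a0<code>pg_statistic<\/code>):<\/p>

```python
# Sketch of the planner's independence assumption for a two-column filter.
# The selectivities below are HYPOTHETICAL illustration values; PostgreSQL
# reads the real per-column frequencies from pg_statistic.
ntuples = 34936           # table size, taken from the EXPLAIN output above
sel_country = 0.001       # hypothetical selectivity of country = 'NOR'
sel_fuel = 0.12           # hypothetical selectivity of primary_fuel = 'Solar'

# Without extended statistics the two selectivities are multiplied,
# so the planner predicts a few rows even when the true joint count is 0.
estimate = ntuples * sel_country * sel_fuel
print(round(estimate))  # 4
```

<p>A joint MCV list removes this assumption for the most common combinations, and lets the planner see that a combination absent from the list is rare.<\/p>
<p>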
In this same power plant dataset, you might see filters on\u00a0<code>country<\/code>\u00a0and\u00a0<code>country_long<\/code>. From their descriptions, it\u2019s clear these two fields are directly related. But when an ORM generates a GROUP BY on both fields, the estimation fails dramatically:<\/p>\n<pre><code class=\"sql\">EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)\nSELECT country, country_long FROM power_plants\nGROUP BY country, country_long;\n\nHashAggregate  (rows=3494 width=16) (actual rows=167.00 loops=1)\n  Group Key: country, country_long\n  -&gt;  Seq Scan on power_plants  (rows=34936 width=16)\n                                (actual rows=34936.00 loops=1)<\/code><\/pre>\n<p>A human would never write that query, but in the age of AI and auto-generated queries, this is the kind of mess we need to deal with.<\/p>\n<p>So what does\u00a0extended statistics\u00a0actually offer? It lets you define three types of stats on a column set:\u00a0MCV\u00a0(Most Common Values),\u00a0ndistinct, and\u00a0dependencies.<\/p>\n<p>For scan filters,\u00a0MCV\u00a0is the most effective: if the value combo you\u2019re filtering on is common in the table, the planner will estimate it accurately. If the combo is rare (like solar power in Norway), then PostgreSQL falls back to rough estimates like\u00a0<code>ntuples \/ ndistinct<\/code>, but can improve them by excluding the values covered by the MCV list.<\/p>\n<p>For estimating group counts (GROUP BY, DISTINCT, IncrementalSort, Memoize, Hash Join), the\u00a0<code>ndistinct<\/code>\u00a0stat on column combinations is very helpful.<\/p>\n<p>To see the effect of extended stats, let\u2019s apply it to our example:<\/p>\n<pre><code class=\"sql\">CREATE STATISTICS ON country, primary_fuel FROM power_plants;\nANALYZE;<\/code><\/pre>\n<p>Now the earlier query gives much more accurate cardinality estimates for filtering and grouping by these two fields. 
For Norway, it estimates 1 row; for Spain, 253 \u2014 feel free to try it with\u00a0<code>country = 'RUS'<\/code>\u00a0or\u00a0<code>'AUT'<\/code>. Sure, the table isn\u2019t that big, but the tool clearly works.<\/p>\n<p>Yet I rarely see\u00a0extended statistics\u00a0used in the wild. Probably one reason is fear that\u00a0<code>ANALYZE<\/code>\u00a0will take too long. But more likely, it\u2019s the complexity of knowing when and where to create the stats.<\/p>\n<h3>Finding the right stats definition<\/h3>\n<p>Are there any empirical rules for when and what kind of stats to create? I came up with two rules of thumb:<\/p>\n<p><strong>Rule 1 is based on index definitions<\/strong>. If a DBA goes to the trouble of creating an index on a set of columns, it probably means queries will often filter on those columns. And query performance on those filters is critical \u2014 a good reason to improve plan quality.<\/p>\n<p>Of course, not every multi-column filter suffers from bad estimates. That\u2019s a drawback of this rule \u2014 we might end up creating useless stats. And if the expected use case is point lookups, a small estimation error (1 vs 5 rows) doesn\u2019t matter much.<\/p>\n<p><strong>Rule 2 is based on real-world filter patterns<\/strong>. Here, we pick candidate queries using two factors: (1) how much load the query puts on the DB (e.g.\u00a0<a href=\"https:\/\/habr.com\/ru\/companies\/postgrespro\/articles\/914698\/\" rel=\"noopener noreferrer nofollow\">pages-read<\/a>), and (2) whether it contains multi-column scan filters. It would also be nice to consider only cases where the actual row count differs significantly from the estimate.<\/p>\n<p>This rule is more selective \u2014 it reduces the number of generated stats definitions. But it raises tricky questions:<\/p>\n<ul>\n<li>\n<p>When to create stats?\u00a0With Rule 1, you create stats when you create the index. With Rule 2, you either use a timer to collect queries, or trigger it manually. The absence of a heavy bonus-calculating query for 29 days doesn\u2019t mean we don\u2019t want it fast on day 30. 
The absence of a heavy bonus-calculating query for 29 days doesn\u2019t mean we don\u2019t want it fast on day 30. Even if its load is low, the accountant might not want to wait hours for it to run.<\/p>\n<\/li>\n<li>\n<p>How to clean up old stats?\u00a0With Rule 1, stats disappear when you drop the index. With Rule 2, it\u2019s not so clear. A query might stop showing up because the sales season ended \u2014 but it could return next year. This adds uncertainty and potential instability to planner behavior.<\/p>\n<\/li>\n<li>\n<p>How big should the estimate error be to justify new stats?\u00a0\u00d72? \u00d710? \u00d7100?<\/p>\n<\/li>\n<\/ul>\n<p>So, I decided to start with the simpler Rule 1 \u2014 and build the tech for Rule 2 later. The plan is to create a recommender that analyzes\u00a0<code>pg_stat_statements<\/code>\u00a0and finished query plans, and suggests stat definitions \u2014 with reasons for each suggestion.<\/p>\n<h3>Extension overview<\/h3>\n<p>The extension idea is simple (see the\u00a0<a href=\"https:\/\/github.com\/danolivo\/pg_index_stats\" rel=\"noopener noreferrer nofollow\">repo<\/a>). 
First, we need a hook to collect created object IDs \u2014 I used\u00a0<code>object_access_hook<\/code>.<\/p>\n<p>Then we need the right moment to filter out composite indexes and add stats definitions to the database \u2014\u00a0<code>ProcessUtility_hook<\/code>\u00a0worked well for this.<\/p>\n<p>Because extended stats (types\u00a0<code>ndistinct<\/code>\u00a0and\u00a0<code>dependencies<\/code>) are calculated for\u00a0<em>all<\/em>\u00a0column combinations, the computational cost grows fast:<\/p>\n<ul>\n<li>\n<p>3 columns = 4\u00a0<code>ndistinct<\/code>\u00a0stats and 9\u00a0<code>dependencies<\/code><\/p>\n<\/li>\n<li>\n<p>8 columns = 247 and 1016 respectively<br \/>No wonder PostgreSQL limits a statistics object to 8 columns.<\/p>\n<\/li>\n<\/ul>\n<p>To avoid overloading the DB, I added a\u00a0<code>columns_limit<\/code>\u00a0parameter (max columns per stat) and a\u00a0<code>stattypes<\/code>\u00a0parameter (which types to include).<\/p>\n<p>When auto-stats are created, a\u00a0<code>dependency<\/code>\u00a0is registered not only on the table but also on the index used as a template. So if the index is dropped, the stats go too.<br \/>Should the extension also own the stats (so\u00a0<code>DROP EXTENSION<\/code>\u00a0deletes them)? Not sure \u2014 the extension can work as a module without\u00a0<code>CREATE EXTENSION<\/code>, affecting all DBs in the cluster.<\/p>\n<p>To separate auto-generated stats from manual ones, the extension adds a comment to each stat object with the library and stat name.<\/p>\n<p>There are also\u00a0<code>pg_index_stats_remove<\/code>\u00a0and\u00a0<code>pg_index_stats_rebuild<\/code>\u00a0functions for bulk cleanup and regeneration \u2014 useful if the schema existed before loading the module or DB settings changed.<\/p>\n<p>One tricky part is reducing redundant stats. With many indexes, we need to cut down the ANALYZE load. 
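<\/p>
<p>As a side note, the combination counts quoted above are easy to verify, assuming one\u00a0<code>ndistinct<\/code>\u00a0item per column subset of at least two columns, and one\u00a0<code>dependencies<\/code>\u00a0item per choice of dependent column within each such subset. A quick sketch (Python used only for the arithmetic):<\/p>

```python
# Verify the growth of extended-statistics items with column count.
from math import comb

def ndistinct_items(n):
    # one ndistinct item per column subset of size >= 2
    return sum(comb(n, k) for k in range(2, n + 1))

def dependency_items(n):
    # for each subset of size k >= 2, any of its k columns
    # may act as the dependent attribute
    return sum(comb(n, k) * k for k in range(2, n + 1))

print(ndistinct_items(3), dependency_items(3))  # 4 9
print(ndistinct_items(8), dependency_items(8))  # 247 1016
```

<p>The counts match the figures above, and show why the cost explodes well before the 8-column cap.<\/p>
<p>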
I introduced a deduplication procedure (see\u00a0<code>pg_index_stats.compactify<\/code>).<\/p>\n<p>Example: if you have an index on\u00a0<code>t(x1,x2)<\/code>\u00a0and create one on\u00a0<code>t(x2,x1)<\/code>, no new stats are needed. If you already have\u00a0<code>t(x1,x2,x3)<\/code>, then creating\u00a0<code>t(x2,x1)<\/code>\u00a0<em>does<\/em>\u00a0require new\u00a0<code>MCV<\/code>\u00a0stats \u2014 but\u00a0<code>ndistinct<\/code>\u00a0and\u00a0<code>dependencies<\/code>\u00a0can be skipped.<\/p>\n<h3>Experiment<\/h3>\n<p>Theory is nice, but you need practice. I didn\u2019t have a loaded PostgreSQL instance handy, so I found an old dump from a system with ~10k tables and 3\u00d7 as many indexes. About 20k of those indexes had more than one column, and over 1k had five or more.<\/p>\n<p>A good test subject \u2014 though without a useful workload. Running\u00a0<code>ANALYZE<\/code>\u00a0on this DB took 22 seconds. With the extension and a 5-column limit, it took 55 seconds.<\/p>\n<p>Raw ANALYZE times by column limit and stat types:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Limit<\/p>\n<\/th>\n<th>\n<p align=\"left\">MCV<\/p>\n<\/th>\n<th>\n<p align=\"left\">MCV + NDISTINCT<\/p>\n<\/th>\n<th>\n<p align=\"left\">MCV + NDISTINCT + DEPENDENCIES<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">2<\/p>\n<\/td>\n<td>\n<p align=\"left\">21s<\/p>\n<\/td>\n<td>\n<p align=\"left\">27s<\/p>\n<\/td>\n<td>\n<p align=\"left\">28s<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">4<\/p>\n<\/td>\n<td>\n<p align=\"left\">23s<\/p>\n<\/td>\n<td>\n<p align=\"left\">37s<\/p>\n<\/td>\n<td>\n<p align=\"left\">67s<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">5<\/p>\n<\/td>\n<td>\n<p align=\"left\">24s<\/p>\n<\/td>\n<td>\n<p align=\"left\">51s<\/p>\n<\/td>\n<td>\n<p align=\"left\">118s<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">6<\/p>\n<\/td>\n<td>\n<p align=\"left\">25s<\/p>\n<\/td>\n<td>\n<p 
align=\"left\">68s<\/p>\n<\/td>\n<td>\n<p align=\"left\">196s<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">8<\/p>\n<\/td>\n<td>\n<p align=\"left\">30s<\/p>\n<\/td>\n<td>\n<p align=\"left\">137s<\/p>\n<\/td>\n<td>\n<p align=\"left\">574s<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>Clearly, covering all column combinations gets expensive \u2014 especially\u00a0<code>dependencies<\/code>. So either keep the limit at 3\u20135 columns, or go with Rule 2. Now I understand why SQL Server has a separate worker just for updating such stats \u2014 it\u00a0is\u00a0costly.<\/p>\n<p>How about redundancy cleanup? Let\u2019s run another test:<\/p>\n<pre><code class=\"sql\">SET pg_index_stats.columns_limit = 5;\nSET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';\nSET pg_index_stats.compactify = 'off';\nSELECT pg_index_stats_rebuild();\nANALYZE;\n\nSET pg_index_stats.compactify = 'on';\nSELECT pg_index_stats_rebuild();\nANALYZE;<\/code><\/pre>\n<p>To monitor stat objects, run:<\/p>\n<pre><code class=\"sql\">-- Total number of stat items\nSELECT sum(nelems) FROM (\n  SELECT array_length(stxkind, 1) AS nelems\n  FROM pg_statistic_ext\n) AS s;\n\n-- Total per stat type\nSELECT elem, count(elem) FROM (\n  SELECT unnest(stxkind) AS elem FROM pg_statistic_ext\n) AS s\nGROUP BY elem;<\/code><\/pre>\n<p>Results:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<th>\n<p align=\"left\">Metric<\/p>\n<\/th>\n<th>\n<p align=\"left\">Without dedup<\/p>\n<\/th>\n<th>\n<p align=\"left\">With dedup<\/p>\n<\/th>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">ANALYZE time (s)<\/p>\n<\/td>\n<td>\n<p align=\"left\">141<\/p>\n<\/td>\n<td>\n<p align=\"left\">123<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Total stat elements<\/p>\n<\/td>\n<td>\n<p align=\"left\">74,353<\/p>\n<\/td>\n<td>\n<p align=\"left\">61,409<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">MCV<\/p>\n<\/td>\n<td>\n<p align=\"left\">24,783<\/p>\n<\/td>\n<td>\n<p 
align=\"left\">24,089<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">DISTINCT<\/p>\n<\/td>\n<td>\n<p align=\"left\">24,783<\/p>\n<\/td>\n<td>\n<p align=\"left\">18,658<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">DEPENDENCIES<\/p>\n<\/td>\n<td>\n<p align=\"left\">24,783<\/p>\n<\/td>\n<td>\n<p align=\"left\">18,658<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">EXPRESSIONS<\/p>\n<\/td>\n<td>\n<p align=\"left\">4<\/p>\n<\/td>\n<td>\n<p align=\"left\">4<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>The gain isn\u2019t huge: about 13% in ANALYZE time (141 s vs 123 s) and 17% in the number of stat elements. Still, it protects against corner cases. An interesting note: dedup reduced the MCV count, meaning many indexes differ only in column order. Also,\u00a0<code>expression<\/code>\u00a0stats showed up, even though we never asked for them. PostgreSQL creates these automatically if the definition includes expressions. Not a huge issue, but it would be nice to control this behavior.<\/p>\n<h3>Comparing with joinsel<\/h3>\n<p>At Postgres Professional, we have another stats collector \u2014\u00a0<a href=\"https:\/\/postgrespro.com\/docs\/enterprise\/17\/runtime-config-query#GUC-ENABLE-COMPOUND-INDEX-STATS\" rel=\"noopener noreferrer nofollow\">joinsel<\/a>. It\u2019s not a direct competitor to\u00a0<code>extended statistics<\/code>. 
It creates a composite type based on the index definition and uses regular stats in\u00a0<code>pg_statistic<\/code>.<\/p>\n<p>Pros: it includes an MCV list and a histogram (so it can handle range filters), and it works through PostgreSQL core mechanics.<\/p>\n<p>Cons: no\u00a0<code>dependencies<\/code>, and only one\u00a0<code>ndistinct<\/code>\u00a0value for the whole composite type (though this can be fixed).<\/p>\n<p>Let\u2019s test ANALYZE with\u00a0<code>joinsel<\/code>:<\/p>\n<pre><code class=\"sql\">SET enable_compound_index_stats = 'on';\nSELECT pg_index_stats_remove();\n\\timing on\nANALYZE;\n\nTime: 41248.977 ms (00:41.249)<\/code><\/pre>\n<p>ANALYZE time doubled, which is reasonable. And the best part: the complexity grows linearly with the index column count.<\/p>\n<h3>Conclusion<\/h3>\n<p>The verdict on Rule 1: with careful limits, it\u2019s a valid and usable approach.<\/p>\n<p>Another takeaway: we need better\u00a0<code>extended statistics<\/code>\u00a0tools in PostgreSQL core \u2014 to fine-tune what gets generated.<\/p>\n<p>As for the assistant and Rule 2? Left for better times. If some brave soul with free time and patience wants to take it on \u2014 hit me up, I\u2019ll help you get started.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><!----><!----><\/div>\n<p><!----><\/p>\n<div class=\"tm-article-poll-container\"><!--[--><\/p>\n<div class=\"tm-article-poll tm-article-poll_variant-bordered\">\n<div class=\"tm-notice tm-notice_positive tm-article-poll__notice\"><!----><\/p>\n<div class=\"tm-notice__inner\"><!----><\/p>\n<div class=\"tm-notice__content\" data-test-id=\"notice-content\"><!--[--><span>Only registered users can participate in the poll. 
<a rel=\"nofollow\" href=\"\/kek\/v1\/auth\/habrahabr\/?back=\/ru\/companies\/postgrespro\/articles\/914702\/&#038;hl=ru\">Log in<\/a>, please.<\/span><!--]--><\/div>\n<\/div>\n<\/div>\n<p><!--[--><\/p>\n<div class=\"tm-article-poll__header\">Have you run into cases where standard table stats weren\u2019t enough?<\/div>\n<div class=\"tm-article-poll__answers\"><!--[--><\/p>\n<div class=\"tm-article-poll__answer\">\n<div class=\"tm-article-poll__answer-data\"><span class=\"tm-article-poll__answer-percent\">0% <\/span><span class=\"tm-article-poll__answer-label\">Yes<\/span><span class=\"tm-article-poll__answer-votes\">0<\/span><\/div>\n<div class=\"tm-article-poll__answer-bar\">\n<div class=\"tm-article-poll__answer-progress\" style=\"width: 0%\"><\/div>\n<\/div>\n<\/div>\n<div class=\"tm-article-poll__answer\">\n<div class=\"tm-article-poll__answer-data\"><span class=\"tm-article-poll__answer-percent tm-article-poll__answer-percent_winning\">0% <\/span><span class=\"tm-article-poll__answer-label\">No<\/span><span class=\"tm-article-poll__answer-votes\">0<\/span><\/div>\n<div class=\"tm-article-poll__answer-bar\">\n<div class=\"tm-article-poll__answer-progress tm-article-poll__answer-progress_winning\" style=\"width: 0%\"><\/div>\n<\/div>\n<\/div>\n<p><!--]--><\/div>\n<div class=\"tm-article-poll__stats\"> No one has voted yet. No abstentions. 
<\/div>\n<p><!--]--><\/div>\n<p><!--]--><\/div>\n<p> link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/914702\/\"> https:\/\/habr.com\/ru\/articles\/914702\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The\u00a0extended statistics\u00a0feature lets you tell PostgreSQL to collect additional statistics on a set of table columns. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-465875","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/465875","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=465875"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/465875\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=465875"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=465875"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=465875"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}