Saturday, April 19, 2025

Battle of the Mallocators: part 2

This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:

  • MALLOC_ARENA_MAX with glibc malloc
    • see here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (8 X number_of_CPU_cores) so I tested it with 1, 8, 48 and 96 arenas.
  • compiling RocksDB and MyRocks with jemalloc specific code enabled
    • In my previous results I just set malloc-lib in my.cnf which uses LD_LIBRARY_PATH to link with your favorite malloc library implementation.
tl;dr: jemalloc
  • For mysqld with jemalloc enabled via malloc-lib (LD_LIBRARY_PATH) versus mysqld with jemalloc specific code enabled
    • performance, VSZ and RSS were similar
  • After adding rocksdb_cache_dump=0 to my.cnf for the binary with jemalloc specific code
    • performance is slightly better (excluding the outlier, the benefit is up to 3%)
    • peak VSZ is cut in half
    • peak RSS is reduced by ~9%
tl;dr: glibc malloc on a 48-core server
  • With 1 arena performance is lousy but the RSS bloat is mostly solved
  • With 8, 48 or 96 arenas the RSS bloat is still there
  • With 48 arenas there are still significant (5% to 10%) performance drops
  • With 96 arenas the performance drop was mostly ~2%
Building MyRocks with jemalloc support

This was harder than I expected. The first step was easy -- I added these to the CMake command line; the first is for MyRocks and the second is for RocksDB. When the first is set, HAVE_JEMALLOC is defined in config.h. When the second is set, ROCKSDB_JEMALLOC is defined on the compiler command line.

  -DHAVE_JEMALLOC=1
  -DWITH_JEMALLOC=1
The hard part is that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.
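To make that concrete, here is a hypothetical sketch of the full CMake invocation; the source path, build type and jemalloc library path are placeholders rather than the exact command I ran:

  # sketch only: enable the jemalloc specific code paths in MyRocks and RocksDB
  #   -DHAVE_JEMALLOC=1 -> defines HAVE_JEMALLOC in config.h (MyRocks)
  #   -DWITH_JEMALLOC=1 -> defines ROCKSDB_JEMALLOC on the compiler command line (RocksDB)
  cmake /path/to/fb-mysql-8.0.32 \
    -DCMAKE_BUILD_TYPE=Release \
    -DHAVE_JEMALLOC=1 \
    -DWITH_JEMALLOC=1
  # if the link then fails with unresolved symbols, replace libunwind.so with
  # /usr/lib/x86_64-linux-gnu/libjemalloc.so in the link command (what worked for me)
  make -j$(nproc)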

Running mysqld with MALLOC_ARENA_MAX

I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:

182a183,184
>   cmd="MALLOC_ARENA_MAX=1 $cmd"
>   echo Run :: $cmd
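
An alternative that might be sufficient, if the environment propagates from mysqld_safe to mysqld, is a sketch like this (paths are placeholders):

  # start mysqld_safe with the arena limit already in its environment
  MALLOC_ARENA_MAX=1 bin/mysqld_safe --defaults-file=/path/to/my.cnf &
  # confirm that the running mysqld actually sees the setting
  tr '\0' '\n' < /proc/$(pidof mysqld)/environ | grep MALLOC_ARENA_MAX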

Results: jemalloc

The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.

Results are here for 3 setups:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
    • This is the base case in the table below
    • this is what I used in my previous post and jemalloc is enabled via setting malloc-lib in my.cnf which uses LD_LIBRARY_PATH
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This is col-1 in the table below
    • MySQL with jemalloc specific code enabled at compile time
  • fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This is col-2 in the table below
    • MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf
These results use the relative QPS, which is the following where the base case is the binary with jemalloc enabled via malloc-lib. When this value is larger than 1.0 then QPS is larger than the base case.
(QPS for the column) / (QPS for the base case)
From the results below:
  • results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
  • results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

col-1   col-2
0.92    1.40    hot-points_range=100
1.00    1.01    point-query_range=100
1.01    1.02    points-covered-pk_range=100
0.94    1.03    points-covered-si_range=100
1.01    1.02    points-notcovered-pk_range=100
0.98    1.02    points-notcovered-si_range=100
1.01    1.03    random-points_range=1000
1.01    1.02    random-points_range=100
0.99    1.00    random-points_range=10
0.98    1.00    range-covered-pk_range=100
0.96    0.97    range-covered-si_range=100
0.98    0.98    range-notcovered-pk_range=100
1.00    1.02    range-notcovered-si_range=100
0.98    1.00    read-only-count_range=1000
1.01    1.01    read-only-distinct_range=1000
0.99    0.99    read-only-order_range=1000
1.00    1.00    read-only_range=10000
0.99    0.99    read-only_range=100
0.99    1.00    read-only_range=10
0.98    0.99    read-only-simple_range=1000
0.99    0.99    read-only-sum_range=1000
0.98    0.98    scan_range=100
1.01    1.02    delete_range=100
1.01    1.03    insert_range=100
0.99    1.01    read-write_range=100
1.00    1.01    read-write_range=10
1.00    1.02    update-index_range=100
1.02    1.02    update-inlist_range=100
1.01    1.03    update-nonindex_range=100
0.99    1.01    update-one_range=100
1.01    1.03    update-zipf_range=100
1.00    1.01    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.
  • jemalloc.1
    • base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128
  • jemalloc.2
    • col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
    • This has little impact on VSZ and RSS
  • jemalloc.3
    • col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
    • This cuts peak VSZ in half and reduces peak RSS by 9%
Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
jemalloc.1      45.6    12.2    1.22
jemalloc.2      46.0    12.5    1.25
jemalloc.3      20.2    11.6    1.16

Results: MALLOC_ARENA_MAX

The binaries tested are:
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128
    • base case in the table below
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
    • col-1 in the table below
    • uses MALLOC_ARENA_MAX=1
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
    • col-2 in the table below
    • uses MALLOC_ARENA_MAX=8
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
    • col-3 in the table below
    • uses MALLOC_ARENA_MAX=48
  • fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128
    • col-4 in the table below
    • uses MALLOC_ARENA_MAX=96
These results use the relative QPS, which is the following where the base case uses default glibc malloc without MALLOC_ARENA_MAX. When this value is larger than 1.0 then QPS is larger with MALLOC_ARENA_MAX set.
(QPS with MALLOC_ARENA_MAX set) / (QPS with default glibc malloc)
From the results below:
  • performance with 1 or 8 arenas is lousy
  • performance drops some (often 5% to 10%) with 48 arenas
  • performance drops ~2% with 96 arenas
Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128
col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128
col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128
col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

col-1   col-2   col-3   col-4
0.89    0.78    0.72    0.78    hot-points_range=100
0.23    0.61    0.96    0.98    point-query_range=100
0.31    0.86    0.96    1.01    points-covered-pk_range=100
0.24    0.87    0.95    1.01    points-covered-si_range=100
0.31    0.86    0.97    1.01    points-notcovered-pk_range=100
0.20    0.86    0.97    1.00    points-notcovered-si_range=100
0.35    0.79    0.96    1.01    random-points_range=1000
0.30    0.87    0.96    1.01    random-points_range=100
0.23    0.67    0.96    0.99    random-points_range=10
0.06    0.48    0.92    0.96    range-covered-pk_range=100
0.14    0.52    0.97    0.99    range-covered-si_range=100
0.13    0.46    0.91    0.97    range-notcovered-pk_range=100
0.23    0.87    0.96    1.01    range-notcovered-si_range=100
0.23    0.76    0.97    0.99    read-only-count_range=1000
0.56    1.00    0.96    0.97    read-only-distinct_range=1000
0.20    0.47    0.90    0.94    read-only-order_range=1000
0.68    1.04    1.00    1.00    read-only_range=10000
0.21    0.76    0.98    0.99    read-only_range=100
0.19    0.70    0.97    0.99    read-only_range=10
0.21    0.58    0.94    0.98    read-only-simple_range=1000
0.19    0.57    0.95    1.00    read-only-sum_range=1000
0.53    0.98    1.00    1.01    scan_range=100
0.30    0.81    0.98    1.00    delete_range=100
0.50    0.92    1.00    1.00    insert_range=100
0.23    0.72    0.97    0.98    read-write_range=100
0.20    0.67    0.96    0.98    read-write_range=10
0.33    0.88    0.99    1.00    update-index_range=100
0.36    0.76    0.94    0.98    update-inlist_range=100
0.30    0.85    0.98    0.99    update-nonindex_range=100
0.86    0.98    1.00    1.01    update-one_range=100
0.32    0.86    0.98    0.98    update-zipf_range=100
0.27    0.80    0.97    0.98    write-only_range=10000

The impact on VSZ and RSS is interesting. The table below shows the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.

Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested for 2, 4 and 6 arenas but I don't think glibc malloc + RocksDB are meant to be.

Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
default         46.1    36.2    3.62
arena = 1       15.9    14.1    1.41
arena = 8       32.6    27.7    2.77
arena = 48      35.2    29.2    2.92
arena = 96      39.3    32.5    3.25


Friday, April 11, 2025

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.

I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup and I failed to reproduce his result on mine. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem, it is just an artifact of jemalloc that can be confusing. RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.
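
A quick way to see that distinction for yourself is to compare the two columns for a running mysqld. This is just a sketch, not part of the benchmark scripts:

  # VSZ is reserved address space, RSS is resident memory; with jemalloc the
  # first can look inflated while the second stays close to the buffer pool size
  ps -o pid,vsz,rss,comm -p $(pidof mysqld)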

tl;dr

  • For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
  • With respect to peak RSS
    • For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
    • For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
  • For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
    • For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
    • For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

Prior art

I have several blog posts on using jemalloc with MyRocks.

  • October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
  • April 2017 - Performance for large, concurrent allocations
  • April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
  • August 2023 - RocksDB and glibc malloc
  • September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS) 
  • September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
  • October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

Builds, configuration and hardware

I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

For malloc the server uses:
  • glibc
    • version 2.35-0ubuntu3.9
  • tcmalloc
    • provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
  • jemalloc
    • provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
    • enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf
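
A sanity check worth doing (not part of the benchmark) is to confirm which allocator the running mysqld actually mapped. A sketch, using the Ubuntu 22.04 package paths listed above:

  # expect libjemalloc.so or libtcmalloc_minimal.so to show up when malloc-lib is set
  grep -E 'jemalloc|tcmalloc' /proc/$(pidof mysqld)/maps | head -3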

The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM, so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G for the buffer pool -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.

The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.

Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.

The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

Peak VSZ and RSS

The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

Peak values for InnoDB with 80G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/80
glibc           88.2    86.5    1.08
tcmalloc        88.1    85.3    1.06
jemalloc        91.5    87.0    1.08

Peak values for MyRocks with 10G buffer pool (VSZ and RSS in GB)
alloc           VSZ     RSS     RSS/10
glibc           46.1    36.2    3.62
tcmalloc        15.3    13.1    1.31
jemalloc        45.6    12.2    1.22
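
For reference, peak values like these can be read from the kernel while mysqld runs; a minimal sketch, not necessarily how the numbers above were collected:

  # VmPeak ~ peak VSZ and VmHWM ~ peak RSS, both reported in kB
  grep -E 'VmPeak|VmHWM' /proc/$(pidof mysqld)/status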

Performance: InnoDB

From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are much larger for jemalloc than for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1   col-2
0.99    1.02    hot-points_range=100
1.05    1.04    point-query_range=100
0.96    0.99    points-covered-pk_range=100
0.98    0.99    points-covered-si_range=100
0.96    0.99    points-notcovered-pk_range=100
0.97    0.98    points-notcovered-si_range=100
0.97    1.00    random-points_range=1000
0.95    0.99    random-points_range=100
0.99    0.99    random-points_range=10
1.04    1.03    range-covered-pk_range=100
1.05    1.07    range-covered-si_range=100
1.04    1.03    range-notcovered-pk_range=100
0.98    1.00    range-notcovered-si_range=100
1.02    1.02    read-only-count_range=1000
1.05    1.07    read-only-distinct_range=1000
1.07    1.12    read-only-order_range=1000
1.28    1.09    read-only_range=10000
1.03    1.05    read-only_range=100
1.05    1.08    read-only_range=10
1.08    1.07    read-only-simple_range=1000
1.04    1.03    read-only-sum_range=1000
1.02    1.02    scan_range=100
1.01    1.00    delete_range=100
1.03    1.01    insert_range=100
1.02    1.02    read-write_range=100
1.03    1.03    read-write_range=10
1.01    1.02    update-index_range=100
1.15    0.98    update-inlist_range=100
1.06    0.99    update-nonindex_range=100
1.03    1.03    update-one_range=100
1.02    1.01    update-zipf_range=100
1.18    1.05    write-only_range=10000

Performance: MyRocks

From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are both much larger for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc

col-1   col-2
0.68    1.00    hot-points_range=100
1.04    1.04    point-query_range=100
1.09    1.09    points-covered-pk_range=100
1.00    1.09    points-covered-si_range=100
1.09    1.09    points-notcovered-pk_range=100
1.10    1.12    points-notcovered-si_range=100
1.08    1.08    random-points_range=1000
1.09    1.09    random-points_range=100
1.05    1.10    random-points_range=10
0.99    1.07    range-covered-pk_range=100
1.01    1.03    range-covered-si_range=100
1.05    1.09    range-notcovered-pk_range=100
1.10    1.09    range-notcovered-si_range=100
1.07    1.05    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.98    1.04    read-only-order_range=1000
1.03    1.03    read-only_range=10000
0.96    1.03    read-only_range=100
1.02    1.04    read-only_range=10
0.98    1.07    read-only-simple_range=1000
1.07    1.09    read-only-sum_range=1000
1.02    1.02    scan_range=100
1.05    1.03    delete_range=100
1.11    1.07    insert_range=100
0.96    0.97    read-write_range=100
0.94    0.95    read-write_range=10
1.08    1.04    update-index_range=100
1.08    1.07    update-inlist_range=100
1.09    1.04    update-nonindex_range=100
1.04    1.04    update-one_range=100
1.07    1.04    update-zipf_range=100
1.03    1.02    write-only_range=10000

Thursday, March 27, 2025

Postgres 17.4 vs sysbench on a large server, revisited part 2

I recently shared two posts (here and here) with results for sysbench on a large server using Postgres versions 10 through 17. In general there were several large improvements over time, but one small regression that arrived in Postgres 11.x. This blog post provides more details on the problem using results from Postgres 10.23, 11.22 and 17.4.

The regression occurs starting in Postgres 11.22 because, starting in 11.x, Postgres is more likely to use a bitmap index scan. I have yet to learn why or how to prevent that.
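
A diagnostic that should show whether the plan choice alone explains the regression is to disable bitmap scans for a session and compare the plan and QPS. A sketch, where the database name and id range are placeholders borrowed from the example plans later in this post:

  # enable_bitmapscan is a planner setting; turning it off per session is a
  # diagnostic, not a fix I would put in a config file
  psql sbtest -c "SET enable_bitmapscan = off; EXPLAIN SELECT c FROM sbtest1 WHERE id BETWEEN 1406209 AND 1414208"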

Index scan vs bitmap index scan

Experts gave me great advice based on a few flamegraphs that I shared. It looked like Postgres started to use bitmap index scans more often starting in Postgres 11. Upstream sysbench does not collect query plans for the SQL that it uses, so I modified my fork to do that.

While the explain output helps, it would help even more were there a feature in Postgres to provide optimizer traces, similar to what MySQL has, to understand why some query plans are selected and others are rejected. Another feature request is to improve the official Postgres docs to provide more detail on 1) the difference between an index scan and a bitmap index scan and 2) the difference between lossy and non-lossy bitmap index scans (AFAIK, a lossy scan must recheck rows).
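
Related to the lossy vs non-lossy point: EXPLAIN ANALYZE already reports exact vs lossy heap blocks under a Bitmap Heap Scan, which is one way to see the difference today. A sketch, with a placeholder database name and id range:

  # look for "Heap Blocks: exact=N lossy=M" under the Bitmap Heap Scan node;
  # lossy blocks are the ones that force the Recheck Cond to be applied per row
  psql sbtest -c "EXPLAIN ANALYZE SELECT c FROM sbtest1 WHERE id BETWEEN 1406209 AND 1414208"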

The problem microbenchmark

The microbenchmarks in question do a range scan with aggregation and have a name like read-only_range=$X where X has been 10, 100 and 1000 but here I use X in 10, 100, 1000, 2000, 4000, 8000, 16000 and 32000. The value of X is the length of the range scan. These are run via oltp_read_only.lua and use these SQL statements.

Build, configuration, hardware, benchmark

These are described in my previous post. But a few things have changed for this report

  • I only tested Postgres versions 10.23, 11.22 and 17.4
  • I ran the read-only_range=$X microbenchmark for X in 10, 100, 1000, 2000, 4000, 8000, 16000 and 32000. Previously I ran it for X in 10, 100, 10000
  • I added new tests that each run only one of the SQL statements used by oltp_read_only.lua. All of the Lua scripts are here.
What didn't change is that the tests are run with 8 tables and 10M rows/table, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds.

The benchmark was repeated using configurations with work_mem set to 1MB, 2MB, 4MB, 8MB, 16MB and 32MB. The configuration files are here.

Query plans

This table shows that plans with bitmap index are far more frequent starting in Postgres 11. The numbers are similar if I count the number of occurrences of recheck.

The following tables show the number of occurrences of "bitmap index" in explain output for the read-only.range=X microbenchmarks. For read-only.range=1000 and read-only.range=2000 the counts are always 0. Note that the regressions are there at range=8000 and mostly don't occur for other values of range=X. It is interesting that 10.23 is least likely to use plans with bitmap index while 17.4 is most likely.
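
Counts like these can be produced by grepping the saved explain output. A sketch, where the file name is a placeholder for wherever a run stores its plans:

  # case-insensitive count of plan lines that mention a bitmap index scan
  grep -ic 'bitmap index' explain_output.txt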

For read-only.range=4000
dbms      1        2      4       8       16      32 -> work_mem(MB)
10.23     0        0      0       0        0       0
11.22     0        0      0       0        0       0
17.4    723     1020    635     935     1059    1008

For read-only.range=8000
dbms       1       2       4       8      16      32 -> work_mem(MB)
10.23     40       0       1       0       0     166
11.22   1133    1237    1044    1252    1280    1231
17.4    1280    1278    1279    1280    1280    1280

For read-only.range=16000
dbms       1       2       4       8      16      32 -> work_mem(MB)
10.23   1279    1279    1279    1278    1278    1278
11.22   1280    1280    1279    1279    1280    1278
17.4    1279    1280    1279    1279    1279    1279

Example plans for distinct_range

The read-only.range=8000 test uses 4 types of SQL aggregation queries - distinct_range, order_range, simple_range and sum_range. This section has example plans for distinct_range at work_mem=16M.

Below I show that the regressions are largest for sum_range and simple_range and much smaller for distinct_range and order_range -- while plans for all of these are switching from index scan to bitmap index scan.

Postgres 10.23
explain SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 4087499 AND 4095498 ORDER BY c
Unique  (cost=28211.06..28250.59 rows=7907 width=121)
  ->  Sort  (cost=28211.06..28230.82 rows=7907 width=121)
        Sort Key: c
        ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..27699.12 rows=7907 width=121)
              Index Cond: ((id >= 4087499) AND (id <= 4095498))

Postgres 11.22
explain SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 1359956 AND 1367955 ORDER BY c
Unique  (cost=29781.72..29823.36 rows=8327 width=121)
  ->  Sort  (cost=29781.72..29802.54 rows=8327 width=121)
        Sort Key: c
        ->  Bitmap Heap Scan on sbtest1  (cost=269.79..29239.49 rows=8327 width=121)
              Recheck Cond: ((id >= 1359956) AND (id <= 1367955))
              ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..267.70 rows=8327 width=0)
                    Index Cond: ((id >= 1359956) AND (id <= 1367955))

Postgres 17.4
explain SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 8646394 AND 8654393 ORDER BY c
Unique  (cost=31903.86..31949.03 rows=9033 width=121)
  ->  Sort  (cost=31903.86..31926.45 rows=9033 width=121)
        Sort Key: c
        ->  Bitmap Heap Scan on sbtest1  (cost=193.02..31310.35 rows=9033 width=121)
              Recheck Cond: ((id >= 8646394) AND (id <= 8654393))
              ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..190.76 rows=9033 width=0)
                    Index Cond: ((id >= 8646394) AND (id <= 8654393))

Example plans for order_range

The read-only.range=8000 test uses 4 types of SQL aggregation queries - distinct_range, order_range, simple_range and sum_range. This section has example plans for order_range at work_mem=16M.

Below I show that the regressions are largest for sum_range and simple_range and much smaller for distinct_range and order_range -- while plans for all of these are switching from index scan to bitmap index scan.

Postgres 10.23
table 1 : explain SELECT c FROM sbtest1 WHERE id BETWEEN 9271446 AND 9279445 ORDER BY c
Sort  (cost=26775.57..26794.32 rows=7501 width=121)
  Sort Key: c
  ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..26292.77 rows=7501 width=121)
        Index Cond: ((id >= 9271446) AND (id <= 9279445))

Postgres 11.22
explain SELECT c FROM sbtest1 WHERE id BETWEEN 9375999 AND 9383998 ORDER BY c
Sort  (cost=30444.65..30465.97 rows=8529 width=121)
  Sort Key: c
  ->  Bitmap Heap Scan on sbtest1  (cost=275.86..29887.79 rows=8529 width=121)
        Recheck Cond: ((id >= 9375999) AND (id <= 9383998))
        ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..273.73 rows=8529 width=0)
              Index Cond: ((id >= 9375999) AND (id <= 9383998))

Postgres 17.4
explain SELECT c FROM sbtest1 WHERE id BETWEEN 8530681 AND 8538680 ORDER BY c
Sort  (cost=27548.18..27567.43 rows=7701 width=121)
  Sort Key: c
  ->  Bitmap Heap Scan on sbtest1  (cost=167.37..27051.05 rows=7701 width=121)
        Recheck Cond: ((id >= 8530681) AND (id <= 8538680))
        ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..165.44 rows=7701 width=0)
              Index Cond: ((id >= 8530681) AND (id <= 8538680))

Example plans for simple_range

The read-only.range=8000 test uses 4 types of SQL aggregation queries - distinct_range, order_range, simple_range and sum_range. This section has example plans for simple_range at work_mem=16M.

Below I show that the regressions are largest for sum_range and simple_range and much smaller for distinct_range and order_range -- while plans for all of these are switching from index scan to bitmap index scan.

Postgres 10.23
explain SELECT c FROM sbtest1 WHERE id BETWEEN 7681343 AND 7689342
Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..28016.13 rows=7999 width=121)
  Index Cond: ((id >= 7681343) AND (id <= 7689342))

Postgres 11.22
explain SELECT c FROM sbtest1 WHERE id BETWEEN 1406209 AND 1414208
Bitmap Heap Scan on sbtest1  (cost=250.91..27065.17 rows=7656 width=121)
  Recheck Cond: ((id >= 1406209) AND (id <= 1414208))
  ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..249.00 rows=7656 width=0)
        Index Cond: ((id >= 1406209) AND (id <= 1414208))

Postgres 17.4
explain SELECT c FROM sbtest1 WHERE id BETWEEN 5487727 AND 5495726
Bitmap Heap Scan on sbtest1  (cost=170.27..27961.99 rows=7984 width=121)
  Recheck Cond: ((id >= 5487727) AND (id <= 5495726))
  ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..168.28 rows=7984 width=0)
        Index Cond: ((id >= 5487727) AND (id <= 5495726))

Example plans for sum_range

The read-only.range=8000 test uses 4 types of SQL aggregation queries - distinct_range, order_range, simple_range and sum_range. This section has example plans for sum_range at work_mem=16M.

Below I show that the regressions are largest for sum_range and simple_range and much smaller for distinct_range and order_range -- while plans for all of these are switching from index scan to bitmap index scan.

Postgres 10.23
explain SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1117274 AND 1125273
Aggregate  (cost=27542.60..27542.61 rows=1 width=8)
  ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..27522.96 rows=7856 width=4)
        Index Cond: ((id >= 1117274) AND (id <= 1125273))

Postgres 11.22
explain SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 2318912 AND 2326911
Aggregate  (cost=28030.44..28030.45 rows=1 width=8)
  ->  Bitmap Heap Scan on sbtest1  (cost=257.90..28010.57 rows=7948 width=4)
        Recheck Cond: ((id >= 2318912) AND (id <= 2326911))
        ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..255.92 rows=7948 width=0)
              Index Cond: ((id >= 2318912) AND (id <= 2326911))

Postgres 17.4
explain SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 1795996 AND 1803995
Aggregate  (cost=27179.49..27179.50 rows=1 width=8)
  ->  Bitmap Heap Scan on sbtest1  (cost=167.72..27160.16 rows=7735 width=4)
        Recheck Cond: ((id >= 1795996) AND (id <= 1803995))
        ->  Bitmap Index Scan on sbtest1_pkey  (cost=0.00..165.79 rows=7735 width=0)
              Index Cond: ((id >= 1795996) AND (id <= 1803995))

Results

While there are normally ~27 microbenchmarks (each microbenchmark uses sysbench to run tests from one Lua file) I added a few extra tests this time and I only share results from the read-only* microbenchmarks. Output from all tests is here.

The numbers below are the relative QPS which is the following where $version is either 11.22 or 17.4. When the relative QPS is < 1.0, then $version is slower than Postgres 10.23.
(QPS for $version) / (QPS for Postgres 10.23)

A summary of the results is:

  • the regression always occurs for the range=8000 microbenchmark and is less likely for other values for range=X. Note that range=X means the queries scan X rows.
  • from the four tests that each run only one of the SQL aggregation queries used by the standard read-only microbenchmark (read-only-distinct, read-only-order, read-only-simple and read-only-sum) the regression occurs in read-only-simple and read-only-sum but not in the others and the regression is the largest in read-only-sum. The SQL statements are here for read-only-distinct, read-only-order, read-only-simple and read-only-sum.

work_mem 1 MB
11.22   17.4
1.03    1.06    read-only_range=10
1.02    1.04    read-only_range=100
1.00    1.00    read-only_range=1000
1.00    1.02    read-only_range=2000
1.00    0.99    read-only_range=4000
0.95    0.95    read-only_range=8000
0.99    1.02    read-only_range=16000
1.00    1.04    read-only_range=32000
0.98    0.97    read-only-distinct_range=8000
0.98    0.99    read-only-order_range=8000
0.94    0.90    read-only-simple_range=8000
0.89    0.83    read-only-sum_range=8000

work_mem 2 MB
11.22   17.4
1.02    1.06    read-only_range=10
1.01    1.02    read-only_range=100
1.00    0.99    read-only_range=1000
0.99    1.01    read-only_range=2000
0.98    0.96    read-only_range=4000
0.94    0.93    read-only_range=8000
0.99    1.00    read-only_range=16000
0.98    1.02    read-only_range=32000
0.97    0.96    read-only-distinct_range=8000
0.96    0.98    read-only-order_range=8000
0.92    0.89    read-only-simple_range=8000
0.86    0.80    read-only-sum_range=8000


work_mem 4 MB
11.22   17.4
1.02    1.06    read-only_range=10
1.02    1.03    read-only_range=100
1.01    1.01    read-only_range=1000
1.00    1.02    read-only_range=2000
1.00    1.00    read-only_range=4000
0.96    0.94    read-only_range=8000
1.13    1.24    read-only_range=16000
1.06    1.11    read-only_range=32000
0.98    0.97    read-only-distinct_range=8000
0.98    0.99    read-only-order_range=8000
0.94    0.90    read-only-simple_range=8000
0.89    0.82    read-only-sum_range=8000


work_mem 8 MB
11.22   17.4
1.03    1.07    read-only_range=10
1.02    1.03    read-only_range=100
1.00    0.99    read-only_range=1000
1.00    1.01    read-only_range=2000
0.99    0.97    read-only_range=4000
0.95    0.94    read-only_range=8000
0.99    1.00    read-only_range=16000
0.99    1.03    read-only_range=32000
0.97    0.96    read-only-distinct_range=8000
0.97    0.98    read-only-order_range=8000
0.92    0.89    read-only-simple_range=8000
0.87    0.81    read-only-sum_range=8000


work_mem 16 MB
11.22   17.4
1.04    1.08    read-only_range=10
1.03    1.05    read-only_range=100
1.00    1.00    read-only_range=1000
1.00    1.02    read-only_range=2000
0.99    0.97    read-only_range=4000
0.94    0.94    read-only_range=8000
0.98    1.00    read-only_range=16000
0.99    1.03    read-only_range=32000
0.97    0.96    read-only-distinct_range=8000
0.97    0.99    read-only-order_range=8000
0.92    0.90    read-only-simple_range=8000
0.86    0.80    read-only-sum_range=8000


work_mem 32 MB
11.22   17.4
1.02    1.06    read-only_range=10
1.01    1.03    read-only_range=100
1.00    1.00    read-only_range=1000
0.99    1.02    read-only_range=2000
1.00    0.97    read-only_range=4000
0.95    0.94    read-only_range=8000
0.99    1.01    read-only_range=16000
0.99    1.04    read-only_range=32000
0.97    0.96    read-only-distinct_range=8000
0.97    0.99    read-only-order_range=8000
0.94    0.90    read-only-simple_range=8000
0.89    0.83    read-only-sum_range=8000

Sunday, March 16, 2025

At what level of concurrency do MySQL 5.7 and 8.0 become faster than 5.6?

Are MySQL 5.7 and 8.0 faster than 5.6? That depends a lot on the workload -- both types of SQL and amount of concurrency. Here I summarize results from sysbench on a larger server (48 cores) using 1, 4, 6, 8, 10, 20 and 40 clients to show how things change.

tl;dr

  • the workload here is microbenchmarks with a database cached by InnoDB
  • 5.7.44 is faster than 8.0.x at all concurrency levels on most microbenchmarks
  • for 5.6.51 vs 8.0.x
    • for point queries, 5.6.51 is faster at <= 8 clients
    • for range queries without aggregation 5.6.51 is always faster
    • for range queries with aggregation 5.6.51 is faster except at 40 clients
    • for writes, 5.6.51 is almost always faster at 10 or fewer clients (excluding update-index)
Performance summaries

For point queries:
  • 5.7.44 is always faster than 8.0
  • 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
  • at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
    • 5.7.44 becomes faster than 5.6.51 at 6+ clients
    • 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
    • Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
For range queries without aggregation
  • 5.7.44 is always faster than 8.0.x
  • 5.6.51 is always faster than 5.7.44 and 8.0.x
For range queries with aggregation
  • 5.7.44 is almost always faster than 8.0.x
  • 5.7.44 becomes faster than 5.6.51 at 6+ clients
  • 8.0.x becomes faster than 5.6.51 at 40 clients
For writes
  • For update-index
    • 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
    • There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients
  • 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
  • 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.28, 8.0.32, 8.0.39 and 8.0.41.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

The configuration files are named my.cnf.cz11a_c32r128 and are here for 5.6.51, 5.7.44, 8.0.28, 8.0.32, 8.0.39 and 8.0.41.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests run with 8 tables and 10M rows/table. The number of client threads varies from 1 to 40 (see the command lines below), read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds.

The command lines to run all tests are:
bash r.sh 8 10000000 180 300 md2 1 1 1
bash r.sh 8 10000000 180 300 md2 1 1 4
bash r.sh 8 10000000 180 300 md2 1 1 6
bash r.sh 8 10000000 180 300 md2 1 1 8
bash r.sh 8 10000000 180 300 md2 1 1 10
bash r.sh 8 10000000 180 300 md2 1 1 20
bash r.sh 8 10000000 180 300 md2 1 1 40

Results

For the results below I split the microbenchmarks into 4 groups: point queries, range queries without aggregation, range queries with aggregation, and writes. The spreadsheet with all data is here. Files with performance summaries for relative and absolute QPS are here. Values from iostat and vmstat per microbenchmark are here for 1 client, 4 clients, 6 clients, 8 clients, 10 clients, 20 clients and 40 clients. These help to explain why something is faster or slower because they show how much HW is used per query.

The relative QPS is the following where $version is >= 5.7.44.
(QPS for $version) / (QPS for MySQL 5.6.51)
The numbers in the spreadsheets are the relative QPS. When the relative QPS is > 1 then $version is faster than MySQL 5.6.51. When it is 3.0 then $version is 3X faster than the base case.

Results: charts 

Notes on the charts

  • the y-axis shows the relative QPS
  • the y-axis starts at 0.80 to make it easier to see differences
  • in some cases the y-axis truncates the good outliers, cases where the relative QPS is greater than 1.5. I do this to improve readability for values near 1.0. Regardless, the improvements are nice.
Results: point queries

Summary
  • 5.7.44 is always faster than 8.0
  • 8.0.28 suffers from bug 102037 - found by me with sysbench, fixed in 8.0.31
  • at what level of concurrency do most things get faster in 5.7 & 8.0 vs 5.6?
    • 5.7.44 becomes faster than 5.6.51 at 6+ clients
    • 8.0.x becomes faster than 5.6.51 at between 10 and 20 clients
    • Two of the microbenchmarks are always faster in 5.6.51 - both do non-covering queries on a secondary index
Results: range queries without aggregation

Summary
  • 5.7.44 is always faster than 8.0.x
  • 5.6.51 is always faster than 5.7.44 and 8.0.x
Results: range queries with aggregation

Summary
  • 5.7.44 is almost always faster than 8.0.x
  • 5.7.44 becomes faster than 5.6.51 at 6+ clients
  • 8.0.x becomes faster than 5.6.51 at 40 clients
Results: writes

The relative speedup for the update-index microbenchmark is frequently so large that it obscures the smaller changes on other microbenchmarks. So here I truncate the y-axis for some of the charts (for 6+ clients) and the section that follows has the charts without truncation.

Summary
  • For update-index
    • 5.7.44 and 8.0.x are always faster than 5.6.51 at 4+ clients
    • There is an odd drop from ~6X to ~3X for 8.0.32 and 8.0.39 at 20 clients but you can't see that on the charts in this section because of the truncation. It is visible in the next section. From vmstat I see an increase in CPU/operation (cpu/o) and context switches/operation (cs/o) at 20 clients but not at 40 clients.
  • 5.7.44 is mostly faster than 8.0.x for 1 to 20 clients and they have similar results at 40 clients
  • 5.7.44 & 8.0.x are always faster than 5.6.51 at 20+ clients
Results: charts for writes without truncation

The y-axis is truncated in the charts for writes in the previous section for 6+ clients. This section has those charts without truncation.
