If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.
I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.
There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details of his setup and I failed to reproduce his result on mine. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect, but that isn't a problem, just an artifact of jemalloc that can be confusing. RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.
tl;dr
- For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
- With respect to peak RSS
- For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
- For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.
- For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc
- For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
- For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc
Prior art
I have several blog posts on using jemalloc with MyRocks.
- October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
- April 2017 - Performance for large, concurrent allocations
- April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
- August 2023 - RocksDB and glibc malloc
- September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS)
- September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
- October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0
Builds, configuration and hardware
I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.
The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.
For malloc the server uses:
- glibc
- tcmalloc
- provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
- enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf
- jemalloc
- provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
- enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf
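For reference, malloc-lib is an option read by mysqld_safe, so a minimal my.cnf fragment (assuming mysqld is started via mysqld_safe) looks like:

    [mysqld_safe]
    malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so

Whichever allocator is configured, it is worth confirming that mysqld actually loaded it. One quick check, not part of the benchmark scripts, is to look for the shared library in the process memory map (this assumes pidof returns a single mysqld pid):

    grep -E 'jemalloc|tcmalloc' /proc/$(pidof mysqld)/maps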
The configuration files are here for InnoDB and here for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM, so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using between 30G and 40G -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.
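For anyone who just wants the cache sizes without reading the full config files, a minimal sketch using the usual option names (rocksdb_block_cache_size is what this post calls the MyRocks buffer pool):

    # InnoDB my.cnf (80G buffer pool)
    [mysqld]
    innodb_buffer_pool_size=80G

    # MyRocks my.cnf (10G block cache)
    [mysqld]
    rocksdb_block_cache_size=10G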
Benchmark
I used sysbench and my usage is explained here. To save time I only run 27 of the 42 microbenchmarks and most test only 1 type of SQL statement.
The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.
Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.
The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256
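The peak VSZ and RSS values reported in the next section can be collected by sampling /proc while the benchmark runs. A minimal sketch, not the script used for the results below, that assumes pidof returns a single mysqld pid (VmPeak is peak VSZ and VmHWM is peak RSS, both maintained by the kernel):

    # print the running peaks for mysqld once per minute
    while true; do
      grep -E 'VmPeak|VmHWM' /proc/$(pidof mysqld)/status
      sleep 60
    done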
Peak VSZ and RSS
The tables below show the peak values for VSZ and RSS from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.
Peak values for InnoDB with an 80G buffer pool (VSZ and RSS in GB)
alloc     VSZ    RSS    RSS/80
glibc     88.2   86.5   1.08
tcmalloc  88.1   85.3   1.06
jemalloc  91.5   87.0   1.08
Peak values for MyRocks with a 10G buffer pool (VSZ and RSS in GB)
alloc     VSZ    RSS    RSS/10
glibc     46.1   36.2   3.62
tcmalloc  15.3   13.1   1.31
jemalloc  45.6   12.2   1.22
Performance: InnoDB
From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.
The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from the vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are much larger for jemalloc than for tcmalloc.
These results use the relative QPS, which is the following, where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with $allocator than with glibc malloc.
(QPS with $allocator) / (QPS with glibc malloc)
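For example, if a microbenchmark gets 1000 QPS with glibc malloc and 1280 QPS with tcmalloc (made-up numbers), then the relative QPS for tcmalloc is 1280 / 1000 = 1.28.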
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc
col-1   col-2
0.99    1.02    hot-points_range=100
1.05    1.04    point-query_range=100
0.96    0.99    points-covered-pk_range=100
0.98    0.99    points-covered-si_range=100
0.96    0.99    points-notcovered-pk_range=100
0.97    0.98    points-notcovered-si_range=100
0.97    1.00    random-points_range=1000
0.95    0.99    random-points_range=100
0.99    0.99    random-points_range=10
1.04    1.03    range-covered-pk_range=100
1.05    1.07    range-covered-si_range=100
1.04    1.03    range-notcovered-pk_range=100
0.98    1.00    range-notcovered-si_range=100
1.02    1.02    read-only-count_range=1000
1.05    1.07    read-only-distinct_range=1000
1.07    1.12    read-only-order_range=1000
1.28    1.09    read-only_range=10000
1.03    1.05    read-only_range=100
1.05    1.08    read-only_range=10
1.08    1.07    read-only-simple_range=1000
1.04    1.03    read-only-sum_range=1000
1.02    1.02    scan_range=100
1.01    1.00    delete_range=100
1.03    1.01    insert_range=100
1.02    1.02    read-write_range=100
1.03    1.03    read-write_range=10
1.01    1.02    update-index_range=100
1.15    0.98    update-inlist_range=100
1.06    0.99    update-nonindex_range=100
1.03    1.03    update-one_range=100
1.02    1.01    update-zipf_range=100
1.18    1.05    write-only_range=10000
Performance: MyRocks
From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.
The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from the vmstat metrics here I see that CPU/operation (cpu/o) and context switches/operation (cs/o) are both much larger for tcmalloc.
These results use the relative QPS, which is the following, where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with $allocator than with glibc malloc.
(QPS with $allocator) / (QPS with glibc malloc)
Relative to results with glibc malloc
col-1 : results with tcmalloc
col-2 : results with jemalloc
col-1   col-2
0.68    1.00    hot-points_range=100
1.04    1.04    point-query_range=100
1.09    1.09    points-covered-pk_range=100
1.00    1.09    points-covered-si_range=100
1.09    1.09    points-notcovered-pk_range=100
1.10    1.12    points-notcovered-si_range=100
1.08    1.08    random-points_range=1000
1.09    1.09    random-points_range=100
1.05    1.10    random-points_range=10
0.99    1.07    range-covered-pk_range=100
1.01    1.03    range-covered-si_range=100
1.05    1.09    range-notcovered-pk_range=100
1.10    1.09    range-notcovered-si_range=100
1.07    1.05    read-only-count_range=1000
1.00    1.00    read-only-distinct_range=1000
0.98    1.04    read-only-order_range=1000
1.03    1.03    read-only_range=10000
0.96    1.03    read-only_range=100
1.02    1.04    read-only_range=10
0.98    1.07    read-only-simple_range=1000
1.07    1.09    read-only-sum_range=1000
1.02    1.02    scan_range=100
1.05    1.03    delete_range=100
1.11    1.07    insert_range=100
0.96    0.97    read-write_range=100
0.94    0.95    read-write_range=10
1.08    1.04    update-index_range=100
1.08    1.07    update-inlist_range=100
1.09    1.04    update-nonindex_range=100
1.04    1.04    update-one_range=100
1.07    1.04    update-zipf_range=100
1.03    1.02    write-only_range=10000
Comments
Comment: why does VSZ matter?
Reply: It might matter, it might not. It is one more thing to worry about. It might matter because that can be memory that is used and swapped out, maybe there is a memory leak. It might not because that can just be a clever use of the address space by jemalloc. Figuring out which of these applies isn't free.
Comment: gotcha, thanks. i'm used to running without swap and only considered the latter case.