Tuesday, August 23, 2022

How I do RocksDB performance tests, part 2

This extends on my previous post. This post isn't specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work so it might have an audience of one (me) but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.

Points

  • How did I get here?
    • Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development and spent much less time fixing bugs, but I still spend time reporting bugs. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing and others specialize in theirs. It is hard to be an expert in too many things.
  • Benchmarks are what you make of them
    • They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more valuable. Benchmarks informed by production are even better.
  • How does code age?
    • Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems, and spotting them early is a big deal because removing regressions long after they arrive is too expensive. This isn't just a technical problem. How do you reject new features that help a fraction of the user base when the cost is more CPU overhead for all users?
  • Needs improvement
    • I hope to get better about using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use benchmark workloads that are adaptive and use benchmarks that enforce response time constraints.
  • Build a network of smart peers
    • I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
  • Explain things
    • Explain your results. But find a balance because explaining everything will slow you down.
  • Testing an LSM is complicated
    • Old posts on this are here and here.
    • The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially in read-heavy ones. While this is still a work in progress there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction).
    • Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is: --benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    • Try to get the LSM tree into a steady state before read-heavy tests. For example, after fillseq there is no overlap between SSTs. After a full compaction there is only one level (or one SST). These are usually not steady states.
    • For a load + query benchmark it is easy for the LSM (or B-Tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won't be fragmented with a B-Tree and the SSTs won't overlap with an LSM -- which hides some of the overhead that occurs during query processing. When storage is a locally attached SSD and the workload is heavy on IO then you need to worry about non-determinism from the SSD -- you either want no SSD GC or to get SSD GC into a steady state (by running the test for long enough and having database files that are large enough, something between 50% and 90% of the device capacity).
  • Make the DBMS unhappy
    • Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
  • Define your goals
    • Do you care about average throughput or outliers (p99, p99.9, p99.99)? I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matter in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs for throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time.
  • Statistical rigor
    • Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I am usually less rigorous because I frequently repeat benchmark runs after making a mistake or to measure one more thing. Another way to think of this is: assume B units of HW capacity, each benchmark has a warmup cost of W and runtime of R, then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. A larger value for N implies a smaller value for R, and the confidence interval is a function of both N and R.
  • Coordinated omission
    • Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don't support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to set up as I need to discover an arrival rate that the DBMS can handle -- or I need a more complicated client that can discover it for me. One day I will use open-loop clients (there is a sketch of one after this list).
  • Response time constraints
    • The db_bench benchmark client for RocksDB doesn't have an option to enforce a response time constraint (ignore responses slower than X ms). Computing throughput without such a constraint is a problem: more concurrency usually means more throughput, but it also means worse response time and more response time outliers, and those slow responses should not be counted. Most of the benchmark clients that I use don't enforce a response time SLA. Enforcing an SLA is more work because you need to select a reasonable value, but I hope to improve on this and to add it to db_bench.
  • Single node
    • Most of my testing runs the client and the DBMS on the same server. While I prefer to use separate servers for client and server when the DBMS supports it, that introduces the risk of perf variance because I will be sharing the network.
  • Stable platform
    • I use HW at work, in the public cloud and in my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors, so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost. Alas, it is also the least capable -- 4 CPUs, 16G of RAM.
  • Compare things 
    • I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
  • Measure things
    • Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly, and most DBMS do much work in the background (excluding SQLite, which doesn't have background threads). Don't forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Watch disk space usage because growth there can lead to out-of-space errors. When something is slower, search top down. Look at iostat metrics to see if IO/query has changed. Look at vmstat to see if CPU/query has changed. Look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics -- IO/query, CPU/query, context switches/query (a sketch of this is after the list). I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks.
  • Summarize things
    • One practice I have is to create one-line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage and normalized HW usage (CPU/query, IO/query). One-line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
  • Name & archive things
    • A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming patterns reduce the chance of this. I try to archive the test scripts and command lines via GitHub, and all of my important scripts are there. Saving RocksDB LOG files is also important.
  • Adaptive benchmark clients
    • I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad.
  • Proactive vs reactive
    • Is it a bug when it has yet to happen in production? That is an interesting question and the answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding, and other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.
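
Below is a minimal sketch of an open-loop load generator, referenced from the coordinated omission point above. It is plain Python and everything in it is an assumption for illustration: do_query stands in for the real client call and the arrival rate must be chosen (or discovered) separately. The important part is that latency is measured from the scheduled arrival time rather than the send time, which is what avoids coordinated omission.

import time

def run_open_loop(do_query, arrival_rate_per_sec, duration_sec):
    # Issue requests on a fixed schedule and measure latency from the
    # scheduled arrival time so that a stalled server still accumulates latency.
    interval = 1.0 / arrival_rate_per_sec
    start = time.monotonic()
    latencies = []
    n = 0
    while n * interval < duration_sec:
        scheduled = start + n * interval
        now = time.monotonic()
        if scheduled > now:
            time.sleep(scheduled - now)
        do_query()  # placeholder for the real DBMS call
        latencies.append(time.monotonic() - scheduled)
        n += 1
    latencies.sort()
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    return len(latencies), p99

# Example: a fake 1-millisecond query at 500 requests/second for 10 seconds
print(run_open_loop(lambda: time.sleep(0.001), 500, 10))

A single-threaded loop like this still blocks on a slow query, so a real open-loop client would issue requests from a pool of workers or via async IO, but measuring from the scheduled time is the key idea.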
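
And here is a tiny sketch of the per-query normalization mentioned in the Measure things point. The counter names are made up; in practice they come from iostat, vmstat and the benchmark client.

def normalize(queries, counters):
    # Convert raw counters from one benchmark interval into per-query rates
    return {name + "/query": value / queries for name, value in counters.items()}

# Hypothetical counters scraped from vmstat and iostat during one interval
counters = {"cpu_secs": 1200.0, "read_requests": 5.0e6, "write_requests": 1.2e6, "cswitch": 9.0e6}
print(normalize(queries=2.5e6, counters=counters))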

Friday, November 6, 2020

Max row per group, sad answers only

Today I learned that frequently asked questions on StackOverflow get their own tag. The tag greatest-n-per-group is for answers to questions about writing SQL to find the max row per group. By max row I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group. By sad answers only I mean there is a lot of confusion about this, StackOverflow has over 3000 posts, and the query is harder to write than it needs to be.

I am writing about SQL rather than SQL engines and I am not an expert on writing SQL queries, but that might be appropriate given that I am writing about something that confuses users and could be easier. My motivation for writing this was a slow query plan for MySQL 8.0.22 while implementing the Time Series Benchmark Suite (TSBS) for MySQL.

Reproduction SQL is here and here and output from MySQL 8.0.22 is here and here. I updated the output for the queries with SHOW STATUS printed after each query.

The easy way

The answer is easy if you only want the aggregated and group by columns:

SELECT MAX(agg_col), gb_col FROM t GROUP BY gb_col

But it isn't easy when you want other columns -- columns that are not used for group by or aggregation. This would be easy had MySQL continued on the ANY_VALUE theme by adding FIRST_VALUE and LAST_VALUE as MongoDB does via the $first and $last aggregation accumulator operators. Well, MySQL has FIRST_VALUE and LAST_VALUE, but only as window functions and those don't provide the desired semantics. If they existed with semantics similar to MongoDB then the following query would work and is easy to write:

SELECT MAX(agg_col), gb_col, FIRST_VALUE(other_col) FROM t GROUP BY gb_col

I am not an expert. Perhaps one day I will learn why there isn't an easy way to do this. MySQL docs have a useful page on this type of query. I have yet to try a variant that uses LATERAL.

The less easy ways

There are many ways to write this query. None of them is as easy as the example I wrote above that isn't valid SQL. I tried: left join, correlated subquery, uncorrelated subquery and a rank() window function. The performant solutions were the uncorrelated subquery and the rank() window function. I was surprised that the rank() window function approach was performant because the explain output looked slow, but the runtime and slow log output were OK. By performant I mean a query plan that examines few rows, and loose index scan is an example of that.

This table shows the response time for each query type and the number of rows examined from the slow log. For the window function approach I am confused both by the low value for rows examined and the plan that shows a table scan. I am curious if there is a bug.

Approach             Response time (secs)    Rows examined
Uncorrelated         0.00                    20
Window function      0.10                    10
Left join            1.07                    245750
Correlated           134.11                  671170560

Loose index scan

Update - when I published this I claimed the index was on (j DESC, pk) but that was a mistake.

Before I show the queries, my goal is to get a plan that uses the loose index scan optimization. The test table is: create table tq(pk int primary key, j int, k int), there is an index on (j, pk DESC) and a NOT NULL constraint on j. This query uses a loose index scan; however, it can't provide the value for the column k. The loose index scan is performant because it fetches one entry from the index per distinct value for (j,pk).

SELECT max(pk), j FROM tq GROUP BY j
EXPLAIN: -> Group aggregate (computed in earlier step): max(tq.pk)
    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10) 

Uncorrelated subquery

This plan is performant because t2 is materialized via a loose index scan and the result from that does one point query per distinct value in j.

SELECT t1.pk, t1.j, t1.k
FROM tq t1, (SELECT max(pk) as maxpk, j FROM tq GROUP BY j) t2
WHERE t2.maxpk = t1.pk

EXPLAIN: -> Nested loop inner join
    -> Filter: (t2.maxpk is not null)
        -> Table scan on t2  (cost=3.62 rows=10)
            -> Materialize
                -> Group aggregate (computed in earlier step): max(tq.pk)
                    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10)
    -> Single-row index lookup on t1 using PRIMARY (pk=t2.maxpk)  (cost=0.26 rows=1)

Rank window function

This query is slightly slower than the uncorrelated subquery above. I didn't expect that given the plan that has a table scan on tq. Adding hints to use the index on (j,pk) doesn't change the query plan. I wonder if this explain output is correct because the query doesn't do a full scan when run and it is almost as fast as the uncorrelated approach.

WITH t1 AS (SELECT pk, j, k,
    RANK() OVER (PARTITION by j ORDER BY pk DESC) AS myrank FROM tq)
SELECT pk, j, k from t1 WHERE myrank=1

EXPLAIN: -> Index lookup on t1 using <auto_key0> (myrank=1)
    -> Materialize CTE t1
        -> Window aggregate: rank() OVER (PARTITION BY tq.j ORDER BY tq.pk desc )
            -> Sort: tq.j, tq.pk DESC  (cost=8273.25 rows=82170)
                -> Table scan on tq  (cost=8273.25 rows=82170)

Left join

I don't expect this query to be performant because there isn't an equality predicate on pk. This might be a useful approach when there isn't an index on (j,pk), but that is not the case here and this plan examines too many rows.

SELECT t1.pk, t1.j, t1.k FROM tq t1
LEFT JOIN tq t2 ON t1.j = t2.j AND t1.pk < t2.pk
WHERE t2.j IS NULL

EXPLAIN: -> Filter: (t2.j is null)  (cost=76209940.12 rows=750212100)
    -> Nested loop antijoin  (cost=76209940.12 rows=750212100)
        -> Table scan on t1  (cost=8273.25 rows=82170)
        -> Filter: (t1.pk < t2.pk)  (cost=14.38 rows=9130)
            -> Index lookup on t2 using x (j=t1.j)  (cost=14.38 rows=9130)

Correlated subquery

The correlated subquery isn't performant. It examines too many rows. That isn't a surprise.

SELECT pk, j, k FROM   tq t1
WHERE  pk=(SELECT MAX(t2.pk) FROM tq t2 WHERE t1.j = t2.j)

EXPLAIN: -> Filter: (t1.pk = (select #2))  (cost=8273.25 rows=82170)
    -> Table scan on t1  (cost=8273.25 rows=82170)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: max(t2.pk)
            -> Index lookup on t2 using x (j=t1.j)  (cost=927.37 rows=9130)

Friday, October 23, 2020

LSM local secondary indexes (LSM LSI)

This expands on my previous post about RID-lists for RocksDB. RocksDB doesn't provide secondary indexes, nor does it know about schemas but applications that use RocksDB can add both and MyRocks is an example of that.

Many DBMS use the concept of local vs global secondary indexes and partitioned tables are one place where this matters. Assuming a B-Tree is used for the index, then a local secondary index with a partitioned table has a B-Tree index per-partition. The benefit of this is that DROP PARTITION is fast -- delete the table and index files for that partition. The cost from this is that a secondary index query might have to probe indexes per-partition and this can use more CPU and IO.

Without defining a local secondary index for an LSM I assert that secondary indexes with MyRocks are global. Global secondary index entries for an LSM use the PK value to reference the base row. Side effects of this include larger secondary index entries and CPU overhead when an LSM point read must be done to find the base row. Any DBMS with a clustered PK index, like InnoDB, also has these costs.

For the following I haven't done an extensive literature search to understand whether this idea has been proposed. Regardless, I hope that LSM LSI (see below) is the name to be used.

Local secondary indexes for an LSM

The concept of local secondary indexes also applies to a Log Structured Merge tree. An LSM global secondary index (LSM GSI) uses the PK value because the base row is in the primary key index and is likely to be relocated over time by compaction. Therefore, either secondary index entries must be updated with a new location (rowid) during compaction of the PK index or the secondary index entries must use the PK value. There are more complex solutions that I will ignore.

There are other costs when secondary index entries use the PK value rather than a rowid to find the base row -- it is harder to implement RID-lists and bitmap indexes. But I think this can be fixed by an LSM local secondary index (LSM LSI).

I define an LSM LSI to be local to a sorted run. When compaction merges sorted runs into a larger sorted run, then a new LSM LSI will be created for that new sorted run. The cost from this is fanout for queries that have a predicate on the secondary index and cannot be pruned to a few sorted runs. Pruning to a few sorted runs is most likely to be enabled via predicates on the PK.

The lifetime of the LSM LSI is the same as the lifetime of the data it indexes. Therefore, LSM LSI entries can use something like a rowid rather than the PK to reference the base row. This makes it much easier to implement RID-lists and bitmap indexes for an LSM. The rowid has two parts -- (block#, offset) where block# is a value from 0 to N-1 when an SST has N blocks and the offset is the position of the row in the block. The rowid usually needs <= 4 bytes.
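
As an illustration of how small such a rowid can be (this is not something RocksDB implements), a (block#, offset) pair can be packed into a 32-bit integer. The split between block and offset bits below is an arbitrary assumption.

BLOCK_BITS = 20   # up to ~1M blocks per SST -- an assumption
OFFSET_BITS = 12  # up to 4096 rows per block -- an assumption

def encode_rowid(block_no, offset):
    assert block_no < (1 << BLOCK_BITS) and offset < (1 << OFFSET_BITS)
    return (block_no << OFFSET_BITS) | offset

def decode_rowid(rowid):
    return rowid >> OFFSET_BITS, rowid & ((1 << OFFSET_BITS) - 1)

assert decode_rowid(encode_rowid(12345, 67)) == (12345, 67)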

Fanout is less of a concern for a bitmap index because many analytic queries expect to scan a large part of the index. But there will be more merging of bitmap index entries when LSM LSI is used, just as there would be with local bitmap indexes on a partitioned table with a famous SQL DBMS.

References

  • Cassandra SAI (Storage Attached Index) is an LSM local secondary index. It uses a trie for text keys and a kd-tree for numeric keys. I think the index entries use the row offset within the SST as the row pointer.
  • Cuckoo Index might be relevant, although it is a lighter weight index
  • There is a paper in SIGMOD 2018 on LSM secondary indexes
  • There is a paper from the AsterixDB project on LSM secondary indexes
  • Another paper from AsterixDB and Chen Luo on LSM secondary indexes


Wednesday, October 21, 2020

Self-aware DBMS, RID-lists for LSM trees

This is a short writeup of two ideas I might never have the time to implement -- self-aware DBMS and RID-lists for LSM trees.

Self-aware DBMS

Tuning a B-Tree index structure is hard and tuning an LSM tree is usually harder. Fortunately it will get easier over time. We tune a DBMS today via knobs and there is clever work in progress to automate changing of those knobs. But we can do even better. I hope one day for self-aware DBMS algorithms that understand the impact of their knobs, accept declarative goals & priorities from a user and are able to improve performance with respect to an objective function. Let's move the cleverness into the DBMS.

RID-lists for LSM trees

tl;dr - can I use the merge operator to implement RID-lists for an LSM?

I worked on bitmap indexes at Oracle. It would be fair to call them compressed RID-lists. Many smart people worked on that code in Oracle prior to me and the inventor, Gennady Antoshenkov, has a strong record of innovation. IBM DB2 has RID-lists, but I don't know whether they can be compressed. All of these are useful for analytics whether this is a bitmap index, compressed RID-list or uncompressed RID-list.

Postgres 13 has index deduplication which looks like an uncompressed RID-list and I look forward to new query execution features to do the equivalent of bitmap-AND and bitmap-OR.

But this is about a RID-list for an LSM tree. A RID-list is useful for an index that can have many duplicates for a key -- meaning some secondary indexes and not for a primary index. The LSMs I care about don't know much about schema and secondary index maintenance requires knowledge of schema but I will ignore that for now. MyRocks uses RocksDB for primary and secondary indexes, so this is something that can be solved by an application. WiredTiger supports schemas and secondary indexes without also providing a query language and someone could add the same to RocksDB.

I am curious whether the merge operator can be used to implement RID-lists. Disclaimer - I am far from an expert on the merge operator. A sketch of the merge logic follows the list below.

  • the rowid is the value of the primary key. Secondary index entries for an LSM and for Oracle IOT use the primary key value as the rowid because the base row can be relocated (during compaction with an LSM, during B-Tree leaf node splits and merges with IOT). Bitmap indexes for Oracle IOT solve this via a mapping table. If that is not done then the problem is that the primary key value might be large and a list of them might not be compression friendly.
  • the delta that is written by the merge operator specifies whether a rowid is added or removed for a given secondary key value. I wonder if something like SingleDelete is needed.
  • the callback function that combines merge operator entries during compaction will merge entries to create larger RID-lists
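
Here is the promised sketch of the merge logic. It is plain Python rather than the real RocksDB MergeOperator API, and the delta encoding ("+" to add a rowid, "-" to remove one) is an assumption made for illustration.

def apply_rid_deltas(existing_rids, deltas):
    # existing_rids: sorted list of rowids for one secondary key value (or None)
    # deltas: ordered merge-operator entries such as ("+", rowid) or ("-", rowid)
    rids = set(existing_rids or [])
    for op, rowid in deltas:
        if op == "+":
            rids.add(rowid)
        else:  # "-" removes a rowid, similar in spirit to SingleDelete
            rids.discard(rowid)
    return sorted(rids)

# Example: the base RID-list plus deltas written via the merge operator
print(apply_rid_deltas([3, 9, 42], [("+", 7), ("-", 9), ("+", 100)]))  # [3, 7, 42, 100]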

Thursday, October 15, 2020

Better vs optimal

I read many systems papers and appreciate how research makes my career more interesting. Some interesting papers use optimal when they mean better in describing performance and I wonder if I am an outlier in thinking that optimal is occasionally misused. Optimal is formal and requires one of:

  • A proof
  • An objective function and math to show where it is minimized or maximized
  • Exhaustive search

In the context of tuning a DBMS, exhaustive search is rarely done because the search space is large. Assume there are N tuning options that each use a B-bit integer; then the search space has N*B bits, which means 2^(N*B) possible configurations, and that quickly becomes a too-big number for all of the systems on which I work.

Objective functions and math are wonderful but they work on performance models and it is a challenge to accurately model a complex system. Proofs share that challenge.

Fortunately, there is clever work in progress to get better results without doing exhaustive search. Alas those results are better, not optimal. And now an unsponsored ad -- I appreciate that OtterTune is not only doing interesting work but they have been careful about the distinction between better and optimal.

Wednesday, October 14, 2020

Comments on: The Unwritten Contract of Solid State Drives

The Unwritten Contract of Solid State Drives by Jun He, Sudarsun Kannan, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau was published in EuroSys 2017. The paper is interesting and worth reading. I hope for more papers on this topic and recently shared a review on another paper from UW-Madison on this topic. This tries to be a brief review of the paper. I only review papers that I think are useful to me and others.

I didn't notice this paper until recently. While I wish I didn't miss interesting papers like that, it is hard for me to keep up with so many great papers getting published. Regardless, I assume that the RocksDB team is open to discussions with people doing research in this space. I am too and this post was half-serious.

5 rules

The paper presents five rules that are frequently true for an SSD and explains that all rules aren't true for all SSDs. Following the rules often leads to better SSD performance. 

The objective functions in the paper are throughput and write amplification. Quality of service (QoS) wasn't included and maximizing throughput can hurt QoS. Many applications want maximum throughput that respects QoS where QoS can be defined by p99 and worst-case response times. So I hope that a future paper includes QoS. The paper explains that it focuses on internal states of the SSD rather than end-to-end performance, but QoS can also be measured via internal states because that is where many stalls happen (read delayed because slower program or erase in progress).

There is also a difference between using all device throughput that a workload needs vs using all of the device throughput. Saturating device throughput isn't a goal. Getting as much throughput as needed for a workload is a goal. A device offers several things -- capacity, throughput and low-latency operations. It is hard to saturate both capacity and throughput while not hurting QoS. 

The rules are:

  • Request scale - issue large requests or concurrent small requests to get more throughput and benefit from internal parallelism of the SSD
  • Locality - access with locality (temporal and/or spatial). Possible benefits include reduced translation cache misses, reduced data cache misses and reduced write-amp from flash GC
  • Aligned sequentiality - write files sequentially using large write requests. For an FTL that can do hybrid block mapping this requires fewer entries in the translation table.
  • Grouping by death time - do writes so that all data in an erase block is discarded at the same time. 
  • Uniform lifetime - structure data so that all writes have a similar lifetime. Note that this is a stricter version of grouping by death time and unlikely to be followed by a DBMS except for workloads that are write-once and then read-only.

The paper states that an SSD might not require following all of the rules. Alas, it can be difficult to determine which rules should be followed for a given device, or even a given firmware version.

A summary of the Linux block IO layer is useful background reading.

5 Rules and an LSM

Can an LSM follow these rules?
  • Request scale
    • Compaction reads with RocksDB are a (compressed) block at a time and the uncompressed block size is likely ~4k, ~8k or ~16k. The block sizes are approximate, usually < the target size and not aligned. The compaction_readahead_size option in RocksDB can be set to get large read requests otherwise you rely on filesystem readahead and it is risky to make read_ahead_kb large because that is per-device.
    • Compaction writes with RocksDB are a page at a time. Each SST is written sequentially and fsync is called when the SST is full (usually ~64M). Async write-behind can be enabled via the bytes_per_sync to avoid a storm of writes on fsync and to get large write requests of size bytes_per_sync.
    • WAL writes can use the wal_bytes_per_sync option
    • For user reads, many concurrent users are the easy way to get concurrent small reads. I am less certain about options for doing prefetch or getting parallel IO requests for a user query. Back when I did small data at FB getting concurrent IO for one user request wasn't a priority because the focus was sharing a device across many users.
  • Locality
    • Compaction reads & writes respect this as they read & write files sequentially. 
    • Whether user reads respect this depends on the workload.
  • Aligned sequentiality - compaction writes respect this
  • Grouping by death time
    • I know this via the multistream effort and even collaborated on an evaluation. The results in my eval weren't significant so I should read papers by others to understand where it was a big deal.
    • Grouping by death time requires an API like multistream. With that it is easy to group for an LSM. The upper/smaller levels of the LSM tree have shorter lifetimes than the lower/larger levels.
    • It is usually impossible to do this without an API because I don't know how many erase blocks are concurrently open, nor do I know the size of an erase block. I risk using the wrong name as I am not an expert on this, but I mean the logical erase block that is managed by the FTL, not the physical erase block that is per NAND chip. The logical erase block is striped over many NAND chips for a few reasons, including to prevent the loss of data when a NAND chip fails. This structure is rarely explained by vendors.
  • Uniform lifetime - I doubt any DBMS can follow this except for data that is write once.

Workloads

The paper then runs workloads for SQLite, LevelDB and RocksDB using SSD simulators to determine whether the rules are followed. The workloads used db_bench for RocksDB however there weren't enough details on the command line options or RocksDB configuration. One of the comments mentions that most of the activity was between the L0 and L1 in the LSM tree and I wonder if the databases were too small and didn't have enough LSM tree levels.

Observations

The paper then has many interesting observations on whether the rules were respected and explains cases where they weren't. Things that prevent the rules from being followed include the DBMS, the filesystem and the workload. I have comments on some of the observations:
  • Observation 1 - Application log structure increases the scale of write size. This is expected with an LSM. I appreciate that this paper distinguishes between IO patterns for the application (LSM writes SST files sequentially) and for the device (concurrent compaction means that a device interleaves large write requests from multiple files).
  • Observation 2 - The scale of read requests is low. It is hard to know whether this is a function of the system software (bottlenecks & missing features) or the workload (working set cached, not many concurrent users).
  • Observation 3 - RocksDB doesn't consume enough device bandwidth. This is hard to appreciate without more details on the benchmark setup. Note that I don't care about saturating device throughput unless QoS is also a goal.
  • Observation 4 - Frequent data barriers limit request scale. The obvious one is fsync/fdatasync although concurrency usually overcomes that. For reads, prefetching and readahead can overcome but this wasn't a priority for the workloads I cared about because they had many concurrent users. There are things that can be done, and might already be in RocksDB, to get more concurrent IO reads to a device when apps have low concurrency.
  • Observation 5 - Linux buffered I/O limits request scale. Fortunately, RocksDB can also run with O_DIRECT. I wasn't clear on the claim that read_ahead_kb limits the size of a read. Perhaps this is true for reads smaller than read_ahead_kb. Perhaps I can ask the block IO expert. For compaction, the compaction_readahead_size option can get larger reads for compaction. For scans I am not sure if RocksDB has a way to get readahead (multi-page reads) when needed.
  • Observation 8 - be wary of deferring discard. The variety of discard performance across devices is large and not well documented. I hope more research is done on this.
  • Observation 9 - SSDs demand accurate and aggressive prefetching. I am wary of inaccurate and aggressive prefetching. An LSM can benefit from prefetching for compaction reads (RocksDB has an option for that). Prefetching for an LSM during user reads is interesting. For a point query prefetching the lower levels of the LSM tree can be wasted work when the data is found in the upper levels of the tree. Prefetching for a scan is easier because data from all levels must be read. However, the amount of readahead to use during a scan is more complicated because the LSM usually isn't told how large the scan will be (first 10 rows vs all rows). There might be options in RocksDB to get parallel reads on behalf of a single user query.
  • Observation 17 - application log structuring does not reduce garbage collection. I am skeptical about this. Also, the paper shows that it doesn't prevent GC. It does not show that it doesn't reduce it. I have results from 2016 to show that flash GC write-amp from MyRocks was less than from InnoDB. Of course, that is specific to one device. Regardless I appreciate that the paper explains some of the issues including that a device interleaves large write requests from many files with concurrent compaction. I am wary of the misleading sequential writes message that is repeated in too many papers.

Thursday, October 8, 2020

Comments on: Optimizing Databases by Learning Hidden Parameters of Solid State Drives

I just read an interesting paper: Optimizing Databases by Learning Hidden Parameters of Solid State Drives by Aarati Kakaraparthy, Jignesh Patel, Kwanghyun Park and Brian Kroth. Reading the paper was a good use of my time. I hope there are more papers on this topic. This isn't a review of the paper but a list of questions and comments on the paper. The paper is well written and easy to read, so a review by me wouldn't add much.

Storage Performance

The reason for the paper is that the performance profile of an SSD is complicated, and complicated in different ways than the performance profile of a disk. It might be easy to predict throughput from a disk given a transfer rate, rotational latency and seek latency. But that gets more complicated when using the outer vs inner tracks of the disk. And it gets even more complicated with concurrency and the impact of the IO scheduler.

Predictions for a disk are easy but complicated (that might sound odd), while predictions for an SSD are less easy and more complicated and many of the details aren't explained by the SSD vendors. The paper explains how to derive some of the design decisions via testing and then uses that to improve performance for MariaDB and SQLite.

If you care about performance an SSD is a black-box. SSD vendors might feel the same way about the workloads that run on SSDs. I wrote innosim (InnoDB IO simulator) a long time ago to share a workload that I care about with SSD vendors. It is fun to see results from it on the web.

The paper is about direct attach SSD. I wonder if there is an opportunity for interesting research on cloud SSD. With EBS, requests <= 256KB count as one IO, so you will get more throughput if you can figure out how to use larger requests.

Comments on the paper

It is possible that I missed a few things in the paper, and I am happy to be corrected.

The desirable write request size is determined by creating files using writes of size X KB (X <= 512) and then computing the read latency for reads of size 1MB per the pseudo-code for Experiment 1. The value of X that provides the minimum read latency is the desirable write request size.
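
Here is a rough sketch of Experiment 1 as I understand it; it is not the authors' code. It uses buffered IO plus posix_fadvise to evict the file before reading, while the paper presumably uses direct IO and more care, and the file path is a placeholder.

import os, time

def read_latency_after_writes(path, write_kb, file_mb=1024, read_kb=1024):
    # Write the file using write_kb-sized requests
    buf = os.urandom(write_kb * 1024)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written, target = 0, file_mb * 1024 * 1024
    while written < target:
        written += os.write(fd, buf)
    os.fsync(fd)
    os.close(fd)
    # Drop the file from the page cache, then time read_kb-sized reads
    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    reads, start = 0, time.monotonic()
    while os.read(fd, read_kb * 1024):
        reads += 1
    os.close(fd)
    return (time.monotonic() - start) / reads

# Sweep candidate write request sizes; the desirable size minimizes read latency
for x_kb in (4, 16, 64, 256, 512):
    print(x_kb, read_latency_after_writes("/data/ssd_probe", x_kb))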

  • The desirable size is likely to be much larger than 16kb but if you are using an update-in-place B-Tree for OLTP then you are unlikely to do that as leaf pages will be small (<= 16kb). Write requests can be larger with an LSM (SST files are > 1MB) or a copy-on-write B-Tree.
  • I am not sure about using min latency for 1MB reads as the objective function to determine the best write request size. It is unlikely that 1MB reads will be done from a B-Tree. With an LSM that will occur during compaction, although I am not sure that optimizing for compaction reads is a goal. Large reads are common with a heap-organized table (Postgres), especially when doing analytics. So I think more nuance is needed and that could be included in another paper.
  • What is the impact from concurrent writes? For InnoDB either page writeback or doublewrite IO will be concurrent with binlog or redo IO. For MyRocks compaction writes will be concurrent with binlog or WAL IO. Assuming it were possible to use the desirable write request size, how is this impacted by concurrent writes because requests can be interleaved by the block IO layer and by the SSD. Note that if the SSD erase block size is <= 64MB then it might be nice to let an LSM not share erase blocks between SST files as not sharing means there is nothing to copy-out of the erase block during GC (all or none of the erase block is in use). I probably should re-read this document once a year as Jens Axboe does a great job explaining what happens in the block IO layer for large requests.
  • What is the impact from GC? Writes done today can be relocated tomorrow by flash GC.
I hope more is written to explain the difference between desirable write request size and stripe size and now I will try to read other papers to better understand stripe size. From table 1 they are equal for 2 of the 4 devices tested. They aren't equal for SSD-S and I think that was a judgement call based on the graphs in figure 3.

The paper didn't try to discover the desirable concurrency level (number of channels) as that has been addressed by previous work. But I hope they include that in future work.

IO patterns and optimization goals vary by file structure. For MySQL with InnoDB the important IO is listed below. Given that reading redo and binlogs from storage aren't (or shouldn't be) the common case the desirable write request size for them is one that minimizes write-amp:
  • InnoDB database files - writes of size 16 KB (sometimes 8KB), uses either buffered IO or O_DIRECT, single-page reads and writes are frequent
  • InnoDB undo - I will let someone with more expertise summarize this
  • InnoDB redo - writes of size 512*x, frequent fsync, buffered IO, file is pre-allocated, rarely read. We suffered from redo read IO back in the day for InnoDB redo when it didn't stay in cache (large redo isn't free). The problem was that writing the first 512 bytes of a file system page not in cache turns into read-modify-write. I don't know the status of this today.
  • InnoDB doublewrite buffer - large writes, probably a small multiple of 512 KB to the same location in the system tablespace, uses either buffered IO or O_DIRECT. This might have changed recently in MySQL 8.
  • binlog - small writes of any size, frequent fsync, file grows with each write, reads are usually done from the OS page cache, not from storage

TRIM is another black box and a topic for future work. TRIM performance varies across devices. I was spoiled by early Fusion IO devices where TRIM was fast. Modern devices have differing capacities for the number of files and number of MB that can be trimmed per second. There will be stalls when those rates are exceeded. Alas, these capacities are not explained. As a workaround, RocksDB has an option to rate limit file deletion. Domas Mituzas has written about and shared code to manage file deletion done by value-added processes that run on DBMS HW to keep them from hurting DBMS QoS.

Thursday, September 17, 2020

Performance results and DBMS experts

I used to assume that many performance results couldn't be trusted because the people sharing the results lacked expertise in the system(s) under test. I am still wary of results in conference papers that have results for too many DBMS and my wariness is a function of the number of DBMS considered. 

But my opinion on this has changed. Now I prefer to focus on the results that the typical performance-sensitive deployment will get. By performance-sensitive I mean a deployment that cares enough about performance to spend some money and/or time to get it. This isn't a deployment that uses a default configuration but it might not be able to afford full time performance teams and months of tuning.

There are far more deployments than experts, so it is safe to assume that expertise is in short supply. If we are to assign blame then I prefer to blame the DBMS for being too complex, having too many sharp edges and not being adaptive.

Lack of expertise in a DBMS isn't a mark of shame. Resources are finite (time to learn, brain capacity).  Some of us can afford to fill our heads with DBMS trivia at the expense of learning other useful skills. But most people need to focus on things other than the DBMS to get their jobs done.

There is a big opportunity to build systems that provide better performance without excessive tuning. While ML seems to be the primary motivator for current research, I expect non black-box approaches to deliver interesting results as well. And by non black-box approaches I mean designing algorithms that consider performance goals and know how to adjust to achieve them. This assumes they have some understanding of performance models. Can we call these self-aware index structures?

Managed databases are also an opportunity for scaling expertise. The service provider has the expertise and has a better chance of applying it across their many customers.

Friday, September 4, 2020

When is it OK to lose a commit?

I am thrilled that OSS DBMS are better at avoiding the loss of commits thanks to replication that can do more than async. But there are good reasons for some workloads to not pay the cost (in latency) from such features.

By losing a commit here I mean the loss of the most recent N commits, not the loss of any commit from the past. By workload I mean one of the applications using the DBMS and there can be a variety of workloads using the DBMS. For some DBMS, all applications must use the same level of commit durability. But that is an implementation artifact, not something fundamental to the algorithms we implement.

A workload for which it is OK to lose the most recent N commits is inserting into a DBMS from a log file. If this has limited concurrency then commit latency can be a problem. If commits are lost then they can be replayed. This assumes that the application 1) can determine when a commit has been lost and 2) figure out where to begin the replay.

To determine when a commit has been lost the application queries DBMS state while connecting and while reconnecting. One reason to reconnect is after a failover and a tradeoff for faster commit is the potential for losing the last N commits on failover.

In the simple case there is one single-threaded application doing inserts from a log file into one DBMS table. It is feasible but not trivial to make sure there really is at most one application running in this case -- because distributed system challenges extend to applications. But I ignore that complexity for now.

The DBMS state is found in either the table to which the inserts are done or a separate state table. If there is a way to find the last inserted row, then that tells you where to restart, assuming you can map the value of that row back to an offset in the source (log). Otherwise, a separate state table can maintain the log offset from which inserts should continue and the state table is updated in the transaction that does bulk inserts from the log.

Update

I just remembered something that I never understood. Ingres Replicator does async replication via 2pc between the primary and a secondary -- see here and here. I never understood why. That is an expensive way to replay a log (from the primary) on a secondary. The alternative is for a secondary to track its state independent of the primary, replay transactions on a secondary (commit locally on the secondary, update state table on the secondary) and some time after commit the state (queue of transactions to be replayed) can be updated (truncated).


Thursday, August 20, 2020

iostat output format changes again

The format for iostat differs across the last 3 Ubuntu LTS releases (16.04, 18.04, 20.04). Some of my scripts scrape that output and now it is time to update them to detect the new format (a sketch of a header-driven parser is at the end of this post). I wrote about this before. There are other ways to get this data so this is a risk I accept for being lazy. The changes are from added columns, removed columns, renamed columns and reordered columns. I wish the last two weren't done. There are a few more projects out there that scrape iostat output and must update how they parse it.

In Ubuntu 20.04 the new columns are for trim, which is useful. The svctm column was removed and fields were reordered. I assume that the bytes written metrics no longer include bytes trimmed -- which inflated the write rate by 2X for anything that does frequent unlink (like RocksDB).

New fields in 20.04 are:

  • d/s - number (after merges) of discard requests completed per second for the device
  • dsec/s (dkB/s, dMB/s) - number of sectors (KB, MB) discarded for the device per second
  • drqm/s - number of discard requests merged per second that were queued to the device
  • %drqm - percentage of discard requests merged together before being sent to the device
  • dareq-sz - average size (in kilobytes) of the discard requests that were issued to the device
  • d_await - average time (in milliseconds) for discard requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them
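
Here is the header-driven parser sketch mentioned above. It keys each column by the name printed in the header line instead of by position, so added, removed or reordered columns don't silently break the scraper. Nothing below is specific to one iostat version; the column names are whatever the installed iostat prints.

import subprocess

def parse_iostat(text):
    # Map device name -> {column name: value} for 'iostat -x' output
    devices, header = {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):  # 'Device:' on older versions
            header = fields[1:]
        elif header and len(fields) == len(header) + 1:
            devices[fields[0]] = dict(zip(header, fields[1:]))
    return devices

out = subprocess.run(["iostat", "-x"], capture_output=True, text=True).stdout
for dev, stats in parse_iostat(out).items():
    # d/s and dkB/s are the 20.04 discard columns and are absent on older versions
    print(dev, stats.get("r/s"), stats.get("w/s"), stats.get("d/s"), stats.get("dkB/s"))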

Sunday, August 16, 2020

Review of PrismDB: Read-aware LSM Tree for Heterogeneous Storage

The PrismDB paper explains how to use heterogenous storage with an LSM. This is my review of the paper; I don't review papers that aren't interesting and worth reading. I might be biased given that I worked on RocksDB and MyRocks and helped with the persistent read cache for tiered storage in RocksDB, for which there is an interesting paper. For the rest of this post I use the word tiered instead of heterogenous as that is easier to type.

The paper is about one of my favorite topics, efficient performance, where the goal is to get good enough performance and then optimize for efficiency. In this case the goal is to use mostly lower-perf, lower-cost storage (QLC NAND flash) with some higher-perf, higher-cost storage (NVM, SLC/TLC NAND). The simple solution is to use NVM for the top levels (L0, L1), TLC for the middle levels and QLC for the max level of the LSM tree. Alas, that doesn't always work out great as the paper shows.

An LSM tree with leveled compaction organizes keys by write recency. When n < m then a key/value pair in Ln is likely to have been written more recently than a key in Lm. But it isn't clear that location in the LSM tree suggests something about read recency or frequency. Thus the simple tiered storage approach mentioned in the previous paragraph might not help much for reads. 

Paper summary

The idea in the paper is to pin hot key/value pairs higher in the LSM tree and then use higher-perf, higher-cost storage for those levels. Thus, storage read latency for such data will be reduced. This is implemented via three components: tracker, mapper, placer.

  • The tracker identifies keys that are frequently accessed. This is done via CLOCK with 2 bits/key and implemented via a concurrent hash map (a generic sketch of a 2-bit tracker follows this list).
  • The mapper decides which keys are hot. The mapper can achieve the goal of at most X% of keys on a given level getting pinned. Given such a constraint the mapper will determine which keys can get pinned in that level. The mapper uses the information from the concurrent hash map managed by the tracker.
  • The placer implements pinned compaction, explained below, to implement the decisions made by the mapper.
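
For background, here is a generic sketch of a 2-bit CLOCK tracker, referenced from the tracker bullet above. It is not PrismDB's implementation: a real tracker would use a concurrent map, bounded memory and a rotating clock hand rather than a full sweep.

class ClockTracker:
    # Counters saturate at 3 on access and are decremented on each sweep, so
    # keys that are accessed recently and frequently keep higher counts.
    def __init__(self):
        self.count = {}

    def access(self, key):
        self.count[key] = min(self.count.get(key, 0) + 1, 3)

    def sweep(self):
        # Age every key and drop keys that reach zero to bound memory (assumption)
        self.count = {k: v - 1 for k, v in self.count.items() if v > 1}

    def hotness(self, key):
        return self.count.get(key, 0)

t = ClockTracker()
for k in ("a", "a", "a", "b"):
    t.access(k)
t.sweep()
print(t.hotness("a"), t.hotness("b"))  # prints: 2 0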

Pinned compaction is a new compaction strategy. For leveled compaction from Ln to Ln+1, pinned compaction will keep keys in Ln when such keys are listed for pinning on that level. Pinned compaction will also pull up keys from Ln+1 to Ln for the same reason.

Pinned compaction benefits from weighted scores that determine which SSTs are most likely to benefit from pinned compaction (have keys that should be pinned). The scores determine which SSTs to use as input for the next round of pinned compaction.

Access distributions

My biggest concern about the paper is whether production access patterns will benefit from this approach. Academic work on this has been hampered because not enough has been published about production access distributions. It is hard to work on cache management without knowing the access distributions.

The paper has two claims on this point:

  1. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers.
  2. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage.
I agree that objects in the higher levels are likely to be updated more often. I am less certain that this extends to reads. Most of the benchmarks from the paper use YCSB with Zipfian and the claims will definitely be true in that case. The claim about reads is more likely to be true for workloads with many range reads especially when prefix bloom filters cannot be used.

For point #2, my experience with production and benchmark workloads has been that per-level compaction write rates are similar (MB/s written to storage per level). But this can be true without contradicting claim 2 above as the larger levels have more objects. Regardless the per-level write rates must be considered to make sure a device will meet the desired lifetime. 

By similar write rates I mean they tend to be within a factor of 2 and here is an example from Linkbench -- see the Write(GB) column.

Deployment

I know that conference papers don't have to solve my production concerns, but I want to share more background on the topic.

How can I deploy this in production? If I am a cloud customer then my storage form factors are different from on-prem. For AWS I can choose from ephemeral, EBS disk, EBS SSD, EBS SSD with PIOPs and S3. I probably don't have to worry about SSD endurance in the cloud but if my workload does many writes then I have to pay for that except when using ephemeral storage. In the cloud there are still different price & performance tradeoffs and I assume work like PrismDB is relevant there. 

For on-prem it isn't easy to get multiple storage devices (NVM & TLC & QLC) into one server as web-scale operators try to reduce the number of items on the menu. Even if they allow this the question is whether I can get something with small amounts of NVM per server. Disaggregated storage might solve the small amounts of storage per server problem but with disagg I have to worry about new failure modes when all servers in a rack are sharing access to the same NVM device. I think the most likely solution is for storage vendors to provide a hybrid device that includes some NVM with more MLC/TLC NAND and a lot of QLC NAND (they already do this to some extent).

Assorted comments

  • Does the mapper create a list of keys (to be pinned) per level or is there a global list? 
  • Does pinned compaction risk creating SSTs with a small number of keys? I'd rather not have an LSM tree with too many small SSTs.
  • Do the decisions about pinning avoid the double caching problem? I am not sure I need to pin objects that are likely to be in the block cache.
  • Is the CPU overhead of searching the to-be-pinned data structure significant? This might depend on whether pinned compaction is the only type of compaction run. The paper suggests regular compaction is also run. 
  • For the two points above, using a bloom filter per level for the keys to be pinned might help.
  • Is there a mechanism for unpin? If regular compaction is still run then is that the way in which pinned keys get unpinned? Or does recomputing the to-be-pinned list over time serve to unpin keys that used to be hot and are now cold?
  • Logical backup and long scans for batch queries are things that can interfere with the computation of the to-be-pinned list. In MyRocks hints can be used as part of a SELECT statement to suggest that data for the query should not be pulled into the block cache.


Thursday, August 6, 2020

Over fetching in a DBMS

By over fetching I mean fetching irrelevant documents or fields while processing a query. By fetching I mean the data examined by the DBMS, not the result set returned to the user.  By irrelevant I mean documents and fields that won't change the query result if they don't exist. This also applies to a SQL DBMS after substituting column for field, row for document and collection for table. In the rest of this post I mostly use MongoDB names (documents, fields and collections). I might reuse some of the terminology from a post on predicates from Use The Index, Luke. That post is interesting.

I can refine the over fetching definition into four parts, plus a catch-all other category:

  • co-located
  • single-table
  • join
  • aggregation
  • other

Co-located

Co-located over fetching occurs when irrelevant fields are read from a document or indexes. Columnar storage is a one way to avoid this for analytic workloads. Sometimes a covering index can reduce this for OLTP workloads.

For a collection C with fields a1, a2, ..., aN then db.C.find({}, {a1:1, a2:1, _id:0}) does co-located over fetching when a collection scan is used. An index on (a1, a2) avoids over fetching because the fields a3 ... aN are not in the index. The SQL version of the query is select a1, a2 from C.

Co-located over fetching also occurs when an index has fields that are not needed by the query. For the query in the previous paragraph, if there is an index on (a1, a2, a3, a4) and the index is used for the query then over fetching occurs because a3 and a4 are read by the DBMS but not needed.

Single-table

Single-table over fetching occurs when irrelevant documents are read. A good index is one way to avoid this, especially for OLTP workloads. 

For this example the collection C has the documents:
  • { _id: 0, a1:7, a2:8, a3:9 }
  • { _id: 1, a1:7, a2:18, a3:19 }
  • { _id: 2, a1:27, a2:28, a3:29 }
The query db.C.find({a1: 7}) does single-table over fetching when there isn't an index on a1. In that case a collection scan is done that examines all docs but the doc with _id:2 is irrelevant. With an index on a1 then only the docs with _id:0 and _id:1 are examined and there is no over fetching. The SQL version of this query is select * from C where a1=7.

Single-table over fetching can occur with indexes. For the query db.C.find({ a1: 7, a2: 8}) with an index on a1 then docs with _id:0 and _id:1 are examined but _id:1 is irrelevant because the predicate on a2 excludes it. An index on (a1, a2) avoids the over fetching. The SQL version of this query is select * from C where a1=7 and a2=8.

Join

Join over fetching occurs when join predicates filter documents. The docs that were filtered by the join predicate don't change the query result and figuring out how to avoid examining them prior to the join might improve performance.

With MongoDB $lookup is a left outer join and over fetching cannot occur for the input documents. But it can occur for docs in the from collection when there is no index, or no good index, on it. But explain doesn't show the access path for the from collection -- until SERVER-22622 is fixed.

Whether join over fetching occurs in SQL depends on the join (inner, left outer, right outer, etc). The example uses tables C1 and C2 and the query select * from C1, C2 where C1.x1 = C2.y1
  • C1 has the rows: (x1:1, x2:2), (x1:11, x2:12), (x1:21, x2:22)
  • C2 has the rows: (y1:1, y2:2), (y1:11, y2:12), (y1:31, y2:32)
Assume this is evaluated by scanning C1 and then probing C2 (nested loops join). If there is no index on C2.y1 then join over fetching occurs for C2 because (y1:31, y2:32) is examined by a table scan but filtered by the join predicate. Join over fetching occurs for C1 whether or not there is an index on C2.y1 because (x1:21, x2:22) is examined but filtered by the join predicate.
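A SQL sketch of the example, assuming a nested loops join that scans C1 and probes C2 (names are mine):

  create table C1 (x1 int, x2 int);
  create table C2 (y1 int, y2 int);
  insert into C1 values (1, 2), (11, 12), (21, 22);
  insert into C2 values (1, 2), (11, 12), (31, 32);
  -- the index limits join over fetching for C2 during the probe;
  -- rows of C1 filtered by the join predicate are examined either way
  create index i_y1 on C2 (y1);
  select * from C1, C2 where C1.x1 = C2.y1;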

Aggregation

Aggregation over fetching occurs when docs are filtered by the aggregation operator semantics. The obvious ones are $max and $min in MongoDB (max and min in SQL).

For this example the collection C has the documents { _id: 0, a1:7, a2:8 }, { _id:1, a1:7, a2:9 } and the queries are:
  • MongoDB: db.C.aggregate([ { $group : { _id: "$a1", maxval : { $max : "$a2" } } } ])
  • SQL: select max(a2), a1 from C group by a1
Aggregation over fetching occurs for the doc with _id:0 because it doesn't have the max value of a2 for the group with a1:7. An index on (a1, a2) can avoid the aggregation over fetching but you should consult the DBMS documentation to understand whether that optimization has been implemented. MySQL has loose index scan for min/max and distinct, Postgres has recursive CTE (here and here) while MongoDB has DISTINCT_SCAN (for $first but not for $min or $max - see SERVER-40090).
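A SQL sketch of the example (whether the index is used this way depends on the DBMS, as noted above):

  create table C (_id int primary key, a1 int, a2 int);
  insert into C values (0, 7, 8), (1, 7, 9);
  create index i_a1_a2 on C (a1, a2);
  -- with a loose index scan only the largest a2 per a1 group is read;
  -- otherwise every row in each group is examined
  select a1, max(a2) from C group by a1;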

Other

Sometimes index-only queries aren't as index-only as you want them to be. Postgres relies on bits being set in the visibility map or it will fetch the base row from the heap. InnoDB relies on a different mechanism but there are cases where it too must fetch base row images from the PK (clustered) index for queries that appear to be index only.
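For Postgres, one way to see whether an index-only scan really avoided the heap is EXPLAIN ANALYZE (table and index names here are mine):

  create table t (id int primary key, a int);
  create index i_a on t (a);
  -- vacuum sets visibility map bits so the scan below can skip heap fetches
  vacuum t;
  explain (analyze, buffers) select a from t where a between 1 and 100;
  -- look for "Index Only Scan" and a low "Heap Fetches" count in the output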

Wednesday, July 22, 2020

Sustaining high insert rates despite secondary indexes

If you want to sustain high insert rates in the presence of secondary indexes then there are two approaches at a high level. The first is to buy enough RAM to cache the entire database but that gets expensive. The second is to avoid storage reads during secondary index maintenance.

So now the problem shifts to methods for avoiding storage reads during secondary index maintenance and there are a few popular ways to do that.
  1. Partition the database so that only a few of the partitions get inserts at any point in time and then use local (per-partition) secondary indexes. If the partitions are small enough to fit in memory then traditional indexing (B-Tree) can be used, because the per-partition secondary indexes fit in memory for the partitions that are getting inserts. TimescaleDB is an example of this.
  2. Use an LSM, or something like an LSM, that supports read-free index maintenance for non-unique secondary indexes. MyRocks is an example of that.
  3. Don't use secondary indexes because the target workloads expect to scan large amounts of data.
The interesting thing about the first approach is that you end up with a tree per partition and then a query that uses the per-partition secondary index might have to search many trees (fanout or broadcast queries) unless there are other query predicates that enable enough partitions to be pruned.

Searching trees of trees sounds a lot like what an LSM does, which is a nice way of stating that the CPU overhead of an LSM for reads can be reproduced by the first approach. In an LSM with leveled compaction there is a sorted run (or tree) per level. However an LSM benefits from a bloom filter for point queries and might benefit from a prefix bloom filter for range queries. There are also several interesting research papers that show how to use something like a bloom filter for range queries -- see papers on SuRF and Rosetta. I am not sure whether systems that use the first approach (partition with local secondary indexes) also provide something like a bloom filter to match what an LSM can do. But if most partitions eventually become read-only then there are opportunities for being clever.
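As a sketch of the first approach, Postgres declarative partitioning with a local secondary index looks like this (names and ranges are mine; TimescaleDB automates the partition management):

  create table events (ts timestamptz not null, device_id bigint, val float8)
    partition by range (ts);
  create table events_2020_07 partition of events
    for values from ('2020-07-01') to ('2020-08-01');
  -- an index created on the parent is created on each partition (a local index)
  create index on events (device_id);
  -- with no predicate on ts this query probes the device_id index of every partition
  select * from events where device_id = 42;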

Updates
  • this post has more detail on avoiding reads in a write-optimized index structure -- RocksDB merge operator, InnoDB change buffer and more
  • I would put things like Postgres BRIN (min/max values per block) into the third category (do scans but more efficiently)

Postgres, vacuum and the insert benchmark

I have been running benchmarks with Postgres for a few months and am slowly learning enough about Postgres to trust the results that I get. When I first read about Postgres vacuum I didn't think it would be important for the insert benchmark because the workload is insert-only. This post explains why I was wrong.

Vacuum arrived in the original POSTGRES storage system in the 1980s. Today it serves several purposes: 1) reclaim space for old versions of rows that are no longer visible, 2) update statistics for the query planner, 3) update bits in the visibility map to help index-only queries and 4) prevent transaction ID wraparound.

I didn't expect vacuum to help much with the insert benchmark because it doesn't do updates or deletes and there is no space for vacuum to reclaim. Also, query plans were OK and there weren't enough transactions for wraparound to be an issue.

What I missed was the impact of the visibility map on the index-only range scans run during the insert benchmark. These bits are only set by vacuum and if the bit is not set for a heap page, then the base row must be read from the heap page during an index-only scan for any index entries that point to that heap page. The impact from reading the base row is significant when the working set is not cached but still visible when the working set is cached.

Fortunately, the insert benchmark pattern of inserts seems to be friendly to Postgres in that a heap page will become full after some inserts and then doesn't get more inserts. So once the visibility map bit gets set it remains set for that heap page.
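One way to confirm that the visibility map bits are set is the pg_visibility extension (the table name is hypothetical):

  create extension if not exists pg_visibility;
  -- all_visible counts heap pages whose visibility map bit is set;
  -- index-only scans avoid heap fetches only for those pages
  select * from pg_visibility_map_summary('t1');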

Details

Autovacuum is triggered for a table based on the number of updates and deletes done to the table. So while an insert-only workload can benefit from vacuum when it also does range scans, the inserts won't trigger vacuum. I haven't read the code to confirm that inserts don't trigger autovacuum. The main docs aren't clear: the routine vacuuming section states that "When enabled, autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples" while the docs for autovacuum_vacuum_threshold state that it is only updates and deletes: "Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table."
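One way to check whether (auto)vacuum has run for a table, and to run it explicitly when the workload won't trigger it (the table name is hypothetical):

  select relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
  from pg_stat_user_tables
  where relname = 't1';
  -- run vacuum explicitly since inserts alone might not trigger autovacuum
  vacuum t1;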

I now run the insert benchmark as a sequence of steps: load some data with only PK indexes, create secondary indexes, load more data, then read+write. The read+write step does short range scans as fast as possible and the writes are rate-limited inserts where there is a target on the insert rate per second. As part of this I optionally run vacuum after creating the secondary indexes. Each read+write test runs for 1 hour and vacuum is started at the end of each hour. My scripts wait for vacuum to finish after creating the indexes but don't wait for it during the read+write tests. That doesn't matter much because vacuum is fast during the insert benchmark.

Vacuum internals

This is a brief description. Hopefully it is truthy. The key point is that vacuum can require full index scans for every index of the vacuumed table. That can take a long time. Alas, this isn't an issue for the insert benchmark because it is insert-only and doesn't create old versions of rows so there are no full scans of all indexes during autovacuum.

The Internals of Postgres web-site is a great place to start after reading the online Postgres docs if you want to learn more -- see chapter 5 and chapter 6. Note that chapter 6 of the Internals site lists step 3 before step 2 and I need to figure that out before I write too many more blog posts on this topic. There are many useful blog posts on this topic including this blog post.

I don't think vacuum in original POSTGRES did full index scans. I have read some of the discussion that explains why a full index scan is done today but I didn't save links to that. My preference is for vacuum to do index probes rather than full index scans to reduce the time between consecutive vacuums for the same table.

Regardless, let me explain some of the work that vacuum can do to make it obvious that vacuum for a table can take a long time, and it does when I run Linkbench.
  1. Scan heap pages for which visibility map bits are not set. From each heap page, if a row is an old version (because of a delete or update) and not visible to any current or future transaction then copy the row's CTID (into an array?)
  2. If CTIDs were found in the previous step then scan all secondary indexes for this table and for each index entry determine whether its CTID is in the array from step 1. If yes then reclaim the space for that index entry.
  3. Repeat step 1 but this time reclaim the space for the dead rows.
Finally, the memory used to buffer CTIDs is determined by maintenance_work_mem. If that limit is reached then the steps (and full index scans) are repeated.
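A sketch of how I would bound the number of index-scan passes for a manual vacuum (the value and table name are examples, not recommendations):

  set maintenance_work_mem = '1GB';
  vacuum verbose t1;
  -- if the CTID buffer fills before the heap scan finishes then the index-scan
  -- pass is repeated; the verbose output shows the per-index work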


Friday, July 17, 2020

Review of -- TimescaleDB: SQL made scalable for time-series data

This is a short review of TimescaleDB: SQL made scalable for time-series data. My focus is on indexing and I ignore many interesting things described in that paper. My goal in reading it was to understand possible optimizations for workloads like the insert benchmark v3. I also read some of the online TimescaleDB docs. This blog post is truthy as I am not an expert on TimescaleDB. Hopefully I haven't made too many mistakes. The paper is several years old and TimescaleDB has improved a lot since then.

The paper does a great job asserting what they know to be true about the workloads for which TimescaleDB is targeted.

Assertion 1:
Time-series data is largely immutable. New data continually arrives, typically corresponding to the latest time periods. In other words, writes primarily occur as new inserts, not as updates to existing rows. Further, while the database needs to be able to support backfill for delayed data, writes are made primarily to recent time intervals.
Assertion 2:
Workloads have a natural partitioning across both time and space. Writes typically are made to the latest time interval(s) and across the “partitioning key” in the space dimension (e.g., data sources, devices, users, etc.). Queries typically ask questions about a specific time series or data source, or across many data sources constrained to some time interval. Yet the queries might not be limited to a particular metric, but may regularly select multiple metrics at once (or use predicates that rely on multiple metrics).
Storage
Rows are stored in hypertables rather than tables and hypertables are composed of chunks. A hypertable must have a time column and can have a partition column. The paper states that the partition column is required for clustered TimescaleDB. But the online docs have more nuance.

Chunks are right-sized (their name for it) to fit in memory. Without a partition column the hypertable is range partitioned on time into chunks. With the partition column the data is distributed by hash on the partition column and then range on the time column. In SQL DBMS partitioning terminology this is composite partitioning with: hash(partition), range(time).

But this isn't traditional partitioning because it is dynamic and automatic. Nobody has to run DDL to add, drop and change partitions. That is one way they add value.

Compression has been added since the paper was published. It is described here. I have yet to read that post but assume that old chunks are compressed while chunks still getting inserts remain in row format.

Secondary indexes are local to a chunk. I assume that means that fanout over (too) many chunks can happen when a query doesn't have a predicate on the partition column. But it also means that if the chunk fits in memory, then secondary index maintenance is not delayed by reading from storage and it is easier to sustain high ingest rates.
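A sketch based on the TimescaleDB docs (table and column names are mine; check the current docs for the exact create_hypertable signature):

  create table metrics (time timestamptz not null, device_id bigint,
                        metric_id int, value float8);
  -- range partition on time only
  select create_hypertable('metrics', 'time');
  -- or: hash on device_id then range on time (use one of the two calls, not both)
  select create_hypertable('metrics', 'time', 'device_id', 4);
  -- secondary indexes are local to each chunk
  create index on metrics (device_id, time desc);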

Challenges

What is a good indexing strategy for a simple time series workload where new data has four attributes: time, deviceID, metricID, metricValue? I write about this in my post on a replacement for the insert benchmark. Assume that I will use a partition column in addition to the time column. My choices with TimescaleDB are:
  1. partition on deviceID and create local secondary indexes on metricID
  2. partition on metricID and create local secondary indexes on deviceID
I am not sure what I would do to make the min/max query fast as explained in the section on Physical Schema - Read Optimized.

Edits

I quickly read the post on columnar compression and I am impressed they made this work in the context of Postgres. They use TOAST to store compressed columnar chunks off-page so IO is only done for the target columns. In terms of being surprised that this was feasible, it reminds me of the work to support columnar in SQL Server.

Wednesday, July 15, 2020

Indexing and write-heavy workloads

When I see impressive numbers for the insert rate that a DBMS can sustain I wonder what indexes exist and whether the inserts are in sequential or random order with respect to each index. One way to explain this is in terms of the number of points in the index at which the inserts occur, although I use streams rather than insert points in what follows.

I am writing this in part so that I can reference this post in future performance reports when describing workloads. It isn't sufficient to state that inserts are in PK order. They can be in ascending or descending PK order. When ascending the point at which the inserts are done can be at the right end of the index (inserted keys > than existing keys) or somewhere in the middle of the index. When descending the inserts can be done at the left end of the index (inserted keys < existing keys) or somewhere in the middle of the index.

Explaining insert patterns

There are four attributes per index that can explain such insert patterns. The attributes are:
  • nAsc - number of streams for which inserts occur in ascending order WRT the index
  • nDesc - number of streams for which inserts occur in descending order WRT the index
  • nLHS - the number of descending streams that are at the left end of the index 
  • nRHS - the number of ascending streams that are at the right end of the index
Constraints:
  • nAsc >= 0, nDesc >= 0 and (nAsc + nDesc) >= 1
  • nLHS and nRHS must be 0 or 1
  • if nLHS is 1 then nDesc must be >= 1
  • if nRHS is 1 then nAsc must be >= 1
There is one exception. When the insert pattern is random WRT the index then inf is used instead of the four attributes.

Geek Code

This is a geek code for explaining insert patterns. The attributes are specified per index. When there is only a PK index, named pk, and inserts occur in PK order at the right end of the index (right growing) then the geek code is:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
When there is only a PK index but the inserts are in random order WRT the PK then the geek code is:
pk=(inf)
To improve readability I omit attributes for which the value is 0. So these mean the same thing:
pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
pk=(nAsc:1, nRHS:1)
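As a sketch in MySQL-flavored SQL, here are two schemas that map to the geek codes above (table names are mine):

  -- an auto-increment PK is right-growing: pk=(nAsc:1, nRHS:1)
  create table t1 (id bigint auto_increment primary key, val int);
  -- a random 16-byte key (a UUID stored as binary, for example) gives pk=(inf)
  create table t2 (id binary(16) primary key, val int);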
Why?

I am interested in this for three reasons. First, index maintenance has a big impact on insert performance whether or not the working set is in memory. Second, there are optimizations that a DBMS can do for some insert patterns and I suspect there is room for even more optimizations. Many storage engines optimize for right-growing inserts. In that case RocksDB with leveled compaction will have write amplification of 2 -- write once for the WAL, write again for the memtable flush, no compaction. Finally, this makes it easier to explain write-heavy workloads.

Streams and insert points

I use ordered arrays rather than indexes to explain streams (insert points). Assume the array starts as [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]; this represents the keys in the PK index and there are no secondary indexes. Some of the examples show why I use streams to describe this.

Examples:
  • Random
    • pk=(inf)
    • insert sequence: 1.5, 6.5, 1.7, 8.1, 0.0, 4.5, ...
  • Right growing
    • pk=(nAsc:1, nRHS:1)
    • insert sequence: 10.0, 11.0, 12.0, ...
  • Left growing
    • pk=(nDesc:1, nLHS:1)
    • insert sequence: 0.0, -1.0, -2.0, ...
  • Left & right growing
    • pk=(nAsc:1, nDesc:1, nLHS:1, nRHS:1)
    • insert sequence: 10.0, 0.0, -1.0, 11.0, 12.0, -2.0
    • insert sequence as interleaved streams: [10.0, 11.0, 12.0] and [0.0, -1.0, -2.0]
  • 1 middle ascending
    • pk=(nAsc:1, nRHS:0)
    • insert sequence: 8.1, 8.11, 8.111, 8.1111, ...
  • 1 middle descending
    • pk=(nDesc:1, nLHS:0)
    • insert sequence: 7.9, 7.89, 7.889, 7.8889, ...
  • 1 middle ascending, 1 middle descending
    • pk=(nAsc:1, nDesc:1)
    • insert sequence: 8.1, 7.9, 8.11, 7.89, 8.111, 7.889, ... 
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [7.9, 7.89, 7.889]
  • 2 middle ascending:
    • pk=(nAsc:2)
    • insert sequence: 8.1, 6.1, 8.11, 6.11, 8.111, 6.111, ...
    • insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [6.1, 6.11, 6.111]
  • N middle ascending
    • pk=(nAsc:N) for some finite value N

Explaining the insert benchmark

Until recently I ran the insert benchmark by first creating a PK index and 3 secondary indexes per table (or collection) and then doing inserts. Informally, the inserts were in PK order but random WRT each secondary index. More formally, the insert pattern is the following when the secondary indexes are named s1, s2 and s3:
pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)
The insert benchmark can become extremely IO-bound because of the random insert patterns for each of the secondary indexes. In the worst case with a B-Tree there is one page read and one page written back per secondary index per insert (3 pages read, 3 pages written back with 3 secondary indexes).
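A sketch in MySQL-flavored SQL of a table with that insert pattern (the schema is mine, not the exact insert benchmark schema):

  create table t (
    pk bigint auto_increment primary key,  -- pk=(nAsc:1, nRHS:1)
    c1 int, c2 int, c3 int,                -- column values arrive in random order
    index s1 (c1),                         -- s1=(inf)
    index s2 (c2),                         -- s2=(inf)
    index s3 (c3)                          -- s3=(inf)
  );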

I recently changed the way that I run the insert benchmark to create the PK index, load some data, create secondary indexes and then load more data. In this case the insert pattern during load some data is:
pk=(nAsc:1, nRHS:1)

And then during load more data (with secondary indexes in place) is:
pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)

Tuesday, July 14, 2020

Review: The Design of the Postgres Storage System

This is a review of The Design of the Postgres Storage System. The paper was in VLDB 1987 although my review used a version of the paper that might differ from the one in VLDB (which is appropriate, given MVCC). This is also the first in a series of posts on MVCC GC. My standard disclaimer is that I don't review papers that aren't worth reading.

The paper is about POSTGRES, not modern PostgreSQL. POSTGRES was using POSTQUEL at the time and the paper has sample POSTQUEL statements. Something I like very much about the paper is the use of simple estimates to explain the design decisions.

1 Introduction

The abstract and introduction provide four reasons for the research:
  1. Optionally archive all previous versions of rows on separate storage as WORM optical disks were likely to arrive on the market. 
  2. Don't use a WAL (redo log). The benefits include instant recovery, less code complexity and better support for archiving.
  3. Use multiple processes to benefit from multi-processor systems that were likely to arrive and were in development at UC Berkeley and elsewhere.
  4. Use non-volatile RAM (NVRAM) to boost performance.
I assume that the first point (archive all previous versions) was the big deal. While this feature might not have been an immediate success it has turned into a big deal in production systems today. Atlas Data Lake from MongoDB is one example. Another popular way to archive all versions is via change data capture (CDC) to store the older versions in a data warehouse separate from the OLTP system.

2.1 Transaction System

POSTGRES used 40-bit transaction IDs and stated that was sufficient for about 320 years of operation at 100 transactions per second. The POSTGRES transaction log uses 2 bits per transaction -- there is no redo log. Logically the log is a large array indexed by XID. The bits represent the status of a transaction: committed, aborted, in progress. The XID is assigned at transaction start. Commit is done by changing the bit in the log to committed, forcing the log page to stable storage and forcing modified database pages to stable storage. Stable storage is either magnetic disk or NVRAM.
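As a rough check: 2^40 is about 1.1 trillion XIDs, and 1.1 trillion / (100 * 86,400 * 365) is roughly 350 years.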

The log tail is the log from the XID of the oldest active transaction to the present and requires 2 bits per XID. The body is the rest of the log (transactions older than the oldest active transaction). As all transactions in the log body are either committed or aborted, only 1 bit per XID is needed for the log body.

The goal is to keep the log tail in NVRAM and the log body cached in memory. The log body is read-only, the log tail is not. Both can be searched to determine whether a row is visible to a snapshot and the goal is to avoid disk reads in that search. The paper also explains that a bloom filter can be created on the XIDs of aborted transactions to avoid keeping the log body in memory.

Modern PostgreSQL uses 32-bit transaction IDs and wraparound is a source of problems. Other differences are that modern PostgreSQL has a redo log, doesn't force modified pages to stable storage on commit and doesn't (yet) try to take advantage of NVRAM.

I expect that POSTGRES had worse write-amplification than a system that didn't force dirty pages on commit. But I am unlikely to run the insert benchmark to confirm this. Besides, LMDB does FORCE on commit and has many happy users.

2.2 Relation Storage

The per-row metadata includes:

  • OID - system-assigned unique ID

  • Xmin, Xmax - the XID that starts, ends the version

  • Tmin, Tmax - commit time of XID from Xmin, Xmax

  • Cmin, Cmax - ID of command that starts, ends the version. This is 1-byte so there could be at most 256 commands (statements?) per transaction.

  • PTR - pointer to older or newer version of the row (explained below)


Modern PostgreSQL has similar per-row metadata. The differences are that in PostgreSQL the XID is 32 bits, there is only one field for command ID, there is a 6-byte tuple ID (TID) and the OID is usually not used for user tables.
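A quick way to see the per-row metadata in modern PostgreSQL is to select the system columns (table and column names here are mine):

  create table t (a int);
  insert into t values (1);
  select xmin, xmax, cmin, cmax, ctid, a from t;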


Fields were set:

  • On insert the OID, Xmin and Cmin were set. Tmin was not set because commit had yet to occur.

  • On update Xmax and Cmax were set to end the row version and a new version of the row was inserted (hopefully to the same page). The new version reused the OID of the ended version and the PTR for the new version pointed to the ended version.

  • On delete Xmax and Cmax were set.

To use less space updates only stored fields that changed and the other fields were found by following the PTR chain (a singly-linked list). The oldest version of a row was called the anchor point. The notion of an anchor point and update (delta) chain is similar to the current support for Heap Only Tuples (HOT) in modern PostgreSQL. I wonder if that is a feature that was removed in early PostgreSQL and then returned later for a different reason.

2.3 Time Management


This section shows the logic required to determine whether a version is visible to a query. The check is more complicated than what InnoDB and RocksDB require, but I assume the CPU overhead is not that different than what occurs in modern PostgreSQL and in my testing of modern PG this isn't an issue. The logic includes a check of the transaction log to determine whether the transaction from Xmin or Xmax committed. That check wouldn't be needed if the commit timestamp were written into the row on commit -- but doing that is non-trivial and can hurt performance. The need to check the transaction log also means that the searched parts of the log must remain in memory or there will be disk reads. The ability to keep that in memory is explained in section 2.1. I am wary of the ability to keep the log in memory for high TPS systems but this is a problem they didn't need to solve at the time.


2.4 Concurrency Control and Timestamp Management


POSTGRES contains a TIME relation that has the commit time for each transaction. This has 32 bits per XID and is updated on commit. The tail of TIME should be stored in stable main memory to avoid forcing a disk page on commit.


Relations are marked by the user as no-archive, light-archive or heavy-archive. Tmin and Tmax are never set for no-archive relations and I assume old versions for them are not moved to the archive. For light-archive, old versions are moved to the archive but Tmin/Tmax are not set to avoid the overhead of doing a search of the transaction log to determine their status. For heavy-archive the reader (a query) will lookup the commit time from the TIME relation and update Tmin/Tmax (thus making a page dirty). Vacuum sets Tmin/Tmax for heavy-archive when moving older versions to the archive. It is possible that the thing (query, vacuum) that searches TIME will be delayed by disk reads.


2.5 Record Access


On each page there is a line table with an entry per anchor point record. A secondary index entry points to a line table entry. On update, a secondary index only needs maintenance if the indexed columns have changed.


Modern PostgreSQL uses the name line pointer. Also modern PostgreSQL does secondary index maintenance for all secondary indexes unless no indexed columns have changed. So if there are 3 secondary indexes and an update changes a column used by 1 of them then maintenance is done for all of them -- unless HOT is used. If no indexed columns have changed then the Heap Only Tuples (HOT) optimization is used and the new version is added to the end of the update chain and secondary index entries reference the line pointer for the head of the update chain. Quoting from the HOT document:

Without HOT, every version of a row in an update chain has its own index entries, even if all indexed columns are the same. With HOT, a new tuple placed on the same page and with all indexed columns the same as its parent row version does not get new index entries.
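One way to see how often updates get the HOT optimization is the per-table counter in pg_stat_user_tables (the table name is hypothetical):

  -- n_tup_hot_upd counts updates that used the HOT optimization
  select relname, n_tup_upd, n_tup_hot_upd
  from pg_stat_user_tables
  where relname = 't1';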

3.1 Vacuuming the disk


POSTGRES had a command to trigger vacuum of a relation. The example was vacuum rel-name after "30 days". This reclaims space from aborted transactions and moves old versions to the archive. Old versions for relations marked as light-archive and heavy-archive are moved to archive storage. If heavy-archive is set for the relation then vacuum will set Tmin/Tmax if unset. Differences between POSTGRES and modern PostgreSQL include:

  • Vacuum in modern PostgreSQL doesn't move older versions to an archive. It does reclaim space for versions that have been deleted and are no longer visible. It also sets bits in the visibility map and does work to avoid transaction ID wraparound.

  • Vacuum did a full scan of the relation in POSTGRES while modern PostgreSQL only checks pages that require vacuum, courtesy of the visibility map.

  • Vacuum in modern PostgreSQL does a full index scan for every secondary index of the vacuumed relation when there are rows to remove.

3.2 Archival Medium


The target archival media was optical WORM. While WORM might not have been a huge hit CD-R and DVD-R were a big deal for a long time. Zip drives were a big deal for a shorter time and now we have USB thumb drives. Maybe WORM will return in the form of ultra-low-endurance NAND flash SSDs that support only one device write.


The paper also explained interesting ways to manage secondary indexes using both magnetic disk and archive devices with plans for R-trees to support efficient time-bounded queries.


3.3 Vacuum Process


Vacuum in POSTGRES did:

  1. Write archive record and index entries

  2. Write new anchor point in current database, insert new index entries

  3. Reclaim space from old anchor point and delta records


This wasn't crash safe but POSTGRES did the right thing in spite of crashes. Crashes could leave duplicate records with a copy of the same version in both the archive and main store. But POSTGRES was relational and eliminated such duplicates. I explain differences with modern PostgreSQL above in section 3.1.


5.1 [Performance Comparison] Assumptions


One nit I have with the paper is the argument that CPU is not a critical resource. It listed a few reasons for this -- CPUs were getting much faster than disk, multi-processors were coming, co-processors could be used and custom logic could be used. While the CPU-disk speed gap was growing the paper ignored that RAM density was growing quickly and many DBMS applications would be less IO-bound in the future.


Another nit is that the paper ignores the overhead from vacuum. Vacuum doesn't just use CPU. It reads from the vacuumed relations and dirties pages in them. Accounting for that overhead would be complicated and the focus of the paper was on simple performance models, which made it a nice paper to read.







