从头开始构建数据库:02. Indexing

站长

2023年11月22日 14:32 · 阅读数 109

2.1 Key-Value Store and Relational DB

2.1 键值存储和关系型数据库

Although a relational DB supports many types of queries, almost all queries can be broken down into three types of disk operations:

尽管关系数据库支持多种类型的查询，但几乎所有查询都可以分为三种类型的磁盘操作：

Scan the whole data set. (No index is used).

扫描整个数据集。（不使用索引）。
Point query: Query the index by a specific key.

点查询：通过特定key查询索引。
Range query: Query the index by a range. (The index is sorted).

范围查询：按范围查询索引。（索引已排序）。

Database indexes are mostly about range queries and point queries, and it’s easy to see that a range query is just a superset of point queries. If we extract the functionality of the database indexes, it is trivial to make a key-value store. But the point is that a database system can be built on top of a KV store.

数据库索引主要用于范围查询和点查询，很容易看出范围查询只是点查询的超集。如果我们提取数据库索引的功能，那么创建一个键值存储就很简单了。所以重点是数据库系统可以构建在 KV 存储之上。

We’ll build a KV store before attempting the relational DB, but let’s explore our options first.

在尝试关系数据库之前，我们将构建一个 KV 存储，但让我们首先探索我们的选项。

2.2 Hashtables 2.2 哈希表

Hashtables are the first to be ruled out when designing a general-purpose KV store. The main reason is sorting — many real-world applications do require sorting and ordering. 在设计通用 KV 存储时，首先要排除哈希表。主要原因是排序——许多现实世界的应用程序确实需要排序和顺序。

However, it is possible to use hashtables in specialized applications. A headache of using hashtables is the resizing operation. Naive resizing is O(n) and causes a sudden increase in disk space and IO. It’s possible to resize a hashtable incrementally, but this adds complexity. Another problem with hashtables is when to resize down; hashtables generally don’t shrink automatically to avoid frequent and costly resizing, at the cost of wasted disk space. 但是，可以在专门的应用程序中使用哈希表。使用哈希表的一个令人头疼的问题是重调整大小操作。简单的调整大小是 O(n) 并导致磁盘空间和 IO 突然增加。可以逐步调整哈希表的大小，但这会增加复杂性。哈希表的另一个问题是何时调整大小；哈希表通常不会自动收缩，以避免频繁且昂贵的调整大小，从而浪费磁盘空间。

2.3 B-Trees 2.3 B树

Balanced binary trees can be queried and updated in O(log(n)) and can be range-queried. A B-tree is roughly a balanced n-ary tree. Why use an n-ary tree instead of a binary tree? There are several reasons: 平衡二叉树可以在 O(log(n)) 中进行查询和更新，并且可以进行范围查询。 B 树大致是平衡的 n 叉树。为什么使用n叉树而不是二叉树？有以下几个原因：

Less space overhead. 更少的空间开销。

Every leaf node in a binary tree is reached via a pointer from a parent node, and the parent node may also have a parent. On average, each leaf node requires 1~2 pointers.

二叉树中的每个叶节点都是通过父节点的指针访问的，并且父节点也可能有父节点。平均每个叶子节点需要1~2个指针。

This is in contrast to B-trees, where multiple data in a leaf node share one parent. And n-ary trees are also shorter. Less space is wasted on pointers. 这与 B 树形成对比，在 B 树中，叶节点中的多个数据共享一个父节点。而且 n 元树也更短。指针上浪费的空间更少。
Less disk IO. 更少的磁盘IO。
- B-trees are shorter, which means fewer disk seeks. B 树更短，这意味着更少的磁盘寻道。
- The minimum size of disk IOs is usually the size of the memory page (probably 4K). The operating system will fill the whole 4K page even if you read a smaller size. It’s optimal if we make use of all the information in a 4K page (by choosing the node size of at least one page). 磁盘IO的最小大小通常是内存页的大小（可能是4K）。即使您读取较小的尺寸，操作系统也会填满整个 4K 页面。如果我们利用 4K 页面中的所有信息（通过选择至少一页的节点大小），这是最佳的。
Faster in memory.

内存速度更快。

Even when the data is cached in memory and disk IO is out of the equation, due to modern CPU memory caching and other factors, n-ary trees can be faster than binary trees even if their big-O complexity is the same. 即使数据缓存在内存中并且不考虑磁盘 IO，由于现代 CPU 内存缓存和其他因素，n 叉树也可以比二叉树更快，即使它们的大 O 复杂度相同。

We’ll use B-trees in this book. But B-trees are not the only option. 我们将在本书中使用 B 树。但 B 树并不是唯一的选择。

2.4 LSM-Trees 2.4 LSM树

Log-structured merge-tree. Here is a high-level overview of how LSM-Tree works. 基于日志结构的合并树。以下是 LSM-Tree 工作原理的高级概述。

Let’s start with 2 files: a small file holding the recent updates and a big file holding the rest of the data. Updates go to the smaller file first, but the file cannot grow forever, it has to be merged with the big file at some point to create a new, bigger file. Compare this to the dumb approach of overwriting the whole database when you update something, this is an improvement because it reduces writes. 让我们从 2 个文件开始：一个保存最近更新的小文件和一个保存其余数据的大文件。更新首先转到较小的文件，但该文件不能永远增长，它必须在某个时刻与大文件合并以创建一个新的更大文件。与更新某些内容时覆盖整个数据库的笨拙的方法相比，这是一种改进，因为它减少了写入。

writes => | new updates | => | accumulated data |
               file 1               file 2

And how do you query the database? You have to query both files, and the newer (smaller) file has higher priority. For point queries, you can query the small file first, and query the big file if it misses. For range queries, both files are queried simultaneously and the results are merged. Deletion is usually done by putting a mark in the small file to indicate that a key has been deleted. The actual deletion takes place when the files are merged. 以及如何查询数据库？您必须查询这两个文件，并且较新（较小）的文件具有更高的优先级。对于点查询，可以先查询小文件，如果漏掉了再查询大文件。对于范围查询，同时查询两个文件并合并结果。删除通常是通过在小文件中放置一个标记来表明某个密钥已被删除。实际删除发生在文件合并时。

Both files contain indexing data structures for queries. The advantage is that you can use simpler data structures because the files aren’t updated in place, since the update operations are replaced by the merge operation. Each file can simply be a list of sorted KVs indexed by an array of pointers — easier and less error-prone to implement than B-trees. 这两个文件都包含用于查询的索引数据结构。优点是您可以使用更简单的数据结构，因为文件不会就地更新，因为更新操作被合并操作取代。每个文件可以简单地是由指针数组索引的排序 KV 列表 - 比 B 树更容易实现并且更不容易出错。

Having 2 files is still not optimal regarding the amount of writes when merging files since the data in the big file is written to disk over and over again. Luckily, this idea can be generalized to more than 2 files, and each “file” is usually called a “level”. Data goes into the 1st level first, and when the 1st level gets too big, the 1st level is merged into the 2nd level, and the 2nd level is now bigger. Each level is merged into the next bigger and older level when it gets too big. 就合并文件时的写入量而言，拥有 2 个文件仍然不是最佳选择，因为大文件中的数据会一遍又一遍地写入磁盘。幸运的是，这个想法可以推广到 2 个以上的文件，每个“文件”通常称为一个“级别”。数据首先进入第1级，当第1级变得太大时，第1级被合并到第2级，并且第2级现在更大。当每个级别变得太大时，都会合并到下一个更大和更旧的级别中。

                 |level 1|
                    ||
                    \/
           |------level 2------|
                    ||
                    \/
|-----------------level 3-----------------|

Why does this scheme writes less than the 2-level scheme? Levels grow exponentially, the multiplier of excess disk write (called write amplification) is O(log(n)) to the data size. For example, you can think of a list of files with exponentially increasing size by the power of two, then you double the size of the 1st file, now the size if the same as the 2nd file, merge it with the 2nd file and then merge it with the 3rd file and etc. 为什么这个方案比2级方案写得少呢？级别呈指数级增长，过量磁盘写入（称为写入放大）的倍数是数据大小的 O(log(n)) 。例如，您可以考虑一个文件列表，其大小按 2 的幂指数增加，然后将第一个文件的大小加倍，现在大小与第二个文件相同，将其与第二个文件合并，然后将其与第三个文件合并等等。

Real databases don’t use the power of two ratio between levels because it creates too many levels, which hurts query performance. The size ratio between levels is usually tunable to allow tradeoffs between write amplification and query performance. 真正的数据库不会使用级别之间的两个比率的幂，因为它创建了太多级别，这会损害查询性能。级别之间的大小比率通常是可调的，以允许在写入放大和查询性能之间进行权衡。

Also, real databases usually implement levels as multiple sorted and non-overlapping files instead of one big sorted file. Merges are performed in small parts, which allows for smoother operation. This also reduces the disk space requirements, otherwise, merging the last level would double the disk space usage. 此外，真正的数据库通常将级别实现为多个排序且不重叠的文件，而不是一个大的排序文件。合并是在小部分中执行的，这使得操作更顺畅。这也减少了磁盘空间要求，否则，合并最后一个级别将使磁盘空间使用量增加一倍。

Readers can try to use LSM-trees instead of B-trees after finishing this book. And compare the cons and pros between B-trees and LSM-trees. 读完本书后，读者可以尝试使用LSM树来代替B树。并比较 B 树和 LSM 树之间的优缺点。

转载自:https://juejin.cn/post/7293175887901884428