likes
comments
collection
share

从头开始构建数据库:00_Introduction

作者站长头像
站长
· 阅读数 73

00. Introduction 00.简介

0.1 What is This Book About?

0.1 这本书是关于什么的?

Databases are not black boxes. Understand them by building your own from scratch! 数据库不是黑匣子。可以通过从头开始构建自己的来了解它们!

This book contains a walk-through of a minimal persistent database implementation. The implementation is incremental. We start with a B-Tree, then a simple KV store, and eventually end with a mini relational DB. 本书包含了最小的可持久化的数据库实现。而且实现过程是渐进的。我们从 B-Tree 开始,然后在每一章中添加一个新概念,最终从简单的 KV 到迷你关系数据库。

The book focuses on important ideas rather than implementation details. Real-world databases are complex and harder to grasp. We can learn faster and easier from a stripped-down version of a database. And the “from scratch” method forces you to learn deeper. 本书重点关注重要的想法而不是实现细节。现实世界的数据库非常复杂且难以掌握。我们可以从数据库的精简版本中更快、更轻松地学习。而“从头开始”的方法迫使你更深入地学习。

Although the book is short and the implementation is minimal, it aims to cover three important topics: 尽管这本书很短并且实现也很少,但它旨在涵盖三个重要主题:

  1. Persistence.How not to lose or corrupt your data. Recovering from a crash. 持久化。如何不丢失或损坏您的数据。崩溃恢复。
  2. **Indexing.**Efficiently querying and manipulating your data. (B-tree). 索引。高效地查询和操作数据。 (B 树)。
  3. **Concurrency.**How to handle multiple (large number of) clients. And transactions. 并发。如何处理大量客户端和事务。

If you have only vague ideas like “databases store my data” or “indexes are fast”, this book is for you. 如果您只有模糊的想法,例如“数据库存储我的数据”或“索引速度很快”,那么这本书适合您。

0.2 How to Use This Book?

0.2 如何使用本书?

This book takes a step-by-step approach. Each step builds on the previous one and adds a new concept. The book uses Golang for sample code, but the topics are language agnostic. Readers are advised to code their own version of a database rather than just read the text. 本书采用循序渐进的方法。每一步都建立在前一步的基础上,并添加了一个新概念。本书使用 Golang 作为示例代码,但主题与语言无关。建议读者编写自己版本的数据库,而不仅仅是阅读文本。

The draft chapters can be accessed at the official website: 章节草稿可在官方网站获取:

build-your-own.org

0.3 Topic One: Persistence

0.3 主题一:持久化

Why do we need databases? Why not dump the data directly into files? Our first topic is persistence. 为什么我们需要数据库?为什么不直接将数据转储到文件中呢?我们的第一个话题是持久化。

Let’s say your process crashed middle-way while writing to a file, or you lost power, what’s the state of the file? 假设您的进程在写入文件时中途崩溃,或者您断电了,文件的状态是什么?

  • Does the file just lose the last write? 文件是否丢失了最后一次写入的内容?
  • Or ends up with a half-written file? 或者最终得到一个写了一半的文件?
  • Or ends up in an even more corrupted state? 或者最终陷入更加不一致的状态?

Any outcome is possible. Your data is not guaranteed to persist on a disk when you simply write to files. This is a concern of databases. And a database will recover to a usable state when started after an unexpected shutdown. 任何结果都是可能的。当写入文件时,并不能保证数据会一定会保留在磁盘上。这是数据库关心的问题。数据库在意外关闭后启动时将恢复到可用状态。

Can we achieve persistence without using a database? There is a way: 我们可以在不使用数据库的情况下实现持久化吗?有一种方法:

  1. Write the whole updated dataset to a new file. 将整个更新的数据集写入新文件。
  2. Callfsyncon the new file. 对新文件调用fsync
  3. Overwrite the old file by renaming the new file to the old file, which is guaranteed by the file systems to be atomic. 通过将新文件重命名为旧文件来覆盖旧文件,这由文件系统保证是原子的。

This is only acceptable when the dataset is tiny. A database like SQLite can do incremental updates. 仅当数据集很小时,这才是可接受的。像 SQLite 这样的数据库可以进行增量更新。

0.4 Topic Two: Indexing

0.4 主题二:索引

There are two distinct types of database queries: analytical (OLAP) and transactional (OLTP). 有两种不同类型的数据库查询:分析型 (OLAP) 和事务型 (OLTP)。

  • Analytical (OLAP) queries typically involve a large amount of data, with aggregation, grouping, or join operations. 分析 (OLAP) 通常涉及大量数据查询,并具有聚合、分组或连接操作。
  • In contrast, transactional (OLTP) queries usually only touch a small amount of indexed data. The most common types of queries are indexed point queries and indexed range queries. 相比之下,事务性 (OLTP) 通常只涉及少量索引数据查询。最常见的查询类型是索引点查询和索引范围查询。

Note that the word “transactional” isnotrelated to database transactions as you may know. Computer jargon is often overloaded with different meanings. This book focuses only on OLTP techniques. 请注意,“事务性”一词与您可能知道的数据库事务无关。计算机行话通常有很多不同的含义。本书只关注 OLTP 技术。

While many applications are not real-time systems, most user-facing software should respond in a reasonable (small) amount of time, using a reasonable amount of resources (memory, IO). This falls into the OLTP category. How do we find the data quickly (inO(log(n))), even if the dataset is large? This is why we need indexes. 虽然许多应用程序不是实时系统,但大多数面向用户的软件应该使用合理数量的资源(内存、IO)在合理(少量)时间内做出响应。这属于 OLTP 类别。即使数据集很大,我们如何快速找到数据(在O(log(n))中)?这就是我们需要索引的原因。

If we ignore the persistence aspect and assume that the dataset fits in memory, finding the data quickly is the problem of data structures. Data structures that persist on a disk to look up data are called “indexes” in database systems. And database indexes can be larger than memory. There is a saying: if your problem fits in memory, it’s an easy problem. 如果我们忽略持久性方面并假设数据集可以全部加载到内存,那么快速找到数据就是数据结构的问题。保存在磁盘上以查找数据的数据结构在数据库系统中称为“索引”。并且数据库索引可以大于内存。有句话说:如果你的问题可以加载到内存,那么它就是一个简单的问题。

Common data structures for indexing include B-Trees and LSM-Trees. 用于索引的常见数据结构包括 B 树和 LSM 树。

0.5 Topic Three: Concurrency

0.5 主题三:并发

Modern applications do not just do everything sequentially, nor do databases. There are different levels of concurrency: 现代应用程序不仅仅按顺序执行所有操作,数据库也是如此。有不同级别的并发:

  • Concurrency between readers. 多个读之间的并发。
  • Concurrency between readers and writers, do writers need exclusive access to the database? 读取和写入之间的并发,写入是否需要对数据库进行独占访问?

Even the file-based SQLite supports some concurrency. But concurrency is easier within a process, which is why most database systems can only be accessed via a “server”. 甚至基于文件的 SQLite 也支持一定程度的并发性。但进程内的并发性更容易,这就是为什么大多数数据库系统只能通过“服务器”访问。

With the addition of concurrency, applications often need to do things atomically, such as the read-modify-write operation. This adds a new concept to databases: transactions. 随着并发性的增加,应用程序通常需要以原子方式执行操作,例如读取-修改-写入操作。这给数据库增加了一个新概念:事务。