likes
comments
collection
share

从头开始构建数据库:05. B-Tree: The Practice (Part II)

作者站长头像
站长
· 阅读数 18

05. B-Tree: The Practice (Part II)

05.B树:实践(第二部分)

Following the previous chapter on B-tree implementation. 继上一章B树实现之后。

5.1 The B-Tree Deletion

5.1 B树删除

Step 1: Delete From Leaf Nodes

步骤1:从叶节点删除

The code for deleting a key from a leaf node is just like other nodeReplace* functions. 从叶节点删除键的代码与其他 nodeReplace* 函数一样。

// remove a key from a leaf node
func leafDelete(new BNode, old BNode, idx uint16) {
    new.setHeader(BNODE_LEAF, old.nkeys()-1)
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendRange(new, old, idx, idx+1, old.nkeys()-(idx+1))
}

Step 2: Recursive Deletion

第2步:递归删除

The structure is similar to the insertion. 结构与插入类似。

// delete a key from the tree
func treeDelete(tree *BTree, node BNode, key []byte) BNode {
    // where to find the key?
    idx := nodeLookupLE(node, key)
    // act depending on the node type
    switch node.btype() {
    case BNODE_LEAF:
        if !bytes.Equal(key, node.getKey(idx)) {
            return BNode{} // not found
        }
        // delete the key in the leaf
        new := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        leafDelete(new, node, idx)
        return new
    case BNODE_NODE:
        return nodeDelete(tree, node, idx, key)
    default:
        panic("bad node!")
    }
}

Step 3: Handle Internal Nodes

第三步:处理内部节点

The difference is that we need to merge nodes instead of splitting nodes. A node may be merged into one of its left or right siblings. The nodeReplace* functions are for updating links. 不同的是我们需要合并节点而不是分裂节点。一个节点可以合并到它的左兄弟节点或右兄弟节点之一。 nodeReplace 函数用于更新链接。

// part of the treeDelete()
func nodeDelete(tree *BTree, node BNode, idx uint16, key []byte) BNode {
    // recurse into the kid
    kptr := node.getPtr(idx)
    updated := treeDelete(tree, tree.get(kptr), key)
    if len(updated.data) == 0 {
        return BNode{} // not found
    }
    tree.del(kptr)

    new := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
    // check for merging
    mergeDir, sibling := shouldMerge(tree, node, idx, updated)
    switch {
    case mergeDir < 0: // left
        merged := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        nodeMerge(merged, sibling, updated)
        tree.del(node.getPtr(idx - 1))
        nodeReplace2Kid(new, node, idx-1, tree.new(merged), merged.getKey(0))
    case mergeDir > 0: // right
        merged := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        nodeMerge(merged, updated, sibling)
        tree.del(node.getPtr(idx + 1))
        nodeReplace2Kid(new, node, idx, tree.new(merged), merged.getKey(0))
    case mergeDir == 0:
        if updated.nkeys() == 0 {
            // kid is empty after deletion and has no sibling to merge with.
            // this happens when its parent has only one kid.
            // discard the empty kid and return the parent as an empty node.
            assert(node.nkeys() == 1 && idx == 0)
            new.setHeader(BNODE_NODE, 0)
            // the empty node will be eliminated before reaching root.
        } else {
            nodeReplaceKidN(tree, new, node, idx, updated)
        }
    }
    return new
}

Extra care regarding empty nodes: If a node has no siblings, it cannot be merged, even if all its keys are deleted. In this case, we need to remove the empty node, this will also cause its parent to become an empty node, the empty node will propagate upwords until eventually merged. 特别注意空节点:如果一个节点没有兄弟节点,则即使删除其所有键,也无法合并该节点。在这种情况下,我们需要删除空节点,这也会导致其父节点成为空节点,空节点会向上传播直到最终合并。

// merge 2 nodes into 1
func nodeMerge(new BNode, left BNode, right BNode) {
    new.setHeader(left.btype(), left.nkeys()+right.nkeys())
    nodeAppendRange(new, left, 0, 0, left.nkeys())
    nodeAppendRange(new, right, left.nkeys(), 0, right.nkeys())
}

Step 4: The Conditions for Merging

第四步:合并的条件

The conditions for merging are: 合并的条件是:

  1. The node is smaller than 1/4 of a page (this is arbitrary). 该节点小于页面的 1/4(这是任意的)。
  2. Has a sibling and the merged result does not exceed one page. 有同级且合并结果不超过一页。
// should the updated kid be merged with a sibling?
func shouldMerge(
    tree *BTree, node BNode,
    idx uint16, updated BNode,
) (int, BNode) {
    if updated.nbytes() > BTREE_PAGE_SIZE/4 {
        return 0, BNode{}
    }

    if idx > 0 {
        sibling := tree.get(node.getPtr(idx - 1))
        merged := sibling.nbytes() + updated.nbytes() - HEADER
        if merged <= BTREE_PAGE_SIZE {
            return -1, sibling
        }
    }
    if idx+1 < node.nkeys() {
        sibling := tree.get(node.getPtr(idx + 1))
        merged := sibling.nbytes() + updated.nbytes() - HEADER
        if merged <= BTREE_PAGE_SIZE {
            return +1, sibling
        }
    }
    return 0, BNode{}
}

The deletion code is done. 删除代码就完成了。

5.2 The Root Node

5.2 根节点

We need to keep track of the root node as the tree grows and shrinks. Let’s start with deletion. 当树生长和收缩时,我们需要跟踪根节点。我们先从删除开始。

This is the final interface for B-tree deletion. The height of the tree will be reduced by one when: 这是B树删除的最终接口。当出现以下情况时,树的高度将减一:

  1. The root node is not a leaf. 根节点不是叶子。
  2. The root node has only one child. 根节点只有一个子节点。
func (tree *BTree) Delete(key []byte) bool {
    assert(len(key) != 0)
    assert(len(key) <= BTREE_MAX_KEY_SIZE)
    if tree.root == 0 {
        return false
    }

    updated := treeDelete(tree, tree.get(tree.root), key)
    if len(updated.data) == 0 {
        return false // not found
    }

    tree.del(tree.root)
    if updated.btype() == BNODE_NODE && updated.nkeys() == 1 {
        // remove a level
        tree.root = updated.getPtr(0)
    } else {
        tree.root = tree.new(updated)
    }
    return true
}

And below is the final interface for insertion: 最终的插入接口如下:

// the interface
func (tree *BTree) Insert(key []byte, val []byte) {
    assert(len(key) != 0)
    assert(len(key) <= BTREE_MAX_KEY_SIZE)
    assert(len(val) <= BTREE_MAX_VAL_SIZE)

    if tree.root == 0 {
        // create the first node
        root := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        root.setHeader(BNODE_LEAF, 2)
        // a dummy key, this makes the tree cover the whole key space.
        // thus a lookup can always find a containing node.
        nodeAppendKV(root, 0, 0, nil, nil)
        nodeAppendKV(root, 1, 0, key, val)
        tree.root = tree.new(root)
        return
    }

    node := tree.get(tree.root)
    tree.del(tree.root)

    node = treeInsert(tree, node, key, val)
    nsplit, splitted := nodeSplit3(node)
    if nsplit > 1 {
        // the root was split, add a new level.
        root := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        root.setHeader(BNODE_NODE, nsplit)
        for i, knode := range splitted[:nsplit] {
            ptr, key := tree.new(knode), knode.getKey(0)
            nodeAppendKV(root, uint16(i), ptr, key, nil)
        }
        tree.root = tree.new(root)
    } else {
        tree.root = tree.new(splitted[0])
    }
}

It does two things: 它做了两件事:

  1. A new root node is created when the old root is split into multiple nodes. 当旧的根被分割成多个节点时,就会创建一个新的根节点。
  2. When inserting the first key, create the first leaf node as the root. 插入第一个键时,创建第一个叶节点作为根。

There is a little trick here. We insert an empty key into the tree when we create the first node. The empty key is the lowest possible key by sorting order, it makes the lookup function nodeLookupLE always successful, eliminating the case of failing to find a node that contains the input key. 这里有一个小技巧。当我们创建第一个节点时,我们在树中插入一个空键。空键是按排序顺序可能的最低键,它使查找函数 nodeLookupLE 始终成功,消除了无法找到包含输入键的节点的情况。

5.3 Testing the B-Tree

5.3 测试 B 树

Since our data structure code is pure data structure code (without IO), the page allocation code is isolated via 3 callbacks. Below is the container code for testing our B-tree, it keeps pages in an in-memory hashmap without persisting them to disk. In the next chapter, we’ll implement persistence without modifying the B-tree code. 由于我们的数据结构代码是纯数据结构代码(没有IO),因此页面分配代码通过3个回调进行隔离。下面是用于测试 B 树的容器代码,它将页面保存在内存中的哈希图中,而不将它们保存到磁盘。在下一章中,我们将在不修改 B 树代码的情况下实现持久化。

type C struct {
    tree  BTree
    ref   map[string]string
    pages map[uint64]BNode
}

func newC() *C {
    pages := map[uint64]BNode{}
    return &C{
        tree: BTree{
            get: func(ptr uint64) BNode {
                node, ok := pages[ptr]
                assert(ok)
                return node
            },
            new: func(node BNode) uint64 {
                assert(node.nbytes() <= BTREE_PAGE_SIZE)
                key := uint64(uintptr(unsafe.Pointer(&node.data[0])))
                assert(pages[key].data == nil)
                pages[key] = node
                return key
            },
            del: func(ptr uint64) {
                _, ok := pages[ptr]
                assert(ok)
                delete(pages, ptr)
            },
        },
        ref:   map[string]string{},
        pages: pages,
    }
}

We use a reference map to record each B-tree update, so that we can verify the correctness of a B-tree later. 我们使用引用Map 来记录每次B树的更新,以便我们以后可以验证B树的正确性。

func (c *C) add(key string, val string) {
    c.tree.Insert([]byte(key), []byte(val))
    c.ref[key] = val
}

func (c *C) del(key string) bool {
    delete(c.ref, key)
    return c.tree.Delete([]byte(key))
}

Test cases are left to the reader as an exercise. 测试用例留给读者作为练习。

5.4 Closing Remarks 5.4 结束语

This B-tree implementation is pretty minimal, but minimal is good for the purpose of learning. Real-world implementations can be much more complicated and contain practical optimizations. 这个 B 树实现非常小,但最小对于学习来说是有好处的。现实世界的实现可能要复杂得多,并且包含实际的优化。

There are some easy improvements to our B-tree implementation: 我们的 B 树实现有一些简单的改进:

  1. Use different formats for leaf nodes and internal nodes. Leaf nodes do not need pointers and internal nodes do not need values. This saves some space. 叶节点和内部节点使用不同的格式。叶节点不需要指针,内部节点不需要值。这节省了一些空间。
  2. One of the lengths of the key or value is redundant — the length of the KV pair can be inferred from the offset of the next key. 键或值的长度之一是冗余的——KV 对的长度可以从下一个键的偏移量推断出来。
  3. The first key of a node is not needed because it’s inherited from a link of its parent. 节点的第一个键不是必需的,因为它是从其父级的链接继承的。
  4. Add a checksum to detect data corruption. 添加校验和以检测数据损坏。

The next step in building a KV store is to persist our B-tree to the disk, which is the topic of the next chapter. 构建 KV 存储的下一步是将 B 树持久化到磁盘,这是下一章的主题。

返回目录 下一章 上一章