Overview RDBMS Index 设计

original article: https://xiang753017.gitbook.io/zixiang-blog/database/relation-database-index-overview

Relation Database Context

Relation Database 的资料结构称 page，key 指的是一个 row 的 unique key，而 Row 是指记忆体位置。Pages 是以 B-Tree 储存，

Non-Leaf 的 Page 会是指的是 index 而这样的 row 会指设到下一个 Non-Leaf 的 Page 或者 Leaf-Page

Leaf Page 指的是实际的 table row

How Do Database Retrieve data?

在资料库在读取资料的时候，要找所对应的page 且输出给 user。而实际的 activities 如下图[1][2]：

Read Cache: 从cache 取得 page，时间约 0ms

Read Buffer: 从 buffer queue 取得时间约 3 ms

Read Disk: 从 disk 取得所要的pages3.1 Random Read: about 0.1 MB/s到0.5 MB/s usually use in specific search ex: where a = :a3.2 Sequential Read: about 40 MB /s. usually use in range search ex: where a > :a

Rotate Buffer: 对buffer 进行更新

Store cache: 更新cache

以 2005 年[1]当时候的硬体，若从 disk 找到资料 Random read 约 10ms的时间，现今2025 年 SSD 读写速度： Random Read 约500 mb/s, Sequence Read 8000mb/s

What is the Pre-Fetch in RDBMS?Pre-Fetch 的意思是指在搜集需要的资料的时候，当找到全部 hit 到 index之后再进行 Retrive data，可参考下图; 反之没有Pre-Fetch 就是当Scan 一个 index 之后直接去 retrieve 资料，变成是 sequence 的方式。

Index Perfomance Factories

在资料库中 Index 是用来加速读取使用，index 的原理是以比较小的资料集合 reference 比较大的资料集合; 换句话说，以部分资料来 reference 完整资料。而在读取 Index pages 因为量体比较小，利用 memory 速度快于 disk 的优势快速的找到 table pages. 影响读取速度的关键在于：使用 Random Read & Sequence Read 的次数 (速度比为 Random Read : Sequence Read = 1 : 15)

以参考资料[2]的基本假设作为以下举例，分别的读取速率为：

Random Read: 10 ms

Sequence Read: 1 ms

Fetch Data: 0.1 msExample Table 有五个栏位 id 为 primary key

CREATE TABLE example_table (
id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
created_at TIMESTAMP,
status BOOLEAN
)

E.q. 1

-- total 1,000,000 rows
select * from example_table where id = :id

Explain:

db 使用 random read 读取 pk index

db 使用 random read 读取 table row

db fetch dataTotal 时间为： 10ms + 10ms+ 0.1 ms 约 20ms

E.q. 2

-- total 1,000,000 rows
-- hit rate 1% = 100,000
select * from example_table where name = :name

Explain:

db 使用 sequence read 全区域搜寻，共有 1% 的命中率 (100,000 rows)

fetch rows:

Total 时间为： 1 ms * 1,000,000 + 100,000 * 0.1 ms = 1000s + 10s = 1010 s

E.q.3

-- total 1,000,000 rows,
-- index (name, id)
- hit rate 1% = 100,000
select id, name from example_table where name = :name

Explain:

db 使用 random read 读取 indexTotal 时间为： 10 ms

因为 index 已经包含需要fetch 的资料，因此并不需要去取得 table 资料index 是具有顺序性的 (name, id) 会从 name 开始搜寻再往下到 id, 因此 index = (id, name) 则在上述无法hit 到 index

E.q.4

-- total 1,000,000 rows,
-- index (name, id)
- hit rate 1% = 100,000
select * from example_table where name = :name

Explain:

db 使用 random read 读取 index

db 使用 random/sequence read 读取 table row

使用 random/sequence read 取决于资料相邻的程度，

相邻 seq read; 不相邻 random read

db fetch dataBest case Total 时间为： 10ms + 100,000 * 1ms + 100,000 * 0.1 ms = 10ms + 100s + 10s = 110 s Worst case Total 时间为：10ms + 100,000 * 10ms + 100,000 * 0.1 ms = 10ms + 1000s + 10s = 1010s

Index Design

Build A Index Cost

[2]Index 加速的原理，主要是有效的减少需要搜寻的量体，近一步减少在搜寻的 I/O time。但是建制 Index 会在 Create/Update/Delete 的时候消耗写入的速度。

一般来说 index 都是以 Random wirte 进行写入，在传统的 HDD 中大约 10 ms ; SSD 则是约 0.1 ms，换句话说，以HDD为例有10 个 index 要进行插入，则会需要 10 * 10 ms 约 100 ms。此外在高速写入下，也会造成 disk 的本身的负担剧增，导致系统性地被拖累以下提供 rule 作为评估的一个标準[2](page 60):

N: the number of indexes with random inserts

I = insert rate (table rows per second)

D = the number of disk drives (excluding spares)L = N * I/D; if L < 1 则几乎不会有影响; 1 < L < 10 则大概已经会有影响disk 的效率; L > 10 则会高频率的造成问题

How many stars of index are reasonable?

Star 的意思是指，以多少的 col 组成的 index。举个例子，a col 与 b col 组成的 index 称之2 star index。其中超过 3 个含以上的 col 组成的index 称之为 3 star index or Fat index

2 Star index is more reasonable compare with 3 star or Fat index根据index 的原理，index 加速主要是依靠量体小的优势，利用 memory 速度，能快速找到 table page. 因此index 如果越接近于 table 的 col 数量，就越趋近于在进行 full scan 的意思。 index 本身加速依靠 hit rate, 3 star 的 index 本身的利用率其实不高，因为 index 本身搜寻有顺序性的限制。 ex index (a, b) 这一组index 能使用的状况只有在搜寻包含 a or a+b 这两个状况。所以 3 star index 命中率低且额外的消耗 Create/Update/Delete 成本，并不一定划算。

Index 对于Join 的影响

Join 主要的两种方式为 Hash Join & Nest-loop Join ，从时间复杂度的角度分别为 O(N+M), O(N*M)，而 index 在 merge 的步骤，实际上会直接使需要 merge 的量体变小，而考虑到 I/O time 的话，未必 index 能够有效的优化效率。因此在 SQL 的 Search plan 会依照可能的 index hit rate 选择较佳效率的做法，未必会全然使用 index

Example:以参考资料[2]的基本假设作为以下举例，分别的读取速率为：

Random Read: 10 ms

Sequence Read: 1 ms

Fetch Data: 0.1 ms

-- orders has prouducts.id FK
-- order has 10,000,000 rows, products has 20,000,000
select * from orders o, products p where o.product_id = p.id and p.price = 305.65

Eq.1 products.price is a index, hit rate high

index hit rate 极高，假设 p.price = 305.65 只有 10笔 hit

制作 order hash table 约 10,000,000 rows 且塞选符合( o.product_id = p.id ) 使用 sequence read with 3 workers

Merge

Total= 10ms + 10* 10ms + 3,333,333 * 1ms = 10 + 100ms + 3,333s = 3333s

Eq.2 products.price is a index, hit rate low

index hit rate 不高，假设 p.price = 305.65 只有 100,000笔 hit

制作 order hash table 约 10,000,000 rows 且塞选符合( o.product_id = p.id ) 使用 sequence read with 3 workers

Merge

Total= 10ms + 100,000* 10ms + 3,333,333 * 1ms = 10 ms+ 1000s + 3,333s = 4,333s

Eq.3 products.price search by sequnce read

10,000,000 rows at p.price = 305.65 with sequence read

制作 order hash table 约 10,000,000 rows 且塞选符合( o.product_id = p.id ) 使用 sequence read with 3 workers

Merge

Total= 10,000,000 * 1ms + 3,333,333 * 1ms = 10,000s + 3333s = 13333 s

Performance Tuning

Index 成本非常高，建制的时候依靠 random write, 读取 index 通常会使用 random read. 当 filter 搜寻到 raw table 之后大部分情况可能使用 random read + sequence read 取得资料，取决于资料相邻的程度。

思考该 SQL 语句以及使用情境会需要返回多少资料量，进一步推算 index 设计是否合理a. 考虑该 SQL 所下的 filter 适合哪一种搜寻，再进一步设计 index. ex 当语句是属于大范围搜寻且返回资料大的情况下，则未必需要设计 index，因为如果 db 绝大部分使用 random read，有可能实际的速度比 full scan 来得慢，亦或者db 直接使用 full scan 而 index 反而成为拖慢 db 写入速度b. Join 的原理，原则上是会拆成小 table 进行 full a 找 b 表的行为，考虑到 I/O 实际需要经过的资料集，进行 index 设计，或者不使用 Join 搜寻。ex 当 a,b 表都具有 1000 万的资料量，当执行 hash Join , db 会先将比较小的资料集进行一次 filter 并建立 hash table，在对比较大的资料进行一次 full scan and filter. 在这种情况下，index的作用未必有显着效果。

须监控 SQL 的效率，找出 low sql 在近一步讨论 index 是否合理，或者 SQL 是否为最佳。

Reference

[1] https://dev.mysql.com/doc/refman/8.4/en/innodb-buffer-pool.html[2] Lahdenmaki, Tapio, and Mike Leach. Relational Database Index Design and the Optimizers: DB2, Oracle, SQL Server, et al. John Wiley & Sons, 2005.

Overview RDBMS Index 设计

Relation Database Context

Index Perfomance Factories

Index Design

Build A Index Cost

How many stars of index are reasonable?

Index 对于Join 的影响

Performance Tuning

Reference

相关文章