@ebyhr: CMU DB Systems Spring 2017

Webで公開されているカーネーギーメロン大学のDBに関する授業を一通り見てみました。
CMU Database Systems Spring 2017

多くの情報工学科(CS)でCPUやOSに関する授業はあると思うんですが、DB関連で、SQLや正規化ではなく内部的な話をやる学校ってかなり珍しいのではと思っています。

L05 - Multi-Version Concurrency Control
Append-Only Storage: 全ての更新は新しいタプルをテーブルの空き部分に作成し、参照先をその部分に変更する
Time-Travel Storage: Time-Travelテーブルに複製し、メインテーブルのバージョンを上書きする
Delta Storage: 変更された値のみをDeletaストレージセグメントに複製し、マスターバージョンを上書きする
PKey Indexesはバージョンチェインのヘッドを常に指す。Append-Only
Secondary Indexes: Logical Pointers(Primary Indexに対してPrimary Keyを持つ)とPhysical Pointersの２つに分けられる
Microsoft HEKATON: OLTP Engine for MSSQL started in 2008

L06 - Index Locking & Latching
Index Locks vs. Latches
INDEX: improves the speed of date retrieval. The additional costs are writes and storage spaces.
Order Preserving Indexes: Maintain keys in some sorted order. O(log n)
Hashing Indexes: Maps a hash of the key to particular record. Only supports equality predicates with O(1)
B-tree: Values in all nodes in the tree
B+tree: Only stores values in leaf nodes

Locks vs. Latches

Separate User transactions vs. Threads
Protect Database Contents vs. In-Memory Data Structures
During Entire Transactions vs. Critical Sections
Deadlock Detection & Resolution vs. Avoidance

Latch Implementions

Blocking OS Mutex std::mutex
Test-and-Set Spinlock(TAS) std::atomic<T>
Queue-based Spinlock (MCS) std::atomic<Latch*>
Reader-Writer Locks

Index Locking Schemes

Predicate Locks: Shared lock on SELECT-WHERE, Exclusive lock on UPDATE, INSERT, DELEET-WHERE. Hard to implement
Key-Value Locks: Locks that cover a single key value
Gap Locks
Key-Range Locks
Hierarchical Locking: Allow for a txn to hold wider key-range locks with differecnt locking modes

L07 Latch-free OLTP Indexes
Bw-Tree T-Trees: Based on AVL Trees. Stores pointers to their original values. Uses less memory. Difficult to rebalance, implement safe concurrent access
Skip Lists: Insert and Delete don’t require rebalancing
Index Implementation Issues
Reference Counting: Increment the counting before accessing, Decrement it when finished. Bad performance for multi core CPU
Read-Copy-Update
Epoch Garbage Collection

L11 Database Compression

Block-level
Tuple-level
Attribute-level
Column-level

NAÏVE COMPRESSION
-> LZO(1996), LZ4(2011), Snappy(2011), Zstd(2015)

COLUMNAR COMPRESSION

Null Suppression (Useful in wide tables with sparse data)
Run-length Encoding
Bitmap Encoding (Only practical if the value cardinality is low)
Delta Encoding
Incremental Encoding
Mostly Encoding
Dictionary Encoding

Multi-Attribute Encoding
(original)
val1 val2
A 202
B 101
A 202
C 101
B 101

(compressed)
val1+val2
XX
YY
XX
ZZ
YY

val1 val2 code
A 202 XX
B 101 YY
C 101 ZZ

Order-Preserving Compression
(original)
name
Andrea
Joy
Andy
Dana
(compressed)

name
10
40
20
30
value code
Andrea 10
Andy 20
Dana 30
Joy 40

select * from users where ename like ‘And’
vs
select * from users name between 10 and 20

select * from users where ename like ‘And’
-> Still have to perform seq scan

select distinct * from users where ename like ‘And’
-> Only need to access dictionary

OLTP DBMS cannot use the OLAP compress techniques because we need to support fast random tuple access. Indexes consume a large portion of the memory fo an OLTP database.

L12 Logging Protocols
The gold standard for physical logging & recovery in a disk-oriented DBMS is ARIES Algorithms for Recovery and Isolation Exploting Semantics

Write-Ahead Logging
Repeating History During Redo
Logging Changes During Undo

SiloR uses the epoch-based OCC, it achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery.

The logger threads write buffers out to files. After 100 epoches, it creates a new file
Log record format, Id of the txn that modified the record(TID), (Table, Key, Value)

update people
set islame = true
where name in ('Dana', 'Andy')

Txn#1001
[people, 888, (islame->true)]
[people, 999, (islame->true]

L13 Checkpointing Protocols
Checkpointing allows the systems to ignore large segments of the log to reduce recovery time.

Naïve Snapshots
Hyper - Fork Snapshots
Copy-On-Update Snapshot
VoltDB - Consistent Checkpoints
Wait-Free ZigZag
Wait-Free Ping-Pong

Facebook Scuba
By storing the database shared memory, the DBMS process can restart and the memory contents will survive
On shutdown, copy data from heap to shared memory

L14 Optimizer Implementation (Part 1)
Heuristic-Based Optimization
Define static rules that transform logical operators to a physical plan
Advantages

Easy to implement and debug
Works reasonably well and is fast for simple queries

Disadvantages

Relies on magic constants that predict the efficacy of a planning decision
Nearly impossible to generate good plans when operators have complex inter-dependencies

Heuristics + Cost-Based Join Search
Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables

Advantages

Usually finds a reasonable plan without having to perform an exhaustive search

Disadvantages

Left-deep join trees are not always optimal

Randomized Algorithms
Advantages

jumping around the search space randomly allows the optimizer to get out out of local minimums
Low memory overhead (if no history is kept)

Disadvantages

Difficult to determine why the DBMS may have chosen a particular plan
Have to do extra work to ensure that query plans are deterministic
Still have to implement correctness rules

Starburst Optimizer
Advantages

Works well in practice with fast performance

Disadvantages

Difficult to assign priorities to transformations
Rules maintenance is a huge pain

Volcano Optimizer
Advantages

Use declarative rules to generate transformations
Better extensibility with an efficient search engine

Disadvantages

All equivalence classes are completely expanded to generate all possible logical operators before the optimization search
Not easy to modify predicates

L15 Optimizer Implementation (Part 2)

Stratified Search
First rewrite the logical query plan using transformation rules and then perform a cost-based search to map the logical plan to physical plan
Unified Search
Unify the notion of both logical→logical and logical→physical transformations
Cascade Optimizer
Object oriented implementation of the Volcano query optimzer

Search Termination

Wall-clock Time
Cost Threshold
Transformation Exhaustion

L16 Cost Models
Cost model components are Physical costs, Logical costs and Algorithmic costs.
Postgres cost model uses a combination of CPU and I/O costs.
In memory is 400x faster then disk

Sequntial I/O is 4x faster than random I/O
IBM DB2 Learning Optimizer updates table stats as the DBMS scans a table during normal query processing.

L17 Query Execution & Scheduling

Process Models

Process per DBMS Worker

Each worker is a separated OS process
Use shared-memory for global data structures

Process Pool

A worker uses any process that is free in a pool
Bad for CPU cache locality

Thread per DBMS Worker

Single process with multiple worker threads
Thread crash (may) kill the entire system

Task Assignment

Push: A centralized dispatcher assigns tasks to workers and monitors their progress
Pull: Workers pull the next task from a queue, process it, and then return to get the next task

L18 Parallel Join Algorithms (Hashing)
Many OLTP DBMSs don't implement hash join.
Hash join is the most important operator in a DBMS for OLAP workloads.

Hash join
Phase #1 Partition
Phase #2 Build

Phase #3 Probe

Partition phase
Approach #1 Non-Blocking Partitioning (Shared Partitions or Private Partitions)
Approach #2 Blocking Partitioning (Radix)

Build phase
For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

Probe phase
For each tuple in S, hash its join key and check to see whether there is a match for each tuple in corresponding bucket in the hash table constructed for R.

Hash functions
* MurmurHash (2008)
* Google CityHash (2011)
* Google FarmHash (2014)
* CLHash (2016)

Hash table implementations
* Chained hash table: maintain a linked list of buckets for each slow in the hash table.
* Open-Addressing hash table: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
* Cuckoo hash table: Use multiple hash tables with different hash functions.

L19 Parallel Join Algorithms (Sorting)
SIMD
* Significant performance gains and resource utilization if an algorithm can be verctorized.

Moving data between DRAM and GPU is slow over PCI-E bus

Sort-Merge Join
Phase #1 Sort: Sort the tuples of R and S based on the join key
Phase #2 Merge: The outer relation R only needs to be scanned once

Sorting is always the most expensive part

Parallel Sort-Merge Join
Phase #1 Partitioning (optional): Partition R and assign them to workers / cores
Phase #2 Sort
Phase #3 Merge

Partitioning phase
* Divide the relations into chunks and assign them to cores

Sort phase
* Create runs of sorted chunks of tuples for both input replations
* It used to be that Quicksort was good enough
* Sorting networks is similar to ghost leg

Merge phase
* Iterate through the outer table and inner table in lockstep and compare join keys

Sort-Merge Join Variants
* Multi-way sort-mrge (m-way)
* Multi-pass sort-merge (m-pass)
* Massively parallel sort-merge (mpsm)

L20 Query Compilation
Code generation
* Transpilation
* JIT Compilation: generate IR(intermediate representation) of the query

L21 Vectorized Execution (Part I)
Mutli-Cores vs MIC(Many Integrated Cores)
Streaming SIMD Extensions (SSE) is a collection SIMD instructions that target special 128-bit SIMD registers
The restrict keyword in C++ tells the compiler that the arrays are distinct locations in memory

L22 Vectorized Execution (Part II)

Bitweaving
* Horizontal
* Vertical

L23 Larger-than-Memory Databases

OLTP Issues
* Run-time Operations ➝ Cold Tuple Identification
* Eviction Policies ➝ Timing and Evicted Tuple Metadata
* Data Retrieval Policies ➝ Granularity, Retireval Mechanism and Merging back to memory

L24 Non-Volatile Memory Database Systems

Fundamental elements of circutis
1. capacitor
2. Registor
3, Inductor
4. Memristor

Phase-change memory (PRM)
* A short pulse changes the cell to a '0'
* A long, gradual pulse changes the cell to a '1'

Resistive RAM (ReRAM)
Running a current one direction moves electrons from the top TiO2 layter to the botttom, thereby changing the resistance

Magnetoresistive RAM (MRAM)
Stores data using magnetic storage elements instead of electric charge or current flows

NVM for database systems
Block-addressable NVM is not that interesting.
Byte-addressable NVM will be a game changer but will require some work to use correctly

NVM-Aware memory allocator
#1 synchronization
➝ The allocator writes back CPU cache lines to NVM using the CLFLUSH instruction
➝ It then issues a SFENCE instruction to wait for the data to become durable on NVM

#2 Naming
➝ The allocator ensures that virtual memory addresses assigned to a memory-mapped region never change even after the OS or DBMS restarts.

About Me

2017-07-01

CMU DB Systems Spring 2017