About Me

My photo
Software Engineer at Starburst. Maintainer at Trino. Previously at LINE, Teradata, HPE.


CMU DB Systems Spring 2017

CMU Database Systems Spring 2017


L05 - Multi-Version Concurrency Control
Append-Only Storage: 全ての更新は新しいタプルをテーブルの空き部分に作成し、参照先をその部分に変更する
Time-Travel Storage: Time-Travelテーブルに複製し、メインテーブルのバージョンを上書きする
Delta Storage: 変更された値のみをDeletaストレージセグメントに複製し、マスターバージョンを上書きする
PKey Indexesはバージョンチェインのヘッドを常に指す。Append-Only
Secondary Indexes: Logical Pointers(Primary Indexに対してPrimary Keyを持つ)とPhysical Pointersの2つに分けられる
Microsoft HEKATON: OLTP Engine for MSSQL started in 2008

L06 - Index Locking & Latching 
Index Locks vs. Latches
INDEX: improves the speed of date retrieval. The additional costs are writes and storage spaces.
Order Preserving Indexes: Maintain keys in some sorted order. O(log n)
Hashing Indexes: Maps a hash of the key to particular record. Only supports equality predicates with O(1)
B-tree: Values in all nodes in the tree
B+tree: Only stores values in leaf nodes

Locks vs. Latches

  • Separate User transactions vs. Threads
  • Protect Database Contents vs. In-Memory Data Structures
  • During Entire Transactions vs. Critical Sections
  • Deadlock Detection & Resolution vs. Avoidance

Latch Implementions

  • Blocking OS Mutex std::mutex
  • Test-and-Set Spinlock(TAS) std::atomic<T>
  • Queue-based Spinlock (MCS) std::atomic<Latch*>
  • Reader-Writer Locks

Index Locking Schemes

  • Predicate Locks: Shared lock on SELECT-WHERE, Exclusive lock on UPDATE, INSERT, DELEET-WHERE. Hard to implement
  • Key-Value Locks: Locks that cover a single key value
  • Gap Locks
  • Key-Range Locks
  • Hierarchical Locking: Allow for a txn to hold wider key-range locks with differecnt locking modes

L07 Latch-free OLTP Indexes
Bw-Tree T-Trees: Based on AVL Trees. Stores pointers to their original values. Uses less memory. Difficult to rebalance, implement safe concurrent access
Skip Lists: Insert and Delete don’t require rebalancing
Index Implementation Issues
Reference Counting: Increment the counting before accessing, Decrement it when finished. Bad performance for multi core CPU
Epoch Garbage Collection

L11 Database Compression

  • Block-level
  • Tuple-level
  • Attribute-level
  • Column-level
  -> LZO(1996), LZ4(2011), Snappy(2011), Zstd(2015)


  • Null Suppression (Useful in wide tables with sparse data)
  • Run-length Encoding
  • Bitmap Encoding (Only practical if the value cardinality is low)
  • Delta Encoding
  • Incremental Encoding
  • Mostly Encoding
  • Dictionary Encoding
Multi-Attribute Encoding
val1 val2
A    202
B    101
A    202
C    101
B    101


val1 val2 code
A    202  XX
B    101  YY
C    101  ZZ

Order-Preserving Compression

value  code
Andrea 10
Andy   20
Dana   30
Joy    40

select * from users where ename like ‘And’
select * from users name between 10 and 20

select * from users where ename like ‘And’
  -> Still have to perform seq scan

select distinct * from users where ename like ‘And’
  -> Only need to access dictionary

OLTP DBMS cannot use the OLAP compress techniques because we need to support fast random tuple access. Indexes consume a large portion of the memory fo an OLTP database.

L12 Logging Protocols
The gold standard for physical logging & recovery in a disk-oriented DBMS is ARIES Algorithms for Recovery and Isolation Exploting Semantics

  • Write-Ahead Logging
  • Repeating History During Redo
  • Logging Changes During Undo

SiloR uses the epoch-based OCC, it achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery.

  • The logger threads write buffers out to files. After 100 epoches, it creates a new file
  • Log record format, Id of the txn that modified the record(TID), (Table, Key, Value)

update people
set islame = true
where name in ('Dana', 'Andy')

[people, 888, (islame->true)]
[people, 999, (islame->true]

L13 Checkpointing Protocols
Checkpointing allows the systems to ignore large segments of the log to reduce recovery time.

  • Naïve Snapshots
  • Hyper - Fork Snapshots
  • Copy-On-Update Snapshot
  • VoltDB - Consistent Checkpoints
  • Wait-Free ZigZag
  • Wait-Free Ping-Pong

Facebook Scuba
By storing the database shared memory, the DBMS process can restart and the memory contents will survive
On shutdown, copy data from heap to shared memory

L14 Optimizer Implementation (Part 1)
Heuristic-Based Optimization
Define static rules that transform logical operators to a physical plan
  • Easy to implement and debug
  • Works reasonably well and is fast for simple queries

  • Relies on magic constants that predict the efficacy of a planning decision
  • Nearly impossible to generate good plans when operators have complex inter-dependencies
Heuristics + Cost-Based Join Search
Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables

  • Usually finds a reasonable plan without having to perform an exhaustive search

  • Left-deep join trees are not always optimal
Randomized Algorithms
  • jumping around the search space randomly allows the optimizer to get out out of local minimums
  • Low memory overhead (if no history is kept)

  • Difficult to determine why the DBMS may have chosen a particular plan
  • Have to do extra work to ensure that query plans are deterministic
  • Still have to implement correctness rules

Starburst Optimizer
  • Works well in practice with fast performance

  • Difficult to assign priorities to transformations
  • Rules maintenance is a huge pain

Volcano Optimizer
  • Use declarative rules to generate transformations
  • Better extensibility with an efficient search engine

  • All equivalence classes are completely expanded to generate all possible logical operators before the optimization search
  • Not easy to modify predicates

L15 Optimizer Implementation (Part 2)
  • Stratified Search
    First rewrite the logical query plan using transformation rules and then perform a cost-based search to map the logical plan to physical plan
  • Unified Search
    Unify the notion of both logical→logical and logical→physical transformations
  • Cascade Optimizer
    Object oriented implementation of the Volcano query optimzer

Search Termination
  • Wall-clock Time
  • Cost Threshold
  • Transformation Exhaustion

L16 Cost Models
Cost model components are Physical costs, Logical costs and Algorithmic costs.
Postgres cost model uses a combination of CPU and I/O costs.
In memory is 400x faster then disk
  • Sequntial I/O is 4x faster than random I/O
  • IBM DB2 Learning Optimizer updates table stats as the DBMS scans a table during normal query processing.

L17 Query Execution & Scheduling

Process Models
  • Process per DBMS Worker
    • Each worker is a separated OS process
    • Use shared-memory for global data structures
  • Process Pool
    • A worker uses any process that is free in a pool
    • Bad for CPU cache locality
  • Thread per DBMS Worker
    • Single process with multiple worker threads
    • Thread crash (may) kill the entire system

Task Assignment
  • Push: A centralized dispatcher assigns tasks to workers and monitors their progress
  • Pull: Workers pull the next task from a queue, process it, and then return to get the next task
L18 Parallel Join Algorithms (Hashing)
Many OLTP DBMSs don't implement hash join.
Hash join is the most important operator in a DBMS for OLAP workloads.

Hash join
Phase #1 Partition
Phase #2 Build
Phase #3 Probe

Partition phase
Approach #1 Non-Blocking Partitioning (Shared Partitions or Private Partitions)
Approach #2 Blocking Partitioning (Radix)

Build phase
For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

Probe phase
For each tuple in S, hash its join key and check to see whether there is a match for each tuple in corresponding bucket in the hash table constructed for R.

Hash functions
* MurmurHash (2008)
* Google CityHash (2011)
* Google FarmHash (2014)
* CLHash (2016)

Hash table implementations
* Chained hash table: maintain a linked list of buckets for each slow in the hash table.
* Open-Addressing hash table: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
* Cuckoo hash table: Use multiple hash tables with different hash functions.

L19 Parallel Join Algorithms (Sorting)
* Significant performance gains and resource utilization if an algorithm can be verctorized.

Moving data between DRAM and GPU is slow over PCI-E bus

Sort-Merge Join
Phase #1 Sort: Sort the tuples of R and S based on the join key
Phase #2 Merge: The outer relation R only needs to be scanned once

Sorting is always the most expensive part

Parallel Sort-Merge Join
Phase #1 Partitioning (optional): Partition R and assign them to workers / cores
Phase #2 Sort
Phase #3 Merge

Partitioning phase
* Divide the relations into chunks and assign them to cores

Sort phase
* Create runs of sorted chunks of tuples for both input replations
* It used to be that Quicksort was good enough
* Sorting networks is similar to ghost leg

Merge phase
* Iterate through the outer table and inner table in lockstep and compare join keys

Sort-Merge Join Variants
* Multi-way sort-mrge (m-way)
* Multi-pass sort-merge (m-pass)
* Massively parallel sort-merge (mpsm)

L20 Query Compilation
Code generation
* Transpilation
* JIT Compilation: generate IR(intermediate representation) of the query

L21 Vectorized Execution (Part I)
Mutli-Cores vs MIC(Many Integrated Cores)
Streaming SIMD Extensions (SSE) is a collection SIMD instructions that target special 128-bit SIMD registers
The restrict keyword in C++ tells the compiler that the arrays are distinct locations in memory

L22 Vectorized Execution (Part II)
* Horizontal
* Vertical

OLTP Issues
* Run-time Operations ➝ Cold Tuple Identification
* Eviction Policies ➝ Timing and Evicted Tuple Metadata
* Data Retrieval Policies ➝ Granularity, Retireval Mechanism and Merging back to memory

Fundamental elements of circutis
1. capacitor
2. Registor
3, Inductor
4. Memristor

Phase-change memory (PRM)
* A short pulse changes the cell to a '0'
* A long, gradual pulse changes the cell to a '1'

Resistive RAM (ReRAM)
Running a current one direction moves electrons from the top TiO2 layter to the botttom, thereby changing the resistance

Magnetoresistive RAM (MRAM)
Stores data using magnetic storage elements instead of electric charge or current flows

NVM for database systems
Block-addressable NVM is not that interesting.
Byte-addressable NVM will be a game changer but will require some work to use correctly

NVM-Aware memory allocator
#1 synchronization
➝ The allocator writes back CPU cache lines to NVM using the CLFLUSH instruction
➝ It then issues a SFENCE instruction to wait for the data to become durable on NVM

#2 Naming
➝ The allocator ensures that virtual memory addresses assigned to a memory-mapped region never change even after the OS or DBMS restarts.