@ebyhr: July 2017

2017-07-22

LINE Developer Meetup 19th

LINEのDeveloper Meetupに初めて行ってみました。普段は福岡でやっているデータ周りの内容を東京で開催してみたとのこと。エンジニアとデータサイエンティストに加えてプランナーという職種を置いている話が印象的でした。以下はざっくりメモです。

東京・福岡のデータ分析チームについて/taiichi
組織内のロールはPlanner, Data scientis, Engineerの3つに分かれている。Plannerを置いている会社が少ない
データの品質管理を自動化(方法が気になるけどお聞きするタイミングがなかった…)
Clova platformはNAVERと共同で開発しているので検索エンジンの技術も使用されている
BIだけでなくプロダクトに近い部分でデータを活用している

大規模テキストマイニングによるユーザーの興味関心抽出及び可視化/tkengo
エンジニアの技術的興味関心を抽出するために、CrawlerとScraperを開発してWebからデータを収集
複数サイトでユーザーIDが異なる場合は外部サービスID連携で突合する
同一IDは同一人物と見越して同じと判断する(仮に別人でも数は非常に少ない)

Apache ZeppelinでPySparkを実行するまで/yuta hongo
ZeppelinはSpark以外にもMySQLなどに対してもクエリを実行できる
フォームを埋め込めるのでクエリが分からない人でも値を簡単に入れることができる

東京-福岡連携で実践するグロースハックプロジェクト/AyakaMatsuoka&doradora09
分析前にヒアリングをしてあいまいなポイントをつぶす
指標の定義が細かいことによる集計コストよりやり直しによるコストの方が大きい
「やること」だけでなく、「やりたくなりそうなこと」「できないこと」を事前に握っておく
リモートのメンバーに背景まで説明しなぜその分析をやるか?の疑問を抱かせない。東京側のメンバーだけで分析テーマやタスクを決定しない
リモート環境で分析する際の課題：コミュニケーションコスト
グロースハックはビジネス力、サービスの成長、スピード感、解釈のしやすさが重要
分析はコストなので、中朝的には利益への貢献が不可欠
施作案も分析結果と合わせて提示できるとベター
ユーザーファネルの月次変化（ミドルがハイや休眠に変化したことを見る）
報告テーマの順番の工夫、シンプルな解釈しやすいメッセージング

References
データ分析のための機械学習入門
Web UI for PrestoDB
Apache Zeppelin

2017-07-17

Teradata SQL Memo

This is just a personal memo about Teradata SQL.

-- list all tables
select * from dbc.tables;

-- get ddl
show table table_name;
show view view_name;

-- show current user
select user;

--get suggestion of collect statistics target
diagnostic helpstats on for session;

-- ctas
create table schema.tablename_new as schema.tablename with data;

-- timestamp string -> timestamp -> date
select cast(cast('2016-03-07 23:59:59' as timestamp) as date)

-- decimal -> date -> string
select cast(cast(20170101 as char(8)) as date format 'YYYYMMDD') (format 'yyyy-mm-dd') (char(10))

--format sample
select cast(cast(9999.9999999 as dec(18,10) format '--ZZ9.9999999999' ) as varchar(50))
select cast(cast(123456789.987654321 as dec(19,10) format '--Z(7)9.9(7)' ) as varchar(50))
select cast(cast(-123456789.987654321 as dec(19,10) format '--Z(7)9.9(7)' ) as varchar(50))

--oreplace
select oreplace('hello warld', 'a', 'o');

--qualify
select *
from dbc.tables
qualify
row_number() over (order by databasename) between 1 and 10

--csum(cumulative sum)
select
version,
csum(version, version)
from dbc.tables

--CSUM with reset logic (GROUP BY)
select
version,
csum(version, version)
from dbc.tables
group by version

--mavg(moving average)
select
version
,mavg(version, 3, version)
from dbc.tables

--mavg with reset
select
version
,mavg(version, 3, version)
from dbc.tables
group by 1

--msum(moving sum)
select
distinct
version
,msum(version, 3, version)
from dbc.tables

--mdiff(moving difference)
select
version
,mdiff(version, 3, version)
from dbc.tables

--rank
select
version
,rank(version)
from dbc.tables

--dense rank
select
version
,dense_rank() over (order by version)
from dbc.tables

--quantile
select
distinct
version
,quantile(10, version desc)
from dbc.tables

--lag and lead
select
calendar_date
,max(calendar_date) over(partition by 1 order by calendar_date rows between 1 preceding and 1 preceding)
,min(calendar_date) over(partition by 1 order by calendar_date rows between 1 following and 1 following)
from SYS_CALENDAR.CALENDAR
order by 1

-- merge
merge into department
using values (700, 'shipping', 990000.00) as dept(deptnum, deptname,budgamt)
on dept.deptnum = department_number
when matched then update
set budget_amount = dept.budgamt
when not matched then insert
values (dept.deptnum, dept.deptname, dept.budgamt, null);

-- with recursive
with recursive all_trips
(origin,
destination,
cost,
depth) as
(
select origin, destination, cost, 0
from flights
where origin = 'lax'
union all
select all_trips.origin,
flights.destination,
all_trips.cost + flights.cost,
all_trips.depth + 1
from all_trips inner join flights
on all_trips.destination = flights.origin
and all_trips.depth < 3
)
select * from all_trips;

Aster SQL Memo

This is just a personal memo about Aster SQL.

-- drop table
drop table if exists schema.table;

-- create fact table
create table schema.table (
id integer,
) distribute by hash(id) compress low;

-- create dimension table
create table schema.table (
id character varying,
) distribute by replication compress low;

-- collect statistics
analyze schema.table;

-- * multiple columns to where-in isn't allowed
select * from schema.t1 where (c1, c2) in (select d1, d2 from schema.t2);
-- > error

-- cast
select cast(c1 as integer) from schema.table;
select c1::integer from schema.table;

-- collaborative filtering
select * from cfilter (
on (select 3)
partition by 3
inputtable ('public.input_table')
outputtable ('public.output_table')
inputcolumns ('c1')
joincolumns ('j1')
droptable ('true')
);

-- canopy
select * from canopy(
on (select 1) partition by 1
inputtable('public.input_table')
loosedistance('130.0')
tightdistance('65.0'))
order by canopyid
;

-- kmeans modeling
select * from kmeans
(on (select 1) partition by 1
inputtable('public.input_table')
outputtable('public.output_table')
numberK('3'))
;

-- kmeans scoring
select * from kmeansplot (
on public.input_table partition by any
on public.kmeans_modeling_output dimension
centroidstable ('public.kmeans_modeling_output')
) order by 2,1

2017-07-15

Presto Parser Introduction

Prestoのパーサー部分ではANTLRが使われています。ファイルは以下のパスにあります。パーサー系で追加したり修正する場合はまずここを参照することになるかと思います。
presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4
以下はCTAS文の抜粋です。例えば、SELECT文でカッコをつけても正常にパースしたい構文を追加する例をまずはあげてみます。

CREATE TABLE (IF NOT EXISTS)? qualifiedName
(COMMENT string)?
(WITH tableProperties)? AS query
(WITH (NO)? DATA)? #createTableAsSelect

修正例はこんな感じになります。右側のカッコはエスケープするためにクォートで囲んでいます。なんとなくイメージがつくと思いますが、|はORの役割を果たしているのでqueryもしくは(query)をこれは表しています。

CREATE TABLE (IF NOT EXISTS)? qualifiedName
(COMMENT string)?
(WITH tableProperties)? AS (query | '('query')')
(WITH (NO)? DATA)? #createTableAsSelect

SqlBase.g4を修正したらpresto-parserに移動してビルドします。そうすることでSqlBaseParser.javaというファイルが自動で作成されます。次に修正するのはAstBuilder.javaになります。この中のvisitCreateTableAsSelectメソッドでCreateTableAsSelectNodeを返却しています。上記のqueryを囲むカッコを追加するケースでは後続の処理には影響しないので、変更する必要がありませんが、例えば(COMMENT string)?のように新たなプロパティを追加する際には、この部分を修正する必要があります。

パーサーから先は呼び出しの階層が深いので、説明が困難ですが他の構文を参考にすれば結構楽に追えると思います。presto-spiがコネクターとのインターフェースを担っているので、新たな構文を追加した際にはここにまずはインターフェースを定義し、コネクター側で実装します。

2017-07-01

Shallow Deep Work

「大事なことに集中する」という本を読んでみて、できるところから実践してみています。和名はこんなですが、Rebuild.fmで紹介されていたDeep Workに関する本です。

読んでみると色々思うことがあり、Donald Knuthのような偉大な人がメールすらやっていないのに、なぜ僕はSNSをやっているんだ!?という気持ちになります。ということで、SNSはfacebook, instagramは削除、twitterは基本閲覧のみ。twitterもできれば辞めたいなぁと思いつつ、たまに知りたかった情報が流れてくることもありとても悩ましい…。LINEも勢い余って削除したんですが、さすがに周りに多少迷惑がかかっていたみたいなので、これは再インストール。

携帯も会社用と個人用に2つ持っていたんですが、これもある意味気が散る原因な感じがするので会社用携帯に一本化してみました。個人用は契約しているデータ容量を下げて、会社用の番号に電話転送の設定を入れています。9月が解約月なので、この時に再考してみます。

仕事中は以前はメールは見た瞬間に返していたのですが、メーラーを常に起動しておくのを止めてみました。返信が必須ではないようなメールは朝イチ、昼休憩、帰る間際のどれかにだいたい返信しています。

本の中ではもっと極端なことが書いてあるのですが、まずはうえに書いたようなレベルで始めてみました。これで時間は以前より捻出できたと思ってますが、中身が伴わないと意味がないので、空いた時間で何をやるかというのは継続して考えていく必要があるなぁと感じています。

CMU DB Systems Spring 2017

Webで公開されているカーネーギーメロン大学のDBに関する授業を一通り見てみました。
CMU Database Systems Spring 2017

多くの情報工学科(CS)でCPUやOSに関する授業はあると思うんですが、DB関連で、SQLや正規化ではなく内部的な話をやる学校ってかなり珍しいのではと思っています。

L05 - Multi-Version Concurrency Control
Append-Only Storage: 全ての更新は新しいタプルをテーブルの空き部分に作成し、参照先をその部分に変更する
Time-Travel Storage: Time-Travelテーブルに複製し、メインテーブルのバージョンを上書きする
Delta Storage: 変更された値のみをDeletaストレージセグメントに複製し、マスターバージョンを上書きする
PKey Indexesはバージョンチェインのヘッドを常に指す。Append-Only
Secondary Indexes: Logical Pointers(Primary Indexに対してPrimary Keyを持つ)とPhysical Pointersの２つに分けられる
Microsoft HEKATON: OLTP Engine for MSSQL started in 2008

L06 - Index Locking & Latching
Index Locks vs. Latches
INDEX: improves the speed of date retrieval. The additional costs are writes and storage spaces.
Order Preserving Indexes: Maintain keys in some sorted order. O(log n)
Hashing Indexes: Maps a hash of the key to particular record. Only supports equality predicates with O(1)
B-tree: Values in all nodes in the tree
B+tree: Only stores values in leaf nodes

Locks vs. Latches

Separate User transactions vs. Threads
Protect Database Contents vs. In-Memory Data Structures
During Entire Transactions vs. Critical Sections
Deadlock Detection & Resolution vs. Avoidance

Latch Implementions

Blocking OS Mutex std::mutex
Test-and-Set Spinlock(TAS) std::atomic<T>
Queue-based Spinlock (MCS) std::atomic<Latch*>
Reader-Writer Locks

Index Locking Schemes

Predicate Locks: Shared lock on SELECT-WHERE, Exclusive lock on UPDATE, INSERT, DELEET-WHERE. Hard to implement
Key-Value Locks: Locks that cover a single key value
Gap Locks
Key-Range Locks
Hierarchical Locking: Allow for a txn to hold wider key-range locks with differecnt locking modes

L07 Latch-free OLTP Indexes
Bw-Tree T-Trees: Based on AVL Trees. Stores pointers to their original values. Uses less memory. Difficult to rebalance, implement safe concurrent access
Skip Lists: Insert and Delete don’t require rebalancing
Index Implementation Issues
Reference Counting: Increment the counting before accessing, Decrement it when finished. Bad performance for multi core CPU
Read-Copy-Update
Epoch Garbage Collection

L11 Database Compression

Block-level
Tuple-level
Attribute-level
Column-level

NAÏVE COMPRESSION
-> LZO(1996), LZ4(2011), Snappy(2011), Zstd(2015)

COLUMNAR COMPRESSION

Null Suppression (Useful in wide tables with sparse data)
Run-length Encoding
Bitmap Encoding (Only practical if the value cardinality is low)
Delta Encoding
Incremental Encoding
Mostly Encoding
Dictionary Encoding

Multi-Attribute Encoding
(original)
val1 val2
A 202
B 101
A 202
C 101
B 101

(compressed)
val1+val2
XX
YY
XX
ZZ
YY

val1 val2 code
A 202 XX
B 101 YY
C 101 ZZ

Order-Preserving Compression
(original)
name
Andrea
Joy
Andy
Dana
(compressed)

name
10
40
20
30
value code
Andrea 10
Andy 20
Dana 30
Joy 40

select * from users where ename like ‘And’
vs
select * from users name between 10 and 20

select * from users where ename like ‘And’
-> Still have to perform seq scan

select distinct * from users where ename like ‘And’
-> Only need to access dictionary

OLTP DBMS cannot use the OLAP compress techniques because we need to support fast random tuple access. Indexes consume a large portion of the memory fo an OLTP database.

L12 Logging Protocols
The gold standard for physical logging & recovery in a disk-oriented DBMS is ARIES Algorithms for Recovery and Isolation Exploting Semantics

Write-Ahead Logging
Repeating History During Redo
Logging Changes During Undo

SiloR uses the epoch-based OCC, it achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery.

The logger threads write buffers out to files. After 100 epoches, it creates a new file
Log record format, Id of the txn that modified the record(TID), (Table, Key, Value)

update people
set islame = true
where name in ('Dana', 'Andy')

Txn#1001
[people, 888, (islame->true)]
[people, 999, (islame->true]

L13 Checkpointing Protocols
Checkpointing allows the systems to ignore large segments of the log to reduce recovery time.

Naïve Snapshots
Hyper - Fork Snapshots
Copy-On-Update Snapshot
VoltDB - Consistent Checkpoints
Wait-Free ZigZag
Wait-Free Ping-Pong

Facebook Scuba
By storing the database shared memory, the DBMS process can restart and the memory contents will survive
On shutdown, copy data from heap to shared memory

L14 Optimizer Implementation (Part 1)
Heuristic-Based Optimization
Define static rules that transform logical operators to a physical plan
Advantages

Easy to implement and debug
Works reasonably well and is fast for simple queries

Disadvantages

Relies on magic constants that predict the efficacy of a planning decision
Nearly impossible to generate good plans when operators have complex inter-dependencies

Heuristics + Cost-Based Join Search
Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables

Advantages

Usually finds a reasonable plan without having to perform an exhaustive search

Disadvantages

Left-deep join trees are not always optimal

Randomized Algorithms
Advantages

jumping around the search space randomly allows the optimizer to get out out of local minimums
Low memory overhead (if no history is kept)

Disadvantages

Difficult to determine why the DBMS may have chosen a particular plan
Have to do extra work to ensure that query plans are deterministic
Still have to implement correctness rules

Starburst Optimizer
Advantages

Works well in practice with fast performance

Disadvantages

Difficult to assign priorities to transformations
Rules maintenance is a huge pain

Volcano Optimizer
Advantages

Use declarative rules to generate transformations
Better extensibility with an efficient search engine

Disadvantages

All equivalence classes are completely expanded to generate all possible logical operators before the optimization search
Not easy to modify predicates

L15 Optimizer Implementation (Part 2)

Stratified Search
First rewrite the logical query plan using transformation rules and then perform a cost-based search to map the logical plan to physical plan
Unified Search
Unify the notion of both logical→logical and logical→physical transformations
Cascade Optimizer
Object oriented implementation of the Volcano query optimzer

Search Termination

Wall-clock Time
Cost Threshold
Transformation Exhaustion

L16 Cost Models
Cost model components are Physical costs, Logical costs and Algorithmic costs.
Postgres cost model uses a combination of CPU and I/O costs.
In memory is 400x faster then disk

Sequntial I/O is 4x faster than random I/O
IBM DB2 Learning Optimizer updates table stats as the DBMS scans a table during normal query processing.

L17 Query Execution & Scheduling

Process Models

Process per DBMS Worker

Each worker is a separated OS process
Use shared-memory for global data structures

Process Pool

A worker uses any process that is free in a pool
Bad for CPU cache locality

Thread per DBMS Worker

Single process with multiple worker threads
Thread crash (may) kill the entire system

Task Assignment

Push: A centralized dispatcher assigns tasks to workers and monitors their progress
Pull: Workers pull the next task from a queue, process it, and then return to get the next task

L18 Parallel Join Algorithms (Hashing)
Many OLTP DBMSs don't implement hash join.
Hash join is the most important operator in a DBMS for OLAP workloads.

Hash join
Phase #1 Partition
Phase #2 Build

Phase #3 Probe

Partition phase
Approach #1 Non-Blocking Partitioning (Shared Partitions or Private Partitions)
Approach #2 Blocking Partitioning (Radix)

Build phase
For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

Probe phase
For each tuple in S, hash its join key and check to see whether there is a match for each tuple in corresponding bucket in the hash table constructed for R.

Hash functions
* MurmurHash (2008)
* Google CityHash (2011)
* Google FarmHash (2014)
* CLHash (2016)

Hash table implementations
* Chained hash table: maintain a linked list of buckets for each slow in the hash table.
* Open-Addressing hash table: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
* Cuckoo hash table: Use multiple hash tables with different hash functions.

L19 Parallel Join Algorithms (Sorting)
SIMD
* Significant performance gains and resource utilization if an algorithm can be verctorized.

Moving data between DRAM and GPU is slow over PCI-E bus

Sort-Merge Join
Phase #1 Sort: Sort the tuples of R and S based on the join key
Phase #2 Merge: The outer relation R only needs to be scanned once

Sorting is always the most expensive part

Parallel Sort-Merge Join
Phase #1 Partitioning (optional): Partition R and assign them to workers / cores
Phase #2 Sort
Phase #3 Merge

Partitioning phase
* Divide the relations into chunks and assign them to cores

Sort phase
* Create runs of sorted chunks of tuples for both input replations
* It used to be that Quicksort was good enough
* Sorting networks is similar to ghost leg

Merge phase
* Iterate through the outer table and inner table in lockstep and compare join keys

Sort-Merge Join Variants
* Multi-way sort-mrge (m-way)
* Multi-pass sort-merge (m-pass)
* Massively parallel sort-merge (mpsm)

L20 Query Compilation
Code generation
* Transpilation
* JIT Compilation: generate IR(intermediate representation) of the query

L21 Vectorized Execution (Part I)
Mutli-Cores vs MIC(Many Integrated Cores)
Streaming SIMD Extensions (SSE) is a collection SIMD instructions that target special 128-bit SIMD registers
The restrict keyword in C++ tells the compiler that the arrays are distinct locations in memory

L22 Vectorized Execution (Part II)

Bitweaving
* Horizontal
* Vertical

L23 Larger-than-Memory Databases

OLTP Issues
* Run-time Operations ➝ Cold Tuple Identification
* Eviction Policies ➝ Timing and Evicted Tuple Metadata
* Data Retrieval Policies ➝ Granularity, Retireval Mechanism and Merging back to memory

L24 Non-Volatile Memory Database Systems

Fundamental elements of circutis
1. capacitor
2. Registor
3, Inductor
4. Memristor

Phase-change memory (PRM)
* A short pulse changes the cell to a '0'
* A long, gradual pulse changes the cell to a '1'

Resistive RAM (ReRAM)
Running a current one direction moves electrons from the top TiO2 layter to the botttom, thereby changing the resistance

Magnetoresistive RAM (MRAM)
Stores data using magnetic storage elements instead of electric charge or current flows

NVM for database systems
Block-addressable NVM is not that interesting.
Byte-addressable NVM will be a game changer but will require some work to use correctly

NVM-Aware memory allocator
#1 synchronization
➝ The allocator writes back CPU cache lines to NVM using the CLFLUSH instruction
➝ It then issues a SFENCE instruction to wait for the data to become durable on NVM

#2 Naming
➝ The allocator ensures that virtual memory addresses assigned to a memory-mapped region never change even after the OS or DBMS restarts.

About Me

2017-07-22

LINE Developer Meetup 19th

2017-07-17

Teradata SQL Memo

Aster SQL Memo

2017-07-15

Presto Parser Introduction

2017-07-01

Shallow Deep Work

CMU DB Systems Spring 2017