Dec 10, 2017

Machine Learning Book


I finished reading 「仕事ではじめる機械学習」, which a colleague recommended. I've read several machine learning books before, but this one was the easiest to follow, with plenty of figures and graphs.

SVM

  • There are two common kernels: linear and RBF
  • The RBF kernel can classify data that is not linearly separable
  • The hinge loss is used as the loss function
  • Margin maximization helps suppress overfitting
  • The linear kernel is often used for sparse data such as text; the RBF kernel for dense data such as images and audio signals

Neural Networks

  • Can separate data that is not linearly separable
  • Training takes time
  • Prone to overfitting because of the large number of parameters
  • The values computed by the output layer are often normalized with the softmax function so they can be treated as probabilities
  • ReLU is commonly used as the activation function
  • Trained via backpropagation

k-NN

  • k means the number of neighbors that vote
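As a toy illustration of that voting (my own sketch, assuming 1-D points and Euclidean distance; not from the book):

```python
# Minimal k-NN sketch: the k nearest training points vote on the label.
from collections import Counter

def knn_predict(train, x, k):
    """train: list of (point, label) pairs; returns the majority label
    among the k nearest points (k = the number of voters)."""
    neighbors = sorted(train, key=lambda pl: abs(pl[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [(0.0, "a"), (0.1, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
print(knn_predict(train, 0.2, 3))  # -> "a" (two of the three nearest are "a")
```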

Decision Trees, Random Forests, GBDT

  • A random forest prepares several combinations of features to use, trains multiple learners, and merges the predictions of the well-performing learners by majority vote
  • GBDT sequentially fits shallow trees to sampled data
  • Combining the results of multiple learners is called ensemble learning

PCA (principal component analysis) and t-SNE are well-known methods for dimensionality reduction

One-Class SVM can be used for anomaly detection

Evaluating classification: accuracy, precision, recall, F-measure

  • Accuracy: correct predictions / all predictions
  • Precision: true positives / all predicted positives
  • Recall: true positives / all actual positives
  • F-measure: harmonic mean of precision and recall

Regression is evaluated with mean squared error
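The formulas above are easy to check by hand; a minimal Python sketch (my own toy example, assuming binary labels with 1 as the positive class):

```python
# Accuracy, precision, recall, F-measure for binary classification,
# plus mean squared error for regression.
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp)   # hits / predicted positives
    recall = tp / (tp + fn)      # hits / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

acc, prec, rec, f1 = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
# tp=2, fp=0, fn=1 -> accuracy 0.75, precision 1.0, recall 2/3, F1 0.8
```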

Hinge Loss

  • Penalizes even points that are classified correctly but sit right at the decision boundary, which is what creates the margin
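In code the hinge loss is just a clipped linear function; a quick sketch (labels in {-1, +1}, score = the model's raw output):

```python
# Hinge loss: zero only when the point is correct AND outside the margin.
def hinge_loss(y, score):
    return max(0.0, 1.0 - y * score)

assert hinge_loss(1, 2.0) == 0.0   # correct, outside the margin: no penalty
assert hinge_loss(1, 0.5) == 0.5   # correct, but barely: still penalized
assert hinge_loss(1, -1.0) == 2.0  # misclassified: large penalty
```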

If the domain of publicly distributed data differs from the one you actually work with, techniques such as semi-supervised learning or transfer learning can be applied

A/A test: random assignment should yield two homogeneous groups; verify that first, and if there is no difference, apply the intervention to one of them

Recommendation use cases in e-commerce

  • Overview recommendations: statistics-based suggestions such as "this week's popular items"; effective for users who are just starting out or who visit only occasionally
  • User reviews: display the ratings and comments users have left, e.g. as aggregate statistics
  • Notification services
  • Related-item recommendations
  • Personalization

For events that happen only a few times in a lifetime, data is hard to accumulate, so you need a mechanism to substitute other metrics such as PV (page views)

Collaborative Filtering

There is a cold-start problem: recommendation data is lacking for new users and new items. Content-based filtering makes recommendations comparatively easy in this situation, but it requires maintaining a dictionary.

A long-tailed distribution in which frequency decreases along the tail is described by Zipf's law.

Uplift Modeling

Unlike an ordinary A/B test, you split the population into a treatment group and a control group, and rather than simply checking whether subjects responded, you examine which feature profiles responded (or did not) in each group.

  • Lost causes: do not convert whether or not the intervention happens
  • Persuadables: convert only because of the intervention ← the segment you most want to find
  • Do-not-disturbs: would convert if left alone, but stop converting when intervened on
  • Sure things: convert with or without the intervention; ads served to this segment get good response rates, but beware that the value per conversion can end up low
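The segments can be sketched with a naive frequency-based version of the two-model idea (made-up data and a hypothetical feature; a real uplift model would use trained classifiers):

```python
# Estimate uplift = CVR(treatment) - CVR(control) per feature value.
from collections import defaultdict

def conversion_rate(samples):
    """samples: list of (feature, converted) pairs -> {feature: rate}"""
    counts = defaultdict(lambda: [0, 0])  # feature -> [conversions, total]
    for feature, converted in samples:
        counts[feature][0] += converted
        counts[feature][1] += 1
    return {f: conv / total for f, (conv, total) in counts.items()}

# Hypothetical data: (feature, converted?) per user in each group
treatment = [("young", 1), ("young", 1), ("old", 0), ("old", 0)]
control = [("young", 0), ("young", 0), ("old", 0), ("old", 1)]

t_rate = conversion_rate(treatment)
c_rate = conversion_rate(control)
uplift = {f: t_rate[f] - c_rate[f] for f in t_rate}
# positive uplift -> persuadable-like; negative -> do-not-disturb-like
print(uplift)  # {'young': 1.0, 'old': -0.5}
```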

Uplift models are evaluated with AUUC (Area Under the Uplift Curve)

Dec 6, 2017

Dogical Thinking


While looking for 「ニューヨークの魔法のかかり方」 at a bookstore, I happened upon a book called 「犬が伝えたかったこと」; the Shiba Inu on the cover was so cute that I bought it too and brought it home.
It collects episodes about dogs from twenty people, and rather than just playing up the cute moments or the sadness of parting, it depicts what living with a dog really means. Between the episodes it also introduces proper training methods and research about dogs, most of which was new to me. According to one study, living with a dog inevitably leads to a healthier lifestyle thanks to daily walks, so if every Japanese person lived with a dog, the national health-insurance burden would drop by four trillion yen.
What stuck with me most is that dogs have excellent memories but are poor at prediction. They can remember "I was scolded when I did X," but that never turns into "so I'll stop doing X." You can turn this around: when a dog barks needlessly, it's fine to scold it, but if you hold its muzzle to stop the barking and then praise it, it first learns "I was scolded for barking" and then learns the result "I was praised for not barking." Dogs seem to run purely on emotion, yet in a way they are logical, which is fascinating.
Sometimes when I watch Pomeranian videos online, I get an overwhelming sense that this face is thinking about absolutely nothing, and after reading this book I thought back and reached the very same conclusion: still thinking about absolutely nothing! If you see this book at a store, do pick it up. The illustrations inside are colorful and cute, so I recommend the print edition 🐶

Dec 2, 2017

Hive Study


I've been rereading the Hadoop elephant book. The third edition came out in 2013, which brings back hazy memories of buying it around the time I was assigned to a data-related department at hp. Today I focused on the Hive chapters.
Hive does not validate data at load time but at query time. This is called schema on read.
compact/bitmap indexes
compact: stores HDFS block numbers rather than offsets within each file; effective when values cluster in nearby rows
bitmap: stores the rows in which a given value appears as compressed bit sets; normally used on low-cardinality columns
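As a toy illustration of the bitmap-index idea (my own sketch, modeling each bitmap as a Python set of row positions; a real index stores compressed bit vectors):

```python
# One "bitmap" per distinct value: which rows contain that value.
from collections import defaultdict

def build_bitmap_index(rows):
    index = defaultdict(set)
    for pos, value in enumerate(rows):
        index[value].add(pos)
    return index

# A low-cardinality column, e.g. a status flag
rows = ["open", "closed", "open", "open", "closed"]
index = build_bitmap_index(rows)
print(sorted(index["open"]))  # [0, 2, 3]
```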
HiveQL never set SQL-92 compliance as a goal; features have been added to satisfy user needs.
Managed / external tables
Managed: load data inpath moves the files; drop deletes both the metadata and the data itself
External: the location path is not checked at table-creation time; drop deletes only the metadata
Buckets
Joining two tables bucketed on the same column can be implemented efficiently as a map-side join.

create table bucketed_users (id int, name string)
clustered by (id) into 4 buckets
;
create table bucketed_users (id int, name string)
clustered by (id) sorted by (id asc) into 4 buckets
;
select * from bucketed_users
tablesample(bucket 1 out of 4 on id)
;

Inserting into multiple tables

from s1
insert overwrite table t1
 select c1, count(c1)
 group by c1
insert overwrite table t2
 select c2, count(c2)
 group by c2
;   

Duplicating an empty table

create table t1_new like t1
;

Sorting
sort by: produces a sorted output file per reducer
distribute by: controls which reducer a particular row goes to

from records2
select year, temperature
distribute by year
sort by year asc, temperature desc
;

Inspecting functions

show functions
;
describe function length
;

Map joins: if one table is small enough to fit in memory, the smaller table can be loaded into each mapper's memory to perform the join.

select /*+ MAPJOIN(things) */ sales.*, things.*
from sales
 join things on (sales.id = things.id)
;

Views
Not materialized to disk at creation time; the view's select statement runs when the view is referenced.
User-defined functions
UDF: takes a single row and produces a single output row
UDAF (User-Defined Aggregate Function): takes multiple rows and produces a single output row
UDTF (User-Defined Table-generating Function): operates on a single row and emits multiple rows
To create a UDF, extend org.apache.hadoop.hive.ql.exec.UDF and implement at least an evaluate method. To register it with Hive, package it as a jar file and run add jar and create temporary function.
add jar /path/to/hive-examples.jar
create temporary function strip as 'com.hadoopbook.hive.strip'

(Writing a UDAF differs from a UDF in places, but I'll skip it here)

Nov 27, 2017

Teradata IN & regexp_split_to_table

Here's a little trick in Teradata SQL for receiving the contents of an IN clause as a string parameter.

First, prepare some data.

drop table t1
;
create table t1(c1 int, c2 varchar(1))
;
insert into t1 values (1, 'a')
;
insert into t1 values (2, 'b')
;
insert into t1 values (3, 'c')
;
select * from t1
;
c1, c2
1,  a
2,  b
3,  c

If you want to fetch the first and third rows of this table with an IN clause, the query looks something like this.

select
*
from t1
where c2 in ('a', 'c')
order by 1
;
c1,c2
1, a
3, c

When the string inside those parentheses is received as a parameter, the interface may automatically escape the single quotes. If a query like the following is passed, Teradata returns a syntax error.

select
*
from t1
where c2 in (''a'', ''c'')
;

Those familiar with Teradata might think of using oreplace to collapse the doubled single quotes into one, but then the whole list is treated as a single string and no rows match.

select
*
from t1
where c2 in (oreplace('''a'', ''c''', '''''', ''''))
;

Enter regexp_split_to_table, a little-known function. As the name split_to_table suggests, it splits a string vertically into table rows.

select
*
from table (regexp_split_to_table(1,'''a'', ''c''', ',', 'i')
 returns (id integer, tokennum integer, token varchar(10) character set unicode)) as dt
;
id, tokennum, token
1, 1, 'a'
1, 2, 'c'

All we need is the token column, so trim it into shape and use it as the IN-clause condition, and we're done.

select
*
from t1
where c2 in (
 select
 trim(both '''' from trim(both ' ' from token)) as token
 from table (regexp_split_to_table(1,'''a'', ''c''', ',', 'i')
  returns (id integer, tokennum integer, token varchar(10) character set unicode)) as dt
)
order by 1
;

c1,c2
1, a
3, c

regexp_split_to_table was never meant to be used this way, but remembering that such a function exists might come in handy someday. Or it might not 👻

Jupyteradata

I'm in the middle of connecting to Teradata from Jupyter Notebook and trying things out (training, really).
I figured out how to visualize wide-format data nicely, but for long-format tables, e.g. drawing multiple lines in a line chart, I couldn't find the option that defines which line a row belongs to in any of Pandas, Bokeh, or Plotly. Hmmm.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import teradata
import pandas as pd
import numpy as np

plt.style.use('ggplot')
font = {'family' : 'meiryo'}
matplotlib.rc('font', **font)

udaExec = teradata.UdaExec (appName="Teradata", version="1.0", logConsole=False)
session = udaExec.connect(method="odbc", dsn="server name", username="user name", password="******");
cursor =  session.cursor();

sql = '''
select
 vproc
,currentperm
,peakperm
from dbc.tablesize
where
databasename= 'dbc'
and tablename = 'AccessRights'
;
'''
df = pd.read_sql(sql, session)
df.plot.bar(x=['Vproc'], y=['CurrentPerm', 'PeakPerm'], alpha=1.0, figsize=(12,3))
plt.title(u'Skew', size=16)

from bokeh.plotting import figure
from bokeh.io import output_notebook, show

p = figure(title="Bokeh Graph", x_axis_label='x', y_axis_label='y')
p.vbar(df['Vproc'], top=df['CurrentPerm'], width=0.2, bottom=0, color="#CAB2D6")
show(p)


Oct 16, 2017

(WIP)Teradata 14 SQL

Started preparing for the Teradata 14 SQL exam.

Teradata SQL - V14.10.6
-- Prepare
drop table t1;
create table t1(c1 int, c2 int);
insert into t1 values (1, 1);
insert into t1 values (2, 2);
-- SET
-- invalid
select * from t1
union
select * from t1
order by c1,c2
;

-- valid
select * from t1
union
select * from t1
order by 1,2
;
/* Macro */
create macro select1 as
(select 1;);

replace macro select1 as
(select 1;);

show macro select1;
help macro select1;

exec select1;
drop macro select1;

replace macro multiselect as
(select 1;
 select 2;);
exec multiselect;
-- returns 2

replace macro selectorder as
(
select * from t1 order by 1 desc
;);
exec selectorder;

replace macro plus1(num integer) as
(select :num+1;);
exec plus1(99); -- 100
exec plus1(); -- null

replace macro plus1_optional(num# integer) as
(select :num#+1;);
exec plus1_optional(1); -- 2
exec plus1_optional(); -- null

replace macro createt1 as
(drop table t1; create table t1(id int););
-- Data definition not valid unless solitary

replace macro createt1 as
(create table t1(id int););
exec createt1;

-- Parameterizing an In-List
replace macro param_list(list varchar(100)) as
(
select * from (select 'teradata' as c1) t
where c1 in (:list)
;);
exec param_list('''hello'', ''teradata'''); -- No rows
exec param_list('teradata'); -- teradata

replace macro param_list2(list varchar(100)) as
(
select * from (select 'teradata' as c1) t
where position(c1 in :list) > 0
;);
exec param_list2('hello teradata'); -- teradata
exec param_list2('teradata'); -- teradata

Teradata Advanced SQL
-- CTAS
create table t2 as t1 with data;
create table t2 as t1 with no data;
create table t2 as (select * from t1) with data;
create table t2 as (select * from t1) with no data;
-- References(Foreign Key) constraints and Triggers (which reference source table) are not copied

-- index
create table t1 (c1 int, c2 int) unique primary index(c1);
create table t2 as t1 with no data unique index(c2); -- primary index(c1) and unique index(c2)
create table t2 as t1 with no data unique primary index(c1) unique index(c2); -- unique primary index(c1) and unique index(c2)
show table t2;


-- CTAS with subqueries
-- * Table attributes (FALLBACK) are not copied.  
-- * Secondary index are not copied.  
-- * First column listed becomes a NUPI unless otherwise specified.  
create table t1 (c1 int, c2 int) unique primary index(c1) unique index (c2);
create table t2 as (select * from t1) with no data; -- primary index(c1)
create table t2 as (select c2, c1 from t1) with no data; -- primary index(c2)

-- change column attributes
drop table t1;
drop table t2;
create table t1 (c1 int, c2 int) unique primary index(c1) unique index (c2);
create table t2 (c1 default 0, c2 check (c2 > 0) ) as (select * from t1) with no data;
create table t2 (c1 not null default 0) as (select c1 from t1) with no data;
insert into t2 values ();  -- (0) is inserted
Derived Table

with d1(c1, c2) as
(select * from t1)
select * from d1;
Volatile Table

  • LOG indicates that a transaction journal is maintained (default)
  • NO LOG allows for better performance
  • ON COMMIT DELETE ROWS delete all rows after a commit
  • ON COMMIT PRESERVE ROWS keep table rows at transaction end
  • No secondary indexes allowed
  • Statistics cannot be collected
  • Up to 1000 volatile tables are allowed in a single session
  • Some options are not available (Permanent Journaling, Referential Integrity, CHECK constraints, column compression, column default values, column titles, named indexes)
  • CREATE/DROP INDEX, ALTER TABLE, GRANT, REVOKE are not applicable


-- Implicit transactions
-- Rows are deleted immediately after the insert for implicit transactions
-- Same for ANSI mode
create volatile table v1(id int) on commit delete rows;

bt;
insert into v1 values (1);
select * from v1; -- (1)
et;

select * from v1; -- No rows

-- Explicit transactions
create volatile table v1(id int) on commit preserve rows;

bt;
insert into v1 values (1);
select * from v1; -- (1)
et;

select * from v1; -- (1)
Recursive

with recursive all_trips(origin, destination, cost, depth) as
(
select origin, destination, cost, 0 from flights where origin='LAX'
union all
select all_trips.origin, flights.destination, all_trips.cost+flights.cost, all_trips.depth+1 from all_trips
join flights on all_trips.destination=flights.origin
and all_trips.depth<3
)
select * from all_trips order by 4
;

4. Global Temporary Table

The syntax is CREATE GLOBAL TEMPORARY
Space is charged against an allocation of temporary space
Users can materialize up to 2000 tables per session
They survive a system restart
CREATE INDEX and COLLECT STATISTICS can be run on them
The default ON COMMIT mode is DELETE ROWS; to change it, run: ALTER TABLE t1 ON COMMIT PRESERVE ROWS

5. Windows Aggregate Functions

  • COUNT, SUM, MIN, MAX and AVG are standard aggregate syntax
  • RANK, PERCENT_RANK and ROW_NUMBER are extensions
  • Nulls are ignored during aggregation, but do aggregate as groups
  • When more than one window aggregate is projected, the final sort is determined by the last one projected

Step Order

  1. where
  2. aggregation
  3. having
  4. olap (partition by, order by rows)
  5. qualify [order by]
  6. RANDOM
  7. sample, top n
  8. order by
  9. format

6. Windows Aggregate Functions 2

Non-ANSI, ANSI
csum, sum over
mavg, avg over
msum, sum over
mdiff, N/A
rank(column), rank over
quantile, rank over

Cumulative SUM
sum(c1) over(partition by p1 order by o1 rows unbounded preceding)

Moving SUM
sum(c1) over(order by o1 rows 2 preceding)

Moving AVG
sum(c1) over(order by o1 rows between 2 preceding and 1 preceding)

Moving DIFF
c1 - min(c1) over(order by o1 rows between 7 preceding and 7 preceding) or mdiff(c1, 7, o1)

Remaining Window
sum(c1) over(order by o1 rows between current row and unbounded following)

Moving Window
sum(c1) over(order by o1 rows between current row and 2 following)

Reset When
sum(c1) over(order by o1 reset when c1 is null rows unbounded preceding)

PRECEDING includes the current row; FOLLOWING excludes the current row; 0 PRECEDING and 0 FOLLOWING include the current row
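These frame rules can be sanity-checked by mimicking them in plain Python (my own sketch of the ROWS semantics, not Teradata's implementation):

```python
# "ROWS n PRECEDING" = the current row plus the n rows before it.
def moving_sum(values, preceding):
    return [sum(values[max(0, i - preceding):i + 1]) for i in range(len(values))]

# "ROWS UNBOUNDED PRECEDING" = everything up to and including the current row.
def cumulative_sum(values):
    return [sum(values[:i + 1]) for i in range(len(values))]

print(moving_sum([1, 2, 3, 4], 2))   # [1, 3, 6, 9]
print(cumulative_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```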

7. RANK
rank() over(partition by p1 order by o1)
row_number is an ANSI standard function; it assigns distinct numbers even to tied values
row_number() over(order by o1 reset when r1 is null)
dense_rank() function is an ANSI standard function
rank() over(order by o1 with ties low)
rank() over(order by o1 with ties high)
rank() over(order by o1 with ties dense)
rank() over(order by o1 with ties avg)

percent_rank() is an ANSI standard function and is expressed as an approximate numeric ratio between 0.0 and 1.0
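How the rank variants treat ties can be sketched in a few lines of Python (ascending order assumed; an illustration, not Teradata's implementation):

```python
# rank: ties share the low rank, and a gap follows ("1, 2, 2, 4").
def rank(values):
    s = sorted(values)
    return [s.index(v) + 1 for v in values]

# dense_rank: ties share a rank with no gap after ("1, 2, 2, 3").
def dense_rank(values):
    uniq = sorted(set(values))
    return [uniq.index(v) + 1 for v in values]

# percent_rank: (rank - 1) / (rows - 1), a ratio between 0.0 and 1.0.
def percent_rank(values):
    n = len(values)
    return [(r - 1) / (n - 1) for r in rank(values)]

vals = [10, 20, 20, 30]
print(rank(vals))          # [1, 2, 2, 4]
print(dense_rank(vals))    # [1, 2, 2, 3]
print(percent_rank(vals))  # [0.0, 0.3333333333333333, 0.3333333333333333, 1.0]
```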

percentile_cont / percentile_disc
percentile_cont(0.3) within group (order by c2 (dec(8,2)))
percentile_disc(0.3) within group (order by c2 (dec(8,2)))
cume_dist = RankHigh / Num Rows in Group
first_value(c1 ignore nulls) over(partition by p1 order by o1 rows between 3 preceding and 1 following)
last_value(c1 respect nulls) over(partition by p1 order by o1 rows between 3 preceding and 1 following)

8. Extended Grouping Functions ROLLUP, CUBE, GROUPING SETS

-- rollup
select
 c1
,c2
,sum(c3)
from t1
group by rollup(1, 2)
order by 1 desc, 2 desc
;

-- grouping
select
 case grouping(c1)
  when 1 then 'total'
  else coalesce(c1, 'value is null') /* else or when 1*/
 end
,sum(c3)
from t1
group by rollup(c1)
;

-- two columns into one column
select
 c1
,c2
,sum(c3)
from t1
group by rollup((1, 2))
order by 1 desc, 2 desc
;

-- rollup summarizes from right-to-left
-- cube summarizes from right-to-left and left-to-right

-- cube
select
 c1
,c2
,sum(c3)
from t1
group by cube(1, 2)
order by 1 desc, 2 desc
;

-- cube and grouping
select
 case grouping(c1)
  when 1 then 'c1 total'
  else coalesce(c1, 'value is null')
 end
,case grouping(c2)
  when 1 then 'c2 total'
  else coalesce(c2, 'value is null')
 end
,sum(c3)
from t1
group by cube(c1, c2)
;

-- combine grouping in a cube
select
 c1
,c2
,c3
,sum(c4)
from t1
group by cube((1, 2), 3)
;

-- grouping sets
select
 c1
,c2
,sum(c3)
from t1
group by grouping sets(c1, c2)
;

-- adding grand totals
select
 c1
,c2
,sum(c3)
from t1
group by grouping sets(c1, c2, ())
;

group by cube(c1, c2)
group by grouping sets((c1, c2), c1, c2, ())
group by grouping sets(c1, ()) grouping sets(c2, ())
9. QUANTILE and WIDTH_BUCKET

-- quantile
select
 c1
from t1
qualify quantile(100, c2) = 20
;

-- aggregates
select
sum(c1)
from
( select c1 from t1 qualify quantile(100, c2) = 20 ) as t
;

-- qualify and group by
select
 c1
,quantile(100, c2)
from t1
group by 1
;

-- ordering the quantile
select
 c2
,quantile(100, c2 asc)
from t1
;

-- percent_rank
select
 c2
,percent_rank() over(order by c2)
from t1
;

-- percent_rank derives a value between 0.0 and 1.0
-- percentile is a number between 0 and 99

-- width_bucket(column, lower, upper, partition_count)
-- number of buckets created is always partition_count+3
-- 1. 0 bucket for all values less than the lower
-- 2. n+1 for all values greater than the upper
-- 3. null bucket for null values
select
 c1
,width_bucket(c1, 1, 3, 10)
from t1
;
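The bucket numbering can be sketched in Python to check the edge cases (equal-width buckets assumed; my own toy version, not Teradata's implementation):

```python
# WIDTH_BUCKET(value, lower, upper, count): count equal-width buckets,
# plus bucket 0 (below lower), count+1 (>= upper), and a NULL bucket,
# hence "count + 3" buckets in total.
def width_bucket(value, lower, upper, count):
    if value is None:
        return None                # the NULL bucket
    if value < lower:
        return 0                   # underflow bucket
    if value >= upper:
        return count + 1           # overflow bucket
    return int((value - lower) / (upper - lower) * count) + 1

assert width_bucket(0, 1, 3, 10) == 0      # below the range
assert width_bucket(1, 1, 3, 10) == 1      # first regular bucket
assert width_bucket(2.9, 1, 3, 10) == 10   # last regular bucket
assert width_bucket(5, 1, 3, 10) == 11     # above the range
assert width_bucket(None, 1, 3, 10) is None
```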

10. Correlated Subqueries
NOT IN can have issues when nulls are involved
NOT EXISTS does not share those issues due to row-at-a-time logic
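The NULL pitfall is easy to reproduce in SQLite through Python's sqlite3, since the three-valued logic is the same:

```python
# x NOT IN (1, NULL) evaluates to UNKNOWN rather than TRUE, so no row
# qualifies; NOT EXISTS compares row by row and is unaffected.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t1(c1 int)")
con.execute("create table t2(c1 int)")
con.executemany("insert into t1 values (?)", [(1,), (2,)])
con.executemany("insert into t2 values (?)", [(1,), (None,)])

not_in = con.execute(
    "select c1 from t1 where c1 not in (select c1 from t2)").fetchall()
not_exists = con.execute(
    "select c1 from t1 where not exists "
    "(select 1 from t2 where t2.c1 = t1.c1)").fetchall()
print(not_in)      # [] -- the NULL poisons the NOT IN
print(not_exists)  # [(2,)]
```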

11. Scalar Subqueries
NCSSQ: Inner query does not reference the outer table
CSSQ: Inner query does reference the outer table

select
 c1
,(select max(c2) from t1)
from t1
;

12. Date-Time Data Types
The date format for the system is defined in the “DATEFORM” setting of DBS Control

You can change or override the DATE format settings in any of the following ways:

Set/Change the system default date format in the SDF file (integerdate only)
Choose the DATEFORM at the system level in the DBS Control Record

  • Set the DATEFORM at the user level with the CREATE USER
  • Set the DATEFORM at the session level using SQL
  • Specify a format in a FORMAT phrase
  • modify general 14=0 /* integerdate (YY/MM/DD) */
  • modify general 14=1 /* ansidate (YYYY-MM-DD) */


create user dlm...
 dateform=ansidate
         =integerdate

set session dateform=ansidate
                    =integerdate

select date(format '----')

SQL Assistant may be affected by several settings, including Tools -> Options -> Data Format

The Century Break setting affects the system's interpretation of dates only when the FORMAT uses YY

If you use a Century Break setting of 40 (YY/MM/DD), '40/12/31' becomes 1940/12/31 and '39/12/31' becomes 2039/12/31.

Time(n) n=0-6 (default is 6)
HH BYTEINT (1 byte)
MI BYTEINT (1 byte)
SS DEC(8,6) (4 bytes)
Time(0) HH:MI:SS CHAR(8)
Time(6) HH:MI:SS.nnnnnn CHAR(15)
  Time data types are imported and exported as CHARACTER data using FORMAT

-- setting timezone
modify general 16=n /* n=-12 to+13 */
modify general 17=n /* n=-59 to+59 */

create user dlm...
  timezone=local
          =null
          ='08:00'
          ='-04:30'

set time zone local
set time zone user
set time zone interval '05:00' hour to minute

-- extract
select extract(timezone_hour from current_timestamp)
select extract(timezone_minute from current_timestamp)
13. Interval Data Types and Manipulations

select interval '05-13' year to month -- 6-01
select interval '03:12' hour to minute -- 3:12

select interval '02' year

select (date '2009-01-31' + interval '1' month) -- fail

select add_months(date '2009-01-31', 1) -- 2009-02-28

select (date '2009-01-31' - date '2010-02-15' year to month ) --  -1-01

select interval '1' year (interval month(4)) -- 12

14. Timestamps and Timezones (P.637)

Round with DATE
select round(cast('2017-10-02' as date), 'cc');
-- 2001-01-01
select round(cast('2017-10-02' as date), 'year');
-- 2018-01-01
select round(cast('2017-10-02' as date), 'month');
-- 2017-10-01
initcap

select initcap('software engineer');
-- Software Engineer
nvl & coalesce

select nvl(null, 1);
select coalesce(null, 1);
-- 1
DECODE

select decode(1, 1, '1st',
                 2, '2nd',
                 'other');

-- 1st
Tips

  • When using the TOP N function, unless there is an ORDER BY clause, the WITH TIES option is ignored
  • SHOW COLUMN is invalid, HELP COLUMN is valid
  • DATE and TIME are not ANSI standard functions
  • The FLOAT data type has 15 digits of precision, DECIMAL has 38 digits
  • The RANDOM function is a Teradata extension
  • A subquery cannot perform a SAMPLE; derived tables can
  • The TOP N option is not supported in a subquery
  • The WITH TIES option is ignored without an ORDER BY option
  • A recursive query without explicit infinite-loop prevention may still dead-end (terminate) on its own
  • Grand totals are always retrieved when using CUBE and ROLLUP
  • The format for integerdate is yy/mm/dd, for ansidate yyyy-mm-dd
  • The default timestamp data type is TIMESTAMP(6)
Backlog

  • Procedure



Oct 14, 2017

The September Issue


I watched “The September Issue”, a 2009 documentary about the making of VOGUE's September issue. It focuses not only on Anna Wintour but also on Grace Coddington, whose candid way of speaking makes the film all the more enjoyable.
This line that Grace quoted in the film stuck with me.
Always keep your eyes open. Keep watching. Because whatever you see can inspire you

Sep 28, 2017

Aster Basic Book


Below are personal memos about the book Teradata Aster BASICS.

Chapter 1
Four key characteristics of Big Data?
Complexity, Variety, Velocity, Volume

Chapter 2
During Partition Splitting (adding v-Workers), the Aster Database is unavailable until the process completes
  Adding Worker Nodes: more processing power per v-Worker
  Adding v-Workers (Partition Splitting): more parallelism across the system
The Queen node should use RAID 5
Worker and Loader nodes should use RAID 0 for small configurations
Worker and Loader nodes should use RAID 5 for large configurations

Chapter 3
What is QueryGrid?
A client-based interface that manages optimization so that each platform does what it does best when resolving the query request

Chapter 5
MVCC: The result is that over time, the table can contain logically deleted (invisible) rows that are no longer used, but are consuming table space

Chapter 6
Multi-structured data: consists of files that contain a wide variety of different formats and data types in a non-fixed manner that must be parsed and interpreted properly

Chapter 8
A Map/Row function must implement the operateOnSomeRows method
A Reduce/Partition function must implement the operateOnPartition method

Chapter 10
Rows returned in the RESULT operator are always?
Aggregations

Sep 23, 2017

Introduction to Teradata Aster Analytics

This is a note for the Teradata Aster Basics 6.10 Exam, a.k.a. TACP (Teradata Aster Certified Professional).
The recommended courses are the following; this note is for the 3rd course.

  • Teradata Certification, What’s New and How to Prepare
  • Introduction to Big Data and Teradata Aster*
  • Introduction to Teradata Aster Analytics
  • Introduction to Teradata Aster Database Administrator*

Map function doesn’t have PARTITION BY
Reduce function has PARTITION BY
PARTITION BY affects SHUFFLE phases

Quiz
What are the two different types of SQL-MR functions? (Choose two.)
  Partition functions (Reduce) & Row functions (Map)
In SQL-MR the ON clause can be what three things? (Choose three)
  Function, Table & Query
Fill in the missing word. How do you distinguish between a Map vs. Reduce function?
  A REDUCE function has a PARTITION BY clause, whereas the Map function does not
What criteria would you use to determine if you want to run SQL versus SQL-MR? Select the four criteria that are better suited to run with SQL-MR. (Choose four)
  Unstructured or multi-structured data, Machine learning algorithms, Recursive queries & Self-joins

Acquisition functions

  • load_from_hadoop
  • load_to_hadoop
  • load_from_hadoop_dir
  • load_from_pst
  • load_tweets
  • anydatabase2aster
  • load_from_s3
  • load_to_s3

Define foreign server
create foreign server hdp21
using server('192.168.100.21')
dbname('default') username('hue')
do import with load_from_hcatalog,
do export with load_to_hcatalog;

create foreign server td15
using tdpid('192.168.100.15')
username('td01') password('td01')
do import with load_from_teradata,
do export with load_to_teradata;
Pull & Push-down query

--pull
select c1, sum(c2)
from t1@td15
group by 1;

--push down
select * from FOREIGN SERVER
($$ select c1, sum(c2) from t1 group by 1 $$)@td15;

Quiz
Which two Teradata QueryGrid connectors can acquire data for Aster? (Choose two)
  Aster-to-Teradata & Aster-to-Hadoop
Why move data between Teradata and Aster? Match Aster and Teradata to what each database is best designed for.
  Aster - for analytics by limited number of data scientists
  Teradata - for high concurrency (hundreds of users)
What are some Teradata Aster parser functions?
  Apache logs, xml, json and pst
Using the Stream API, you can write functions in programming languages that are not native to Teradata Aster (e.g., write non SQL-MR or SQL-GR functions) and run them on Aster, generating output that Aster can receive, including:

  • writing R functions to run on Aster
  • write custom python, perl, C/C++/C# functions to run on Teradata Aster


Quiz
nPath is used for Pattern Matching across Time Series
What three expressions are used to specify input data for nPath?
  on, partition by and order by
What three expressions are used to specify nPath search criteria?
  mode, pattern and symbols
What kind of function is Kmeans?
  clustering
What kind of function is Decision Tree?
  predictive function

Quiz
What visualization function(s) are in Teradata Aster AppCenter?
  Visualizer (formerly nPathViz and cFilterViz)
What needs to be configured before building a new Application?
  Create a JDBC connection
Name three Data Format types. (Choose three)
  nPath, Table and cFilter
Name four different chart types that Teradata Aster AppCenter visualizations create. (Choose four)
  Tree, SanKey, Sigma and Chord
How can users dynamically change Teradata Aster AppCenter chart visualizations?
  By clicking on objects and/or by changing Layout/Format specs

Quiz
How do you connect to Aster via RStudio?
  Aster ODBC driver
What is the name of the Aster package for Teradata Aster R?
  TeradataAsterR
You want to access Help for Teradata Aster R to see a list of commands. What syntax would accomplish this?
  help(package=’TeradataAsterR’)


Final Exam SCORE: 96 PASSED

Question 1 Correct
  1. True or False: Map-Reduce is a programming model and an associated implementation for processing and generating large data sets.
Your answer: True
Question 2 Correct
  2. Each Map function performs an ETL on ____ in the input.
Your answer: all rows
Question 3 Correct
  3. The ___ gets a key and the array of values emitted with that key and produces the final result.
Your answer: Reduce Function
Question 4 Correct
  4. The SQL-MR syntax ON clause specifies the input rows, which can be a ___. (Choose four)
Your answer: Table, View, Sub-query, SQL-MR function
Question 5 Correct
  5. Does the syntax use a Map Function or a Reduce Function? Drag and Drop the Map Function and Reduce Function labels (at left) to the correct syntax (at right).
Your answer: 1-2, 2-1
Question 6 Correct
  6. True or False: Functions can be Map and Reduce functions at the same time.
Your answer: False
Question 7 Correct
  7. In the syntax below, click on the input.
Your answer: 5
Question 8 Correct
  8. Match the function (at left) with its description (at right):
Your answer: 1L-1R, 2L-2R, 3L-3R, 4L-4R
Question 9 Correct
  9. The _____ is used for clustering. Clustering is a fast/simple method for grouping objects into preliminary clusters using an approximate distance method. Each point is represented as a point in a multidimensional feature space.
Your answer: Canopy function
Question 10 Correct
  10. True or False: Map-Reduce is a programming model and an associated implementation for processing and generating large data sets.
Your answer: True
Question 11 Correct
  11. The _____ can extract multiple columns of structured data from standard Apache Web Logs.
Your answer: Apache Log Parser
Question 12 Correct
  12. Match the function (at left) with what it’s used for (at right):
Your answer: 1L-1R, 2L-2R, 3L-3R, 4L-4R, 5L-5R, 6L-6R, 7L-7R, 8L-8R
Question 13 Correct
  13. This question tests your knowledge of nPath pattern matching using the mode: non-overlapping and the pattern: ‘B+.C.A’. Given this input table and nPath syntax, which pattern matches will be in the output rows?
Your answer: 1 row: BBBCA
Question 14 Correct
  14. This question tests your knowledge of nPath pattern matching using the mode: overlapping and the pattern: ‘B+.C.A’. Given this input table and nPath syntax, which pattern matches will be in the output rows?
Your answer: 3 rows: BBBCA, BBCA, BCA
Question 15 Incorrect
  15. True or False: In Teradata Aster a single SQL-MR statement can call all of the necessary functions to go through acquiring data, to preparing it, to the multi-genre analyzing of it, and finally to visualizing it.
Your answer: False / Correct answer: True
Question 16 Correct
  16. In Teradata Aster the ____ function creates a row in a visualization table where Teradata Aster AppCenter can access it, view it, and manipulate it.
Your answer: ‘Visualizer’
Question 17 Correct
  17. True or False: Teradata Aster R packages address the challenges of R by allowing programmers to scale R analytics by leveraging Teradata Aster.
Your answer: True
Question 18 Correct
  18. Before being able to connect to the Teradata Aster cluster and start issuing Teradata Aster R commands, you must do which two things? (Choose two)
Your answer: Install/configure Teradata Aster 6.20 ODBC driver, Install RODBC and Teradata Aster R packages
Question 19 Correct
  19. What are RMapReduce runners?
Your answer: Functions to run R-code in Teradata Aster
Question 20
  20. Match the Teradata Aster R function to what you would use it for:
Question 20.1 Correct Your answer: 1L-1R
Question 20.2 Correct Your answer: 2L-2R
Question 20.3 Correct Your answer: 3L-3R
Question 20.4 Correct Your answer: 4L-4R
Question 20.5 Correct Your answer: 5L-5R


Sep 21, 2017

Introduction to Teradata Aster Database Administration


This is a note for the Teradata Aster Basics 6.10 Exam, a.k.a. TACP (Teradata Aster Certified Professional).
The recommended courses are the following; this note is for the 2nd course.
  • Teradata Certification, What’s New and How to Prepare
  • Introduction to Big Data and Teradata Aster
  • Introduction to Teradata Aster Analytics
  • Introduction to Teradata Aster Database Administrator
nc_system schema holds system information.
3 categories of DD views
  • nc_all
  • nc_user
  • nc_user_owned
Replication Factor
  • RF=1: No secondary v-workers. No fallback
  • RF=2: If a worker goes down, its secondary v-worker is promoted to the new primary v-worker. Max is 2. A primary and its replica are never located on the same Worker node.
Ganglia is an open-source, web-based, scalable distributed system monitoring tool.
AMC Status
  • Green: operating normally
  • Blue: decrease in performance
  • Yellow: unable to process statement requests
  • Red: stopped
  • White/Clear: no longer able to establish a connection
Aster Database only supports B-tree indexes and cannot enforce referential integrity.
There is no data sharing among Aster databases.

/*Change database*/
beehive=> \connect retails_sales;
retails_sales=>
retails_sales=>database beehive;
beehive=>

/*help*/
beehive=>\?

/*List database*/
beehive=> \l

/*Exit database*/
beehive=> \q

/*List schemas*/
beehive=> \dn

/*View tables in the PROD schemas*/
beehive=> \dt prod.*

/*View columns/data types*/
beehive=> \d prod.sales_fact

/*Show current schema*/
show search_path;
ALTER USER beehive SET SEARCH_PATH = 'public', 'mkt';

Select table by walking the path until finding the name
CONNECT privilege must be given to access the database. USAGE privilege must be given to access the schema.
Two Serial types: Global and Local.
  • A Serial Global type ensures the serial property across all of the nodes in the system.
  • A Serial Local type ensures the serial property local to each logical partition of data.
PARTITION BY RANGE: START includes the value but END excludes the value
partition sales_june (START'2017-06-01'::date END'2017-07-01'::date)
PARTITION BY LIST: If an incoming row does not fit into any partition, that row will not be loaded into the table
1. Data Modeling Quiz
  • Q. Best schemas for Teradata Aster databases
    A. Star schema and Snowflake schema
Aster column name rules
  • Starts with a character
  • Must be < 63 characters
  • Names may include special characters (_ , $)
Constraint Options
  • Null/Not Null
  • Primary key
  • Default values
  • Check values
create table stuff
(
emp int NOT NULL PRIMARY KEY,
dept varchar DEFAULT 'none',
age smallint CHECK(age >= 18 and age <= 70),
name varchar
)
distribute by replication
;

Data Types
  • CHAR, CHARACTER VARYING, VARCHAR(n): maximum is 10MB
  • TEXT is unlimited
  • Special types are Boolean, Bytea, Serial, and Big Serial
Supported data types of distribution
  • smallint
  • integer
  • bigint
  • numeric
  • text
  • varchar
  • uuid
  • bytea
For large tables (> 1million rows, usually Fact table)
For small tables (<= 1million rows, usually Dimension table)
If the HASH key and JOIN columns don’t match, a SHUFFLE will occur.
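The shuffle rule can be seen in a toy model: when both tables are distributed by the join key, matching rows are guaranteed to land on the same v-Worker and the join can run locally. A minimal Python sketch — v-Worker placement is simplified to a modulo hash and the tables are invented for illustration:

```python
N_VWORKERS = 4

def vworker(key):
    # Both tables hash the same key value to the same v-Worker
    return hash(key) % N_VWORKERS

orders = [(1, "o1"), (2, "o2"), (3, "o3")]        # (customer_id, order)
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]   # (customer_id, name)

# Distribute both tables by customer_id
placed_orders = {}
for cid, o in orders:
    placed_orders.setdefault(vworker(cid), []).append((cid, o))
placed_customers = {}
for cid, name in customers:
    placed_customers.setdefault(vworker(cid), []).append((cid, name))

# The join runs locally per v-Worker: matching rows are co-located,
# so no data needs to be shuffled between workers
joined = []
for w in range(N_VWORKERS):
    for cid, o in placed_orders.get(w, []):
        for cid2, name in placed_customers.get(w, []):
            if cid == cid2:
                joined.append((cid, o, name))

print(sorted(joined))  # [(1, 'o1', 'Ann'), (2, 'o2', 'Bob'), (3, 'o3', 'Cy')]
```

If the two tables were distributed by different columns, matching rows could sit on different workers and one side would have to be re-hashed and moved first — that movement is the shuffle.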
TRUNCATE: Quickly remove all rows and it reclaims disk space immediately
VACUUM: Converts dead space into usable free space
VACUUM FULL ANALYZE: Physically rearrange the data on disk
NC_RELATIONSTATS: Generate various reports

2: Creating Tables Quiz
Tables in a Teradata Aster Database can be of which four variations? (Choose four.)
  • Temporary(Fact/Dim)
  • Analytic(Fact/Dim)
  • Fact
  • Dimension
What data type is commonly used for “payload” columns? Click on the correct data type in the image.
TEXT
In Teradata Aster, table data may be partitioned in which two ways? (Choose two.)
Logically Partitioned tables (Logical)
Fact tables(Physical)
How do these two partitioning types improve performance? Match the partitioning type to how it improves performance.
Physical: More v-Workers equal more parallelism
Logical: Reduced disk I/O by only reading needed partitions
Scenario: You join 2 FACT tables where the Hash column matches the JOIN column. Will a shuffling of data occur?
No, the JOIN will commence immediately since JOIN column values are guaranteed to be on the same v-Worker.
nCluster loader arguments
  • -B --begin-script
  • -E --end-script
  • -d --dbname
  • -D --delimiter
  • -c --csv
  • -l --loader
  • --truncate-table
  • -w --password
  • -z --auto-analyze
  • -U --username
  • -p --port
  • --el-enabled
  • --skip-rows-1
Default delimited format is TSV
-B and -E specify script name to execute it
Parallelizing the Load tier
  • Add more loader nodes
  • Add more staging machines
  • Add more nCluster loaders running on the staging machines
Error logging is turned off by default. This means the load job will abort and roll back the data on encountering the first error.
  • --el-enabled
  • --el-limit <#>
  • --el-table
  • --el-label
  • --el-errfile
ncluster_export example
ncluster_export -h 192.168.100.100 -d beehive -U beehive -w beehive
\"aaf\".\"accesslog\" myfile.txt

3: Data Loading Quiz
What is the name of the Teradata Aster Database bulk loading tool? ncluster_loader
Which two node types can handle Teradata Aster data loading ? (Choose two.)
Loader nodes and Queen node (if there are no loader nodes)
Which task do Loader nodes perform during loading?
Hashing the Distribution Key for v-Worker placement
The loading tier can be scaled in which three ways? (Choose three.)
Add more nCluster Loaders, Loader Nodes, Staging Machines
In addition to the nCluster Loader Tool, which four other types of tools are used to load a Teradata Aster Database? (Choose four.)
ETL Tools, SQL Statements, Connectors, Teradata QueryGrid (Aster-to-Hadoop, Aster-to-Teradata)

Final Exam
  • Q. You have 5 Teradata Aster Databases. How many Data Dictionaries do you have?
  • A. 5 - one for each Teradata Aster Database
  • Q. Which two statements are true regarding a Teradata Aster Database? (Choose two.)
  • A. Each user must be given the CONNECT privilege on a database to access objects on the database,
  • By default, there is one database in a newly installed Teradata Aster cluster, called beehive
  • Q. True or False: The Aster Loader Tool must point to the Queen and can optionally point to the Cluster Loader Node for hashing. A. True
  • Q. Which two statements are true regarding a Teradata Aster Database? (Choose two.)
  • A. Data objects may be shared across schemas in the same database, Users can join tables from one schema with tables in another schema if they have proper privileges for the schemas/tables

Sep 20, 2017

RubyKaigi 2017


The following are my personal notes, so apologies for the messy, fragmentary sentences. Sparse notes on a talk don’t mean the talk’s quality was low; I was simply concentrating on listening.

Day1

Matz team (only Matz and Nobu)
Daily: Debugging, New features, Bug making etc
Why not Git? Windows is not supported officially. Not enough advantage
Developers’ meeting are held once per month
How to build Ruby is configure and make
BASERUBY: pre-installed ruby. generate source files
MINIRUBY: ruby made during the build. No dynamic loading. Unable to load extension libraries.
Mimic global variables used in mkmf.rb by trace_var
The following is, strictly speaking, not a bug
p = 2
p (-1.3).abs  #=> -1.3
Demon Castle parse.y by name
Monstrous lex_state
literal symbol by intern
Refining String#intern returns no-symbol
New features in 2.5?
Such as $. 2.3
Unicode case 2.4
Approved Array #append, #prepend
Rejected neko ^..^ operator (make range), User-defined operator
Under Discussion Method extraction operator (Kernel#method -> Method instance), Rightward assignment
Wouldn’t you write New Ruby?
Ruby History
to_json
jbuilder is very slow
ActiveModel::Serializers JSON-Schema API Blueprint(apiary), OpenAPI(Swagger), RAML, JSON Hyper-Schema, bare JSON Schema
Why choose OpenAPI? It has a more RESTful definition than API Blueprint
OpenAPI is developed by Swagger originally. Give type into OpenAPI
Use $merge and $ref to OpenAPI porting
GraphQL is API query language like SQL, WYSIWYG, has only 1 endpoint
Ecosystem is still insufficient
BFF is the abbreviation of Backend for Frontend
5000 warmup iterations increase 5-7%
Code is available in noahgibbs/rails_ruby_bench Rails version is 4.2.6
PyCall runs in Ruby interpreter
Use pandas in Rails app
Python is a best friend of Ruby from now on!
PyCall should be a temporary way
Red Data Tools project will be a home
Apache Arrow aims to be Arrow Memory for common tools
Red Arrow is Ruby’s library for Apache Arrow
Sutou-san (@kou) officially became a PMC member of Apache Arrow yesterday
Jupyter Notebook also supports Ruby
Python is managed by reference counting
Some types(string, dictionary etc) are converted from python to ruby primitive type
Ruby Commiters vs the World
Method chains read left to right, so why does assignment go right to left? That is the motivation for rightward assignment. Use a method chain instead (e.g. assign_to)
Matz likes Swift and Clojure, aside from Ruby, Emacs Lisp and Streem
ActiveSupport is made for web application
Day2
RubyKaigi is the biggest ruby conference!
Talk about talent
Love of languages
First language Matz used is Basic
Matz posted about Ruby to Python mailing list
Simula (1968) first object oriented language
Created by Dr. Kristen Nygaard
Lisp (Flavors)
C3 linearization algorithm
Alias method chain vs Module#prepend
CLOS (Common Lisp Object System): method combination enables hooking before/after/around method calls
Gorilla, Guerrilla, Monkey patching has global influence
Scoped monkey patching?
Line coverage, Branch coverage, Path coverage
C0, C1, C2 coverage
Coverage is just a measure, not a goal
SimpleCov is a wrapper for coverage.so
Method definition is counted as an execution
Concov detects temporal changes of coverage
Ticket #13901
RuboCop is an industry standard solution
RubyMine can detect more errors than RuboCop
Coverage is a lie
YARV compiles code into the bytecode
JetBrains/ruby-type-inference
JIT Type Checking for Dynamic Languages (Ren, 2016)
def foo(x)
  "".bar if x
end

foo(false)
Steep check types with annotations
Editor specific library & Universal LSP Client plugin
Auto complete & Jump definition is WIP
mtsmfm/language_server-ruby
Syntax check uses ruby -wc filename.rb Auto completion uses rcodetools
Day3
Copy on Write(CoW)
memsize_of, dup
CoW Page Fault cause bad performance
Unicorn is a forking web server with a parent and many children
Garbage Collector affects CoW
GC compaction means moving objects so the OS doesn’t replicate many pages
Some people said “Moving objects is impossible”
Aaron tried ‘Two Finger Compaction’
Disadvantage: slow
Advantage: easy
Object movement uses two fingers: one is the Free Pointer, the other is the Scan Pointer
The Free Pointer finds free addresses and the Scan Pointer finds used addresses, until the two fingers meet
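The two-finger scheme described above can be sketched in a few lines of Python — a simplified model of the 'Two Finger Compaction' idea from the talk, not the actual MRI code:

```python
FREE = None  # an empty heap slot

def two_finger_compact(heap):
    """Move live objects from the end of the heap into free slots
    at the front, until the two 'fingers' meet."""
    free = 0               # Free Pointer: scans forward for empty slots
    scan = len(heap) - 1   # Scan Pointer: scans backward for live objects
    forwarding = {}        # old index -> new index, for fixing references
    while free < scan:
        if heap[free] is not FREE:
            free += 1
        elif heap[scan] is FREE:
            scan -= 1
        else:
            heap[free] = heap[scan]   # move the object down
            heap[scan] = FREE
            forwarding[scan] = free
    return heap, forwarding

heap = ["a", FREE, "b", FREE, "c", "d"]
compacted, fwd = two_finger_compact(heap)
print(compacted)  # ['a', 'd', 'b', 'c', None, None]
print(fwd)        # {5: 1, 4: 3}
```

After compaction the live objects sit contiguously at the front, so the OS no longer needs to replicate the now-empty pages on fork.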
What objects can move?
Everything!?
GC can easily know Ruby-level references
If the reference is from a C extension, it’s difficult
hash_key uses the memory address: fix by caching the hash key
rb_gc_mark Dual References: fix by calling rb_gc_mark, or use only Ruby
rb_define_class
string literals: it seems like nothing can move, but most can be fixed
46% can move!
ObjectSpace.dump_all can output as json
tenderlove/heap-utils
/proc/{PID}/smaps
Question your assumptions
Replication with Quorum
Create at least 2 replicas of data (max 3). If the first response is OK, discard the thread for the other node
Bigdam: edge locations around the earth + the central location
Bigdam-pool: Distributed key-value storage. to build S3 free data ingestion pipeline
chunk id guarantees the uniqueness
Ruby 2.5 supports block-wide rescue

JRuby’s Startup Warmup phase is slower than MRI
Has to load tons of Java classes
Warmup phase has JRuby(Ruby Interpret, Ruby JIT) & JVM(Java Interpret, Java JIT)
Graal is new JIT for all languages on the JVM
jruby/jruby
RTL
In special case, when GCC optimize simple loop, the byte code doesn’t have loop, but JVM cannot do this
MJIT status
* Unstable, doesn’t work on Windows, one more year to mature
* No inlining yet(most important optimizations!), Use C inlining, new GCC/LLVM extension
* Will RTL and MJIT be a part of MRI?
References
Slides

This was my first RubyKaigi, and it was really interesting. Aaron’s talk especially left an impression on me, and of course Matz’s did too. Until this conference I simply liked writing Ruby, but now I like Ruby and the community. (Uh, sounds like a poem.) Matsuda-san announced that the next RubyKaigi will be held in Sendai, Miyagi (May 31st - June 2nd). I will definitely go again next year!


Atomic-Bomb Dome

Peace Memorial Park

Hiroshima Castle

Aug 21, 2017

Introduction to Big Data and Teradata Aster


This is a note for the Teradata Aster Basics 6.10 Exam, a.k.a. TACP (Teradata Aster Certified Professional).
The recommended courses are the following, and this note covers the 2nd one.
  • Teradata Certification, What’s New and How to Prepare
  • Introduction to Big Data and Teradata Aster
  • Introduction to Teradata Aster Analytics*
  • Introduction to Teradata Aster Database Administrator*
SQL vs SQL-MR: SQL is better for standard transformation. SQL-MR is better for custom transformation(e.g. log extraction)
R creates multiple copies of data during processing and doesn’t automatically run in parallel. Aster R runs in parallel across the Aster MPP architecture.
FSE(Foreign Server Encapsulation): Supports remote data platforms other than Aster and Teradata. (e.g. Oracle, Hadoop, DB2, etc)
QueryGrid Aster-Teradata: Join tables in Teradata and Aster Database
QueryGrid Aster-Hadoop: Copy data from Hadoop to Aster, from Aster to Hadoop. HCatalog: Table metastore service for Hive, Pig, and so on.
Deployment Options: Aster Appliance, Cloud, Software Only (RHEL) and Aster on Hadoop.
Data Preparation: IPGeo, Pivot, JsonParser, Apache Log Parser and PSTParserAFS
Aster Analytics Portfolio
  • Data Acquisition
  • Data Preparation
  • Advanced Analytics
  • Visualization
Aster Database
  • Analytic Engine
    • Aster SQL-MR
    • Aster SQL-GR (Based on Bulk Synchronous Processing)
    • Aster R
  • SNAP Framework
    • Integrated Optimizer
    • Integrated Executor
    • Unified SQL Interface
    • Common Storage System and Services
  • Multi-Type Storage
  • AFS(Aster File Store)
Queen: Cluster Coordination, Distributed Query Planning, System Tables
Worker Node: Send back results to Queen
Loader: Loading data to Aster
Access Control
  • Aster username/password
  • TD Wallet
  • LDAP
Multi-Version Concurrency Control (MVCC): Eliminates the need for read locks while ensuring that the database maintains the key ACID properties (Atomicity, Consistency, Isolation, Durability)
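A minimal illustration of the MVCC idea — not Aster’s actual implementation: each row version carries a begin/end timestamp, and a reader picks the version visible at its own timestamp without taking any read lock:

```python
INF = float("inf")

# Version chain for one row: (begin_ts, end_ts, value)
versions = [
    (1, 5, "v1"),   # visible to readers with timestamp in [1, 5)
    (5, 9, "v2"),
    (9, INF, "v3"), # current version
]

def read(versions, ts):
    """A reader at timestamp ts sees the version whose lifetime covers ts.
    Writers create new versions instead of overwriting old ones, so
    readers never block and never need a lock."""
    for begin, end, value in versions:
        if begin <= ts < end:
            return value
    return None

print(read(versions, 3))    # v1
print(read(versions, 7))    # v2
print(read(versions, 100))  # v3
```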
Two Level Query Optimization
  • Queen Global Optimizer: Rule Based
  • v-Worker Optimizer: Cost Based. The cost is determined by the demographics of the v-Worker fragment of the distributed data.
Dynamic Workload Management
  • User-based policies
  • Time-based policies
  • Object-based policies
  • IP-based policies
  • Periodic Re-evaluation
nCluster’s columnar capability is a custom development by Aster, not part of PostgreSQL. The columnar limitation is append-only (no updates or deletes)
Columnar advantage and limitation
  • Use NOT NULL whenever possible
  • Avoid variable length data
  • Don’t SELECT/ANALYZE any columns unless it is necessary
Three compression levels
  • Hot data: No or low compression
  • Cold data: Medium or High compression
Informatica has an Aster connector. Others use the nCluster loader.
Aqua Data Studio: http://www.aquafold.com/
Viewpoint portlet for Aster
  • Aster Node Monitor
  • Aster Completed Processes

Aug 13, 2017

Shinkai Makoto Movies


Director Makoto Shinkai’s “Your Name.” (「君の名は。」) became available to rent on iTunes, so I watched it. I also watched “The Garden of Words” and “5 Centimeters per Second” over this three-day weekend. My honest impression of Your Name. is that it was good, but there were parts of the story I couldn’t quite get into. Maybe because the characters are students, or because it’s set in the countryside, I somehow found it hard to see things from their perspective. Shinkai is often praised for drawing “pictures more beautiful than photographs (reality),” but when the visuals are that beautiful my eyes kept drifting to the artwork, and I caught myself thinking that scenery this gorgeous doesn’t exist in real life. Since it’s animation I figured watching at home would be fine and skipped the theatrical run, but now I slightly wonder whether seeing it in a theater might have helped me concentrate. The film left some things unresolved for me, so I bought the novel and will read it at a leisurely pace.

This was my second time watching The Garden of Words, and it was as good as ever. It has influenced me enough that, with a move possibly coming up, I’m looking for a place around Shinjuku. The novelization of The Garden of Words is also excellent, and I recommend it to anyone who has only seen the film: it reveals the background behind the characters’ actions that the movie never shows.

It was my first time watching 5 Centimeters per Second, and much of it I couldn’t grasp in a single viewing; reading reviews on the web gradually helped it make sense. It does feel connected to Your Name. in many ways, but 5 Centimeters is the more realistic one and it pierces the heart (in a sense, stabs it). It’s murky rather than sparkling — a different world from both Your Name. and The Garden of Words.

I’m not a devoted Shinkai fan, so I’m writing rather loosely here, but after the niche path of 5 Centimeters per Second and The Garden of Words, and then the huge mainstream coming-of-age hit Your Name., I’m curious what kind of movie he will make next.

Jul 22, 2017

LINE Developer Meetup 19th

I went to a LINE Developer Meetup for the first time. The data-related sessions usually held in Fukuoka were tried out in Tokyo this time. What struck me was that, alongside engineers and data scientists, they also have a “planner” role. Rough notes below.

About the Tokyo/Fukuoka data analysis teams / taiichi
Roles in the organization are split into Planner, Data Scientist, and Engineer; few companies have a Planner role
Data quality management is automated (I was curious how, but never found the right moment to ask…)
The Clova platform is developed jointly with NAVER, so search engine technology is used in it as well
Data is used not only for BI but also in areas close to the product

Extracting and visualizing users’ interests through large-scale text mining / tkengo
To extract engineers’ technical interests, they built a crawler and scraper to collect data from the web
When user IDs differ across multiple sites, they are matched via external service ID federation
Identical IDs are assumed to belong to the same person (even if some don’t, the number is very small)

Getting PySpark running on Apache Zeppelin / yuta hongo
Zeppelin can run queries not only against Spark but also against MySQL and others
You can embed forms, so even people who can’t write the query can easily plug in values

A growth-hack project practiced across Tokyo and Fukuoka / AyakaMatsuoka & doradora09
Interview stakeholders before analyzing to stamp out ambiguous points
The cost of redoing an analysis is larger than the aggregation cost of defining metrics in fine detail
Agree in advance not only on what you will do, but also on what you will likely want to do and what you cannot do
Explain the background to remote members so they never wonder why a given analysis is being done; don’t let the Tokyo side alone decide analysis themes and tasks
Challenge of analyzing in a remote setup: communication cost
Growth hacking requires business sense, service growth, speed, and ease of interpretation
Analysis is a cost, so over the medium to long term it must contribute to profit
Even better if proposed actions can be presented together with the analysis results
Monthly changes in the user funnel (watch mid-tier users turning into high-tier or dormant ones)
Order the report themes carefully; keep the messaging simple and easy to interpret

References
データ分析のための機械学習入門
Web UI for PrestoDB
Apache Zeppelin

Jul 17, 2017

Teradata SQL Memo

This is just a personal memo about Teradata SQL.

-- list all tables
select * from dbc.tables;

-- get ddl
show table table_name;
show view view_name;

-- show current user
select user;

--get suggestion of collect statistics target
diagnostic helpstats on for session;

-- ctas
create table schema.tablename_new as schema.tablename with data;

-- timestamp string -> timestamp -> date
select cast(cast('2016-03-07 23:59:59' as timestamp) as date)

-- decimal -> date -> string
select cast(cast(20170101 as char(8)) as date format 'YYYYMMDD') (format 'yyyy-mm-dd') (char(10))

--format sample
select cast(cast(9999.9999999 as dec(18,10) format '--ZZ9.9999999999' ) as varchar(50))
select cast(cast(123456789.987654321 as dec(19,10) format '--Z(7)9.9(7)' ) as varchar(50))
select cast(cast(-123456789.987654321 as dec(19,10) format '--Z(7)9.9(7)' ) as varchar(50))

--oreplace
select oreplace('hello warld', 'a', 'o');

--qualify
select *
from dbc.tables
qualify
 row_number() over (order by databasename) between 1 and 10

--csum(cumulative sum)
select
version,
csum(version, version)
from dbc.tables

--CSUM with reset logic (GROUP BY)
select
version,
csum(version, version)
from dbc.tables
group by version

--mavg(moving average)
select
version
,mavg(version, 3, version)
from dbc.tables

--mavg with reset
select
version
,mavg(version, 3, version)
from dbc.tables
group by 1

--msum(moving sum)
select
distinct
version
,msum(version, 3, version)
from dbc.tables

--mdiff(moving difference)
select
version
,mdiff(version, 3, version)
from dbc.tables

--rank
select
version
,rank(version)
from dbc.tables

--dense rank
select
version
,dense_rank() over (order by version)
from dbc.tables

--quantile
select
distinct
version
,quantile(10, version desc)
from dbc.tables

--lag and lead
select
calendar_date
,max(calendar_date) over(partition by 1 order by calendar_date rows between 1 preceding and 1 preceding)
,min(calendar_date) over(partition by 1 order by calendar_date rows between 1 following and 1 following)
from SYS_CALENDAR.CALENDAR
order by 1

-- merge
merge into department
using values (700, 'shipping', 990000.00) as dept(deptnum, deptname, budgamt)
on dept.deptnum = department_number
when matched then update
set budget_amount = dept.budgamt
when not matched then insert
values (dept.deptnum, dept.deptname, dept.budgamt, null);

-- with recursive
with recursive all_trips
(origin,
destination,
cost,
depth) as
(
select origin, destination, cost, 0
from flights
where origin = 'lax'
union all
select all_trips.origin,
       flights.destination,
       all_trips.cost + flights.cost,
       all_trips.depth + 1
from all_trips inner join flights
on all_trips.destination = flights.origin
and all_trips.depth < 3
)
select * from all_trips;
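The recursive CTE above can be mirrored in Python to see how the seed rows and the recursive term expand — the flight data here is a hypothetical sample, not from the original:

```python
# flights: (origin, destination, cost) -- sample data for illustration
flights = [("lax", "sfo", 100), ("sfo", "sea", 150), ("sea", "jfk", 300)]

# Seed: direct flights out of 'lax' (the non-recursive term of the CTE)
trips = [("lax", dest, cost, 0) for o, dest, cost in flights if o == "lax"]

# Recursive term: extend each trip with a connecting flight, up to depth 3
frontier = list(trips)
while frontier:
    new = []
    for origin, dest, cost, depth in frontier:
        if depth < 3:
            for o, d, c in flights:
                if o == dest:
                    new.append((origin, d, cost + c, depth + 1))
    trips.extend(new)
    frontier = new

print(trips)  # [('lax', 'sfo', 100, 0), ('lax', 'sea', 250, 1), ('lax', 'jfk', 550, 2)]
```

The depth column plays the same role as in the SQL: it bounds the recursion so cyclic flight graphs can’t loop forever.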

Aster SQL Memo

This is just a personal memo about Aster SQL.

-- drop table
drop table if exists schema.table;

-- create fact table
create table schema.table (
    id integer
) distribute by hash(id) compress low;

-- create dimension table
create table schema.table (
    id character varying
) distribute by replication compress low;

-- collect statistics
analyze schema.table;

-- * multiple columns to where-in isn't allowed
select * from schema.t1 where (c1, c2) in (select d1, d2 from schema.t2);
-- > error

-- cast
select cast(c1 as integer) from schema.table;
select c1::integer from schema.table;

-- collaborative filtering
select * from cfilter (
on (select 3)
partition by 3
inputtable ('public.input_table')
outputtable ('public.output_table')
inputcolumns ('c1')
joincolumns ('j1')
droptable ('true')
);

-- canopy
select * from canopy(
on (select 1) partition by 1
inputtable('public.input_table')
loosedistance('130.0')
tightdistance('65.0'))
order by canopyid
;

-- kmeans modeling
select * from kmeans
(on (select 1) partition by 1
inputtable('public.input_table')
outputtable('public.output_table')
numberK('3'))
;

-- kmeans scoring
select * from kmeansplot (
on public.input_table partition by any
on public.kmeans_modeling_output dimension
centroidstable ('public.kmeans_modeling_output')
) order by 2,1

Jul 15, 2017

Presto Parser Introduction

Presto’s parser uses ANTLR. The grammar file is at the following path; for any parser-side additions or fixes, this is the first place to look.
presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4
Below is an excerpt of the CTAS rule. As an example, let’s add syntax so that a SELECT statement wrapped in parentheses also parses correctly.

 CREATE TABLE (IF NOT EXISTS)? qualifiedName
   (COMMENT string)?
   (WITH tableProperties)? AS query
   (WITH (NO)? DATA)?                                             #createTableAsSelect

The fix looks like the following. The parentheses on the right-hand side are quoted to escape them. As you can probably guess, | acts as OR, so this expresses query or (query).

 CREATE TABLE (IF NOT EXISTS)? qualifiedName
   (COMMENT string)?
   (WITH tableProperties)? AS (query | '('query')')
   (WITH (NO)? DATA)?                                             #createTableAsSelect

After editing SqlBase.g4, move to presto-parser and build; this automatically generates SqlBaseParser.java. The next file to modify is AstBuilder.java, whose visitCreateTableAsSelect method returns a CreateTableAsSelect node. The parenthesized-query change above doesn’t affect downstream processing, so nothing needs to change there, but when you add a new property such as (COMMENT string)?, you do need to modify this part.

Beyond the parser the call hierarchy gets deep, so it’s hard to explain here, but you can follow it fairly easily by using other statements as references. presto-spi provides the interface to connectors, so when adding new syntax, first define the interface there and then implement it on the connector side.

Jul 1, 2017

Shallow Deep Work


I read “Deep Work” (the Japanese edition is titled 「大事なことに集中する」) and have been putting it into practice where I can. Despite the Japanese title, this is the Deep Work book that was introduced on Rebuild.fm.

Reading it stirred up a lot of thoughts: if someone as great as Donald Knuth doesn’t even use email, why am I on social media?! So I deleted Facebook and Instagram, and Twitter is now basically read-only. I’d like to quit Twitter too if I could, but every so often information I actually wanted flows by, which makes it a real dilemma… I deleted LINE in the same burst of momentum, but that apparently caused some trouble for the people around me, so I reinstalled it.

I also used to carry two phones, one for work and one personal, but that itself felt like a source of distraction, so I consolidated onto the work phone. I lowered the data plan on the personal one and set up call forwarding to the work number. September is my contract’s cancellation month, so I’ll reconsider then.

At work I used to reply to email the moment I saw it, but I’ve stopped keeping the mail client running all the time. Mails that don’t strictly require a reply I now mostly answer first thing in the morning, during lunch, or just before leaving.

The book describes far more extreme measures, but I started at the level written above. I think I’ve carved out more time than before, but it’s meaningless unless the substance follows, so I need to keep thinking about what to do with the time I’ve freed up.

CMU DB Systems Spring 2017

I watched all of Carnegie Mellon University’s database systems course, which is published on the web.
CMU Database Systems Spring 2017


Many computer science departments have courses on CPUs and operating systems, but I suspect schools that teach database internals — rather than SQL and normalization — are quite rare.

L05 - Multi-Version Concurrency Control
Append-Only Storage: every update creates a new tuple in free space in the table and redirects references to it
Time-Travel Storage: copy the old version into a time-travel table and overwrite the version in the main table
Delta Storage: copy only the changed values into a delta storage segment and overwrite the master version
PKey indexes always point to the head of the version chain. Append-Only
Secondary Indexes: divided into Logical Pointers (hold the primary key, resolved via the primary index) and Physical Pointers
Microsoft HEKATON: OLTP Engine for MSSQL started in 2008

L06 - Index Locking & Latching 
Index Locks vs. Latches
INDEX: improves the speed of data retrieval. The additional costs are writes and storage space.
Order Preserving Indexes: Maintain keys in some sorted order. O(log n)
Hashing Indexes: Maps a hash of the key to particular record. Only supports equality predicates with O(1)
B-tree: Values in all nodes in the tree
B+tree: Only stores values in leaf nodes

Locks vs. Latches

  • Separate User transactions vs. Threads
  • Protect Database Contents vs. In-Memory Data Structures
  • During Entire Transactions vs. Critical Sections
  • Deadlock Detection & Resolution vs. Avoidance

Latch Implementations

  • Blocking OS Mutex std::mutex
  • Test-and-Set Spinlock(TAS) std::atomic<T>
  • Queue-based Spinlock (MCS) std::atomic<Latch*>
  • Reader-Writer Locks

Index Locking Schemes

  • Predicate Locks: Shared lock on SELECT-WHERE, Exclusive lock on UPDATE, INSERT, DELETE-WHERE. Hard to implement
  • Key-Value Locks: Locks that cover a single key value
  • Gap Locks
  • Key-Range Locks
  • Hierarchical Locking: Allow a txn to hold wider key-range locks with different locking modes

L07 Latch-free OLTP Indexes
Bw-Tree
T-Trees: Based on AVL Trees. Store pointers to their original values. Use less memory. Difficult to rebalance and to implement safe concurrent access
Skip Lists: Insert and Delete don’t require rebalancing
Index Implementation Issues
Reference Counting: Increment the count before accessing, decrement it when finished. Bad performance on multi-core CPUs
Read-Copy-Update
Epoch Garbage Collection

L11 Database Compression

  • Block-level
  • Tuple-level
  • Attribute-level
  • Column-level
NAÏVE COMPRESSION
  -> LZO(1996), LZ4(2011), Snappy(2011), Zstd(2015)

COLUMNAR COMPRESSION

  • Null Suppression (Useful in wide tables with sparse data)
  • Run-length Encoding
  • Bitmap Encoding (Only practical if the value cardinality is low)
  • Delta Encoding
  • Incremental Encoding
  • Mostly Encoding
  • Dictionary Encoding
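Two of the encodings listed above — run-length and dictionary encoding — sketched minimally in Python (toy columns, purely illustrative):

```python
def rle_encode(column):
    """Run-length encoding: store (value, run_length) pairs
    instead of repeating the value."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(column):
    """Dictionary encoding: replace each value with a small integer code."""
    codes = {}
    encoded = []
    for v in column:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return codes, encoded

print(rle_encode(["A", "A", "A", "B", "B", "A"]))  # [('A', 3), ('B', 2), ('A', 1)]
print(dict_encode(["NY", "CA", "NY"]))             # ({'NY': 0, 'CA': 1}, [0, 1, 0])
```

Both pay off most on sorted or low-cardinality columns, which is why columnar stores often sort before compressing.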
Multi-Attribute Encoding
(original)
val1 val2
A    202
B    101
A    202
C    101
B    101

(compressed)
val1+val2
XX
YY
XX
ZZ
YY

val1 val2 code
A    202  XX
B    101  YY
C    101  ZZ

Order-Preserving Compression
(original)
name
Andrea
Joy
Andy
Dana
(compressed)

name
10
40
20
30
value  code
Andrea 10
Andy   20
Dana   30
Joy    40

select * from users where ename like 'And%'
vs
select * from users where name between 10 and 20

select * from users where ename like 'And%'
  -> Still have to perform seq scan

select distinct * from users where ename like 'And%'
  -> Only need to access dictionary
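The trick above — assigning dictionary codes in sorted order so that a prefix predicate becomes a range predicate on the codes — can be sketched as follows (the name column is the lecture’s example; the code assignment is illustrative):

```python
names = ["Andrea", "Joy", "Andy", "Dana"]

# Assign codes in sorted order, so comparing codes mirrors comparing strings
code = {v: (i + 1) * 10 for i, v in enumerate(sorted(names))}
# {'Andrea': 10, 'Andy': 20, 'Dana': 30, 'Joy': 40}

encoded = [code[n] for n in names]  # the compressed column: [10, 40, 20, 30]

# LIKE 'And%' becomes a range predicate on codes: BETWEEN 10 AND 20,
# evaluated directly on the compressed data without decompressing
matches = [c for c in encoded if 10 <= c <= 20]
print(matches)  # [10, 20]
```

A plain (non-order-preserving) dictionary could not support this, since its codes carry no ordering information.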

OLTP DBMSs cannot use the OLAP compression techniques because we need to support fast random tuple access. Indexes consume a large portion of the memory of an OLTP database.

L12 Logging Protocols
The gold standard for physical logging & recovery in a disk-oriented DBMS is ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)

  • Write-Ahead Logging
  • Repeating History During Redo
  • Logging Changes During Undo

SiloR uses epoch-based OCC; it achieves high performance by parallelizing all aspects of logging, checkpointing, and recovery.

  • The logger threads write buffers out to files. After 100 epochs, it creates a new file
  • Log record format: ID of the txn that modified the record (TID), (Table, Key, Value)

update people
set islame = true
where name in ('Dana', 'Andy')

Txn#1001
[people, 888, (islame->true)]
[people, 999, (islame->true)]

L13 Checkpointing Protocols
Checkpointing allows the systems to ignore large segments of the log to reduce recovery time.

  • Naïve Snapshots
  • Hyper - Fork Snapshots
  • Copy-On-Update Snapshot
  • VoltDB - Consistent Checkpoints
  • Wait-Free ZigZag
  • Wait-Free Ping-Pong

Facebook Scuba
By storing the database shared memory, the DBMS process can restart and the memory contents will survive
On shutdown, copy data from heap to shared memory

L14 Optimizer Implementation (Part 1)
Heuristic-Based Optimization
Define static rules that transform logical operators to a physical plan
Advantages
  • Easy to implement and debug
  • Works reasonably well and is fast for simple queries

Disadvantages
  • Relies on magic constants that predict the efficacy of a planning decision
  • Nearly impossible to generate good plans when operators have complex inter-dependencies
Heuristics + Cost-Based Join Search
Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables

Advantages
  • Usually finds a reasonable plan without having to perform an exhaustive search

Disadvantages
  • Left-deep join trees are not always optimal
Randomized Algorithms
Advantages
  • Jumping around the search space randomly allows the optimizer to get out of local minima
  • Low memory overhead (if no history is kept)

Disadvantages
  • Difficult to determine why the DBMS may have chosen a particular plan
  • Have to do extra work to ensure that query plans are deterministic
  • Still have to implement correctness rules

Starburst Optimizer
Advantages
  • Works well in practice with fast performance

Disadvantages
  • Difficult to assign priorities to transformations
  • Rules maintenance is a huge pain

Volcano Optimizer
Advantages
  • Use declarative rules to generate transformations
  • Better extensibility with an efficient search engine

Disadvantages
  • All equivalence classes are completely expanded to generate all possible logical operators before the optimization search
  • Not easy to modify predicates

L15 Optimizer Implementation (Part 2)
  • Stratified Search
    First rewrite the logical query plan using transformation rules and then perform a cost-based search to map the logical plan to physical plan
  • Unified Search
    Unify the notion of both logical→logical and logical→physical transformations
  • Cascade Optimizer
    Object-oriented implementation of the Volcano query optimizer

Search Termination
  • Wall-clock Time
  • Cost Threshold
  • Transformation Exhaustion

L16 Cost Models
Cost model components are Physical costs, Logical costs and Algorithmic costs.
Postgres cost model uses a combination of CPU and I/O costs.
In memory is 400x faster than disk
  • Sequential I/O is 4x faster than random I/O
  • IBM DB2 Learning Optimizer updates table stats as the DBMS scans a table during normal query processing.

L17 Query Execution & Scheduling

Process Models
  • Process per DBMS Worker
    • Each worker is a separated OS process
    • Use shared-memory for global data structures
  • Process Pool
    • A worker uses any process that is free in a pool
    • Bad for CPU cache locality
  • Thread per DBMS Worker
    • Single process with multiple worker threads
    • Thread crash (may) kill the entire system

Task Assignment
  • Push: A centralized dispatcher assigns tasks to workers and monitors their progress
  • Pull: Workers pull the next task from a queue, process it, and then return to get the next task
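The pull model can be sketched with a shared queue: each worker grabs the next task whenever it becomes free — a toy illustration, not an actual DBMS scheduler:

```python
import queue
import threading

tasks = queue.Queue()
for i in range(6):
    tasks.put(i)

results = []
lock = threading.Lock()

def worker():
    # Pull model: a worker takes the next task from the queue,
    # processes it, and returns for the next one
    while True:
        try:
            t = tasks.get_nowait()
        except queue.Empty:
            return  # no more work
        with lock:
            results.append(t * t)

threads = [threading.Thread(target=worker) for _ in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25]
```

In the push model, by contrast, a central dispatcher would decide up front which worker gets which task and then monitor progress.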
L18 Parallel Join Algorithms (Hashing)
Many OLTP DBMSs don't implement hash join.
Hash join is the most important operator in a DBMS for OLAP workloads.

Hash join
Phase #1 Partition
Phase #2 Build
Phase #3 Probe

Partition phase
Approach #1 Non-Blocking Partitioning (Shared Partitions or Private Partitions)
Approach #2 Blocking Partitioning (Radix)

Build phase
For each tuple, hash the join key attribute for that tuple and add it to the appropriate bucket in the hash table.

Probe phase
For each tuple in S, hash its join key and check to see whether there is a match for each tuple in corresponding bucket in the hash table constructed for R.
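The build and probe phases can be sketched minimally in Python (single-threaded, so the partition phase is omitted; tables are invented for illustration):

```python
# Build side R and probe side S: lists of (join_key, payload)
R = [(1, "r1"), (2, "r2"), (2, "r2b")]
S = [(2, "s1"), (3, "s2"), (1, "s3")]

# Build phase: hash each R tuple's join key into a bucket
table = {}
for key, payload in R:
    table.setdefault(key, []).append(payload)

# Probe phase: for each S tuple, look up its join key in the hash
# table and emit a result for every matching R tuple in the bucket
out = []
for key, payload in S:
    for r_payload in table.get(key, []):
        out.append((key, r_payload, payload))

print(sorted(out))  # [(1, 'r1', 's3'), (2, 'r2', 's1'), (2, 'r2b', 's1')]
```

The cost profile matches the notes: one pass to build, one pass to probe, with the hash table ideally fitting in memory per partition.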

Hash functions
* MurmurHash (2008)
* Google CityHash (2011)
* Google FarmHash (2014)
* CLHash (2016)

Hash table implementations
* Chained hash table: maintain a linked list of buckets for each slot in the hash table.
* Open-Addressing hash table: Single giant table of slots. Resolve collisions by linearly searching for the next free slot in the table.
* Cuckoo hash table: Use multiple hash tables with different hash functions.
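A minimal open-addressing table with linear probing, as described above — no resizing or deletion, purely illustrative:

```python
class OpenAddressingTable:
    """Single array of slots; collisions are resolved by linearly
    probing for the next free slot in the table."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # linear probe
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None

t = OpenAddressingTable()
t.put(1, "a")
t.put(9, "b")   # 9 % 8 == 1: collides with key 1, probes to the next slot
print(t.get(1), t.get(9), t.get(5))  # a b None
```

A chained table would instead append both colliding entries to the same bucket’s list; cuckoo hashing would try a second hash function for the displaced key.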

L19 Parallel Join Algorithms (Sorting)
SIMD
* Significant performance gains and resource utilization if an algorithm can be vectorized.

Moving data between DRAM and GPU is slow over PCI-E bus

Sort-Merge Join
Phase #1 Sort: Sort the tuples of R and S based on the join key
Phase #2 Merge: The outer relation R only needs to be scanned once

Sorting is always the most expensive part

Parallel Sort-Merge Join
Phase #1 Partitioning (optional): Partition R and assign them to workers / cores
Phase #2 Sort
Phase #3 Merge

Partitioning phase
* Divide the relations into chunks and assign them to cores

Sort phase
* Create runs of sorted chunks of tuples for both input relations
* It used to be that Quicksort was good enough
* Sorting networks is similar to ghost leg

Merge phase
* Iterate through the outer table and inner table in lockstep and compare join keys

Sort-Merge Join Variants
* Multi-way sort-merge (m-way)
* Multi-pass sort-merge (m-pass)
* Massively parallel sort-merge (mpsm)
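The sort and merge phases above can be sketched as follows (single-threaded, so the optional partition phase is omitted; inputs are toy data):

```python
def sort_merge_join(R, S):
    """Phase 1: sort both inputs on the join key.
       Phase 2: merge, scanning the outer relation R once."""
    R = sorted(R)
    S = sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            # emit all S tuples sharing this key for the current R tuple
            k = j
            while k < len(S) and S[k][0] == R[i][0]:
                out.append((R[i][0], R[i][1], S[k][1]))
                k += 1
            i += 1
    return out

print(sort_merge_join([(2, "r2"), (1, "r1")],
                      [(1, "s1"), (2, "s2"), (2, "s2b")]))
# [(1, 'r1', 's1'), (2, 'r2', 's2'), (2, 'r2', 's2b')]
```

Consistent with the notes, the sorts dominate the cost; the merge itself is a single lockstep scan of both sorted inputs.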

L20 Query Compilation
Code generation
* Transpilation
* JIT Compilation: generate IR(intermediate representation) of the query

L21 Vectorized Execution (Part I)
Multi-Cores vs MIC (Many Integrated Cores)
Streaming SIMD Extensions (SSE) is a collection of SIMD instructions that target special 128-bit SIMD registers
The restrict keyword in C++ tells the compiler that the arrays are distinct locations in memory


L22 Vectorized Execution (Part II)
Bitweaving
* Horizontal
* Vertical

OLTP Issues
* Run-time Operations ➝ Cold Tuple Identification
* Eviction Policies ➝ Timing and Evicted Tuple Metadata
* Data Retrieval Policies ➝ Granularity, Retrieval Mechanism and Merging back to memory

Fundamental elements of circuits
1. Capacitor
2. Resistor
3. Inductor
4. Memristor

Phase-change memory (PCM)
* A short pulse changes the cell to a '0'
* A long, gradual pulse changes the cell to a '1'

Resistive RAM (ReRAM)
Running a current in one direction moves electrons from the top TiO2 layer to the bottom, thereby changing the resistance

Magnetoresistive RAM (MRAM)
Stores data using magnetic storage elements instead of electric charge or current flows

NVM for database systems
Block-addressable NVM is not that interesting.
Byte-addressable NVM will be a game changer but will require some work to use correctly

NVM-Aware memory allocator
#1 synchronization
➝ The allocator writes back CPU cache lines to NVM using the CLFLUSH instruction
➝ It then issues a SFENCE instruction to wait for the data to become durable on NVM

#2 Naming
➝ The allocator ensures that virtual memory addresses assigned to a memory-mapped region never change even after the OS or DBMS restarts.