About Me

Software Engineer at Starburst. Maintainer at Trino. Previously at LINE, Teradata, HPE.

2017-09-28

Aster Basic Book


Below are my personal notes on the book Teradata Aster Basics.

Chapter 1
Four key characteristics of Big Data?
Complexity, Variety, Velocity, Volume

Chapter 2
During Partition Splitting (adding v-Workers), Aster Database is unavailable until the process is completed
  Adding Worker Nodes: more processing power per v-Worker
  Adding v-Workers (Partition Splitting): more parallelism across the system
Queen node should be RAID 5
Worker and Loader nodes should be RAID 0 for small configurations
Worker and Loader nodes should be RAID 5 for large configurations

Chapter 3
What is QueryGrid?
  It is a client-based user interface that leverages the updates
  It manages optimization so that each platform does what it does best when resolving the query request

Chapter 5
MVCC: The result is that over time, the table can contain logically deleted (invisible) rows that are no longer used, but are consuming table space

Chapter 6
Multi-structured data: consists of files that contain a wide variety of formats and data types in a non-fixed manner, which must be parsed and interpreted properly

Chapter 8
Map/Row function must implement the operateOnSomeRows method
Reduce/Partition function must implement the operateOnPartition method

Chapter 10
Rows returned in the RESULT operator are always?
Aggregations

2017-09-23

Introduction to Teradata Aster Analytics

This is a note for the Teradata Aster Basics 6.10 exam, a.k.a. TACP (Teradata Aster Certified Professional).
The recommended courses are listed below; this note covers the 3rd course.

  • Teradata Certification, What’s New and How to Prepare
  • Introduction to Big Data and Teradata Aster*
  • Introduction to Teradata Aster Analytics
  • Introduction to Teradata Aster Database Administrator*

Map function doesn’t have PARTITION BY
Reduce function has PARTITION BY
PARTITION BY affects SHUFFLE phases
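
The difference between a row (Map) function and a partition (Reduce) function can be sketched in plain Ruby. This is only an analogy, not SQL-MR itself: the sample rows and column names are made up, and `group_by` stands in for the shuffle that PARTITION BY triggers.

```ruby
# Hypothetical click rows; the column names are invented for this sketch.
rows = [
  { user: "a", page: "home" },
  { user: "b", page: "home" },
  { user: "a", page: "cart" },
]

# Row (Map) style: every row is processed on its own, so no PARTITION BY
# is needed and no row has to move to another worker.
upcased_pages = rows.map { |row| row[:page].upcase }

# Partition (Reduce) style: rows are first grouped by the PARTITION BY key
# (the shuffle), then each group is processed as a whole.
visits_per_user = rows.group_by { |row| row[:user] }
                      .map { |user, group| [user, group.size] }
                      .to_h

p upcased_pages
p visits_per_user
```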

Quiz
What are the two different types of SQL-MR functions? (Choose two.)
  Partition functions (Reduce) & Row functions (Map)
In SQL-MR the ON clause can be what three things? (Choose three)
  Function, Table & Query
Fill in the missing word. How do you distinguish between a Map vs. Reduce function?
  A REDUCE function has a PARTITION BY clause, whereas the Map function does not
What criteria would you use to determine if you want to run SQL versus SQL-MR? Select the four criteria that are better suited to run with SQL-MR. (Choose four)
  Unstructured or multi-structured data, Machine learning algorithms, Recursive queries & Self-joins

Acquisition functions

  • load_from_hadoop
  • load_to_hadoop
  • load_from_hadoop_dir
  • load_from_pst
  • load_tweets
  • anydatabase2aster
  • load_from_s3
  • load_to_s3

Define foreign server
create foreign server hdp21
using server('192.168.100.21')
dbname('default') username('hue')
do import with load_from_hcatalog,
do export with load_to_hcatalog;

create foreign server td15
using tdpid('192.168.100.15')
username('td01') password('td01')
do import with load_from_teradata,
do export with load_to_teradata;

Pull & Push-down query

--pull
select c1, sum(c2)
from t1@td15
group by 1;

--push down
select * from FOREIGN SERVER
($$ select c1, sum(c2) from t1 group by 1 $$)@td15;

Quiz
Which two Teradata QueryGrid connectors can acquire data for Aster? (Choose two)
  Aster-to-Teradata & Aster-to-Hadoop
Why move data between Teradata and Aster? Match Aster and Teradata to what each database is best designed for.
  Aster - for analytics by limited number of data scientists
  Teradata - for high concurrency (hundreds of users)
What are some Teradata Aster parser functions?
  Apache logs, xml, json and pst
Using the Stream API, you can write functions in programming languages that are not native to Teradata Aster (e.g., write non SQL-MR or SQL-GR functions) and run them on Aster, generating output that Aster can receive, including:

  • writing R functions to run on Teradata Aster
  • writing custom Python, Perl, or C/C++/C# functions to run on Teradata Aster


Quiz
nPath is used for Pattern Matching across Time Series
What three expressions are used to specify input data for nPath?
  on, partition by and order by
What three expressions are used to specify nPath search criteria?
  mode, pattern and symbols
What kind of function is Kmeans?
  clustering
What kind of function is Decision Tree?
  predictive function

Quiz
What visualization function(s) are in Teradata Aster AppCenter?
  Visualizer (formerly nPathViz and cFilterViz)
What needs to be configured before building a new Application?
  Create a JDBC connection
Name three Data Format types. (Choose three)
  nPath, Table and cFilter
Name four different chart types that Teradata Aster AppCenter visualizations create. (Choose four)
  Tree, Sankey, Sigma and Chord
How can users dynamically change Teradata Aster AppCenter chart visualizations?
  By clicking on objects and/or by changing Layout/Format specs

Quiz
How do you connect to Aster via RStudio?
  Aster ODBC driver
What is the name of the Aster package for Teradata Aster R?
  TeradataAsterR
You want to access Help for Teradata Aster R to see a list of commands. What syntax would accomplish this?
  help(package='TeradataAsterR')


Final Exam
Score: 96 (passed)

  1. True or False: Map-Reduce is a programming model and an associated implementation for processing and generating large data sets.
     Your answer: True (Correct)
  2. Each Map function performs an ETL on ____ in the input.
     Your answer: all rows (Correct)
  3. The ___ gets a key and the array of values emitted with that key and produces the final result.
     Your answer: Reduce Function (Correct)
  4. The SQL-MR syntax ON clause specifies the input rows, which can be a ___. (Choose four)
     Your answer: Table, View, Sub-query, SQL-MR function (Correct)
  5. Does the syntax use a Map Function or a Reduce Function? Drag and Drop the Map Function and Reduce Function labels (at left) to the correct syntax (at right).
     Your answer: 1-2, 2-1 (Correct)
  6. True or False: Functions can be Map and Reduce functions at the same time.
     Your answer: False (Correct)
  7. In the syntax below, click on the input.
     Your answer: 5 (Correct)
  8. Match the function (at left) with its description (at right):
     Your answer: 1L-1R, 2L-2R, 3L-3R, 4L-4R (Correct)
  9. The _____ is used for clustering. Clustering is a fast/simple method for grouping objects into preliminary clusters using an approximate distance method. Each point is represented as a point in a multidimensional feature space.
     Your answer: Canopy function (Correct)
  10. True or False: Map-Reduce is a programming model and an associated implementation for processing and generating large data sets.
     Your answer: True (Correct)
  11. The _____ can extract multiple columns of structured data from standard Apache Web Logs.
     Your answer: Apache Log Parser (Correct)
  12. Match the function (at left) with what it’s used for (at right):
     Your answer: 1L-1R, 2L-2R, 3L-3R, 4L-4R, 5L-5R, 6L-6R, 7L-7R, 8L-8R (Correct)
  13. This question tests your knowledge of nPath pattern matching using the mode: non-overlapping and the pattern: ‘B+.C.A’. Given this input table and nPath syntax, which pattern matches will be in the output rows?
     Your answer: 1 row: BBBCA (Correct)
  14. This question tests your knowledge of nPath pattern matching using the mode: overlapping and the pattern: ‘B+.C.A’. Given this input table and nPath syntax, which pattern matches will be in the output rows?
     Your answer: 3 rows: BBBCA, BBCA, BCA (Correct)
  15. True or False: In Teradata Aster a single SQL-MR statement can call all of the necessary functions to go through acquiring data, to preparing it, to the multi-genre analyzing of it, and finally to visualizing it.
     Your answer: False (Incorrect; correct answer: True)
  16. In Teradata Aster the ____ function creates a row in a visualization table where Teradata Aster AppCenter can access it, view it, and manipulate it.
     Your answer: ‘Visualizer’ (Correct)
  17. True or False: Teradata Aster R packages addresses the Challenges of R by allowing programmers to scale R analytics by leveraging Teradata Aster.
     Your answer: True (Correct)
  18. Before being able to connect to the Teradata Aster cluster and start issuing Teradata Aster R commands, you must do which two things? (Choose two)
     Your answer: Install/configure Teradata Aster 6.20 ODBC driver, Install RODBC and Teradata Aster R packages (Correct)
  19. What are RMapReduce runners?
     Your answer: Functions to run R-code in Teradata Aster (Correct)
  20. Match the Teradata Aster R function to what you would use it for:
     Your answer: 1L-1R, 2L-2R, 3L-3R, 4L-4R, 5L-5R (all five correct)
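
The answers to questions 13 and 14 can be reproduced with a small Ruby sketch. The exam's input table is not in these notes, so the event sequence B, B, B, C, A is an assumption, and the nPath pattern 'B+.C.A' is approximated with a regular expression rather than run through nPath itself.

```ruby
# Assumed event sequence for one partition, already in ORDER BY order.
events = "BBBCA"

# The nPath pattern 'B+.C.A' written as a Ruby regular expression.
pattern = /B+CA/

# Non-overlapping mode: after a match, the search resumes after its last row.
non_overlapping = events.scan(pattern)

# Overlapping mode: try to start a match at every row, even inside a
# previous match.
overlapping = (0...events.length).map { |start|
  events[start..-1][/\A#{pattern.source}/]
}.compact

p non_overlapping
p overlapping
```

With this assumed input, the non-overlapping run yields the single match and the overlapping run yields the three matches given in the exam answers.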


2017-09-21

Introduction to Teradata Aster Database Administration


This is a note for the Teradata Aster Basics 6.10 exam, a.k.a. TACP (Teradata Aster Certified Professional).
The recommended courses are listed below; this note covers the 2nd course.
  • Teradata Certification, What’s New and How to Prepare
  • Introduction to Big Data and Teradata Aster
  • Introduction to Teradata Aster Analytics
  • Introduction to Teradata Aster Database Administrator
nc_system schema holds system information.
3 categories of DD views
  • nc_all
  • nc_user
  • nc_user_owned
Replication Factor
  • RF=1: No secondary v-Workers. No fallback
  • RF=2: If a Worker goes down, the secondary v-Worker is promoted to be the new primary v-Worker. Max is 2. A primary and its replica are not located on the same Worker node.
Ganglia is an open source, web-based, scalable distributed system monitoring tool.
AMC Status
  • Green: operating normally
  • Blue: decrease in performance
  • Yellow: unable to process statement requests
  • Red: stopped
  • White/Clear: no longer able to establish a connection
Aster Database supports only B-tree indexes and cannot enforce referential integrity.
There is no data sharing among Aster databases.

/*Change database*/
beehive=> \connect retails_sales;
retails_sales=>
retails_sales=>database beehive;
beehive=>

/*help*/
beehive=>\?

/*List database*/
beehive=> \l

/*Exit database*/
beehive=> \q

/*List schemas*/
beehive=> \dn

/*View tables in the PROD schemas*/
beehive=> \dt prod.*

/*View columns/data types*/
beehive=> \d prod.sales_fact

/*Show current schema*/
show search_path;
ALTER USER beehive SET SEARCH_PATH = 'public', 'mkt';

An unqualified table name is resolved by walking the search path until a matching name is found
CONNECT privilege must be given to access the database. USAGE privilege must be given to access the schema.
Two Serial types: Global and Local.
  • A Serial Global type ensures the serial property across all of the nodes in the system.
  • A Serial Local type ensures the serial property local to each logical partition of data.
PARTITION BY RANGE: START includes the value, but END excludes it
partition sales_june (START'2017-06-01'::date END'2017-07-01'::date)
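
The START-inclusive, END-exclusive rule behaves like a Ruby three-dot range. This is only an analogy to show which dates fall into the `sales_june` partition; it does not talk to Aster.

```ruby
require "date"

# Three-dot range: includes the start date, excludes the end date,
# like START '2017-06-01' END '2017-07-01'.
sales_june = Date.new(2017, 6, 1)...Date.new(2017, 7, 1)

first_day_in   = sales_june.cover?(Date.new(2017, 6, 1))   # START is included
last_day_in    = sales_june.cover?(Date.new(2017, 6, 30))
end_boundary_in = sales_june.cover?(Date.new(2017, 7, 1))  # END is excluded

p [first_day_in, last_day_in, end_boundary_in]
```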
PARTITION BY LIST: If an incoming row does not fit into any partition, that row will not be loaded into the table
1. Data Modeling Quiz
  • Q. Best schemas for Teradata Aster databases
    A. Star schema and Snowflake schema
Aster column name rules
  • Starts with a character
  • Must be < 63 characters
  • Names may include special characters (_ , $)
Constraint Options
  • Null/Not Null
  • Primary key
  • Default values
  • Check values
create table stuff
(
emp int NOT NULL PRIMARY KEY,
dept varchar DEFAULT 'none',
age smallint CHECK(age >= 18 and age <= 70),
name varchar
)
distribute by replication
;

Data Types
  • CHAR, CHARACTER VARYING, VARCHAR(n): maximum is 10MB
  • TEXT is unlimited
  • Special types are Boolean, Bytea, Serial, Big Serial
Supported data types for the distribution key
  • smallint
  • integer
  • bigint
  • numeric
  • text
  • varchar
  • uuid
  • bytea
For large tables (> 1 million rows, usually Fact tables)
For small tables (<= 1 million rows, usually Dimension tables)
If the HASH key and the JOIN columns don’t match, a SHUFFLE will occur.
TRUNCATE: Quickly remove all rows and it reclaims disk space immediately
VACUUM: Converts dead space into usable free space
VACUUM FULL ANALYZE: Physically rearrange the data on disk
NC_RELATIONSTATS: Generate various reports

2: Creating Tables Quiz
Tables in a Teradata Aster Database can be of which four variations? (Choose four.)
  • Temporary(Fact/Dim)
  • Analytic(Fact/Dim)
  • Fact
  • Dimension
What data type is commonly used for “payload” columns? Click on the correct data type in the image.
TEXT
In Teradata Aster, table data may be partitioned in which two ways? (Choose two.)
Logically Partitioned tables (Logical)
Fact tables(Physical)
How do these two partitioning types improve performance? Match the partitioning type to how it improves performance.
Physical: More v-Workers equal more parallelism
Logical: Reduced disk I/O by only reading needed partitions
Scenario: You join 2 FACT tables where the Hash column matches the JOIN column. Will a shuffling of data occur?
No, the JOIN will commence immediately since JOIN column values are guaranteed to be on the same v-Worker.
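
Why a join on the hash column needs no shuffle can be sketched in Ruby. The placement rule below (CRC32 modulo the number of v-Workers) is a stand-in chosen for illustration; Aster's real hash function is not described in these notes, and the table contents are invented.

```ruby
require "zlib"

# Hypothetical placement rule: hash the distribution key, then take the
# remainder by the (assumed) number of v-Workers.
def v_worker_for(key, v_workers = 4)
  Zlib.crc32(key.to_s) % v_workers
end

# Two made-up fact tables, both distributed by customer_id.
orders   = [{ customer_id: 101, total: 30 }, { customer_id: 202, total: 75 }]
payments = [{ customer_id: 202, paid: 75 }, { customer_id: 101, paid: 30 }]

co_located = orders.map do |order|
  payment = payments.find { |pay| pay[:customer_id] == order[:customer_id] }
  # Equal key values always hash to the same v-Worker, so the matching
  # rows are already on the same v-Worker before the join starts.
  v_worker_for(order[:customer_id]) == v_worker_for(payment[:customer_id])
end

p co_located
```

If the tables were distributed on different columns, the same rule would scatter matching rows across v-Workers, which is when a shuffle becomes necessary.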
nCluster loader arguments
  • -B --begin-script
  • -E --end-script
  • -d --dbname
  • -D --delimiter
  • -c --csv
  • -l --loader
  • --truncate-table
  • -w --password
  • -z --auto-analyze
  • -U --username
  • -p --port
  • --el-enabled
  • --skip-rows-1
Default delimited format is TSV
-B and -E each specify the name of a script to execute (at the beginning and at the end of the load)
Parallelizing the Load tier
  • Add more loader nodes
  • Add more staging machines
  • Add more nCluster loaders running on the staging machines
Error logging is turned off by default. This means the load job will abort and roll back the data on encountering the first error.
  • --el-enabled
  • --el-limit <#>
  • --el-table
  • --el-label
  • --el-errfile
ncluster_export example
ncluster_export -h 192.168.100.100 -d beehive -U beehive -w beehive
\"aaf\".\"accesslog\" myfile.txt

3: Data Loading Quiz
What is the name of the Teradata Aster Database bulk loading tool? ncluster_loader
Which two node types can handle Teradata Aster data loading? (Choose two.)
Loader nodes and Queen node (if there are no loader nodes)
Which task do Loader nodes perform during loading?
Hashing the Distribution Key for v-Worker placement
The loading tier can be scaled in which three ways? (Choose three.)
Add more nCluster Loaders, Loader Nodes, Staging Machines
In addition to the nCluster Loader Tool, which four other types of tools are used to load a Teradata Aster Database? (Choose four.)
ETL Tools, SQL Statements, Connectors, Teradata QueryGrid (Aster-to-Hadoop, Aster-to-Teradata)

Final Exam
  • Q. You have 5 Teradata Aster Databases. How many Data Dictionaries do you have?
    A. 5 - one for each Teradata Aster Database
  • Q. Which two statements are true regarding a Teradata Aster Database? (Choose two.)
    A. Each user must be given the CONNECT privilege on a database to access objects on the database; by default, there is one database in a newly installed Teradata Aster cluster, called beehive
  • Q. True or False: The Aster Loader Tool must point to the Queen and can optionally point to the Cluster Loader Node for hashing.
    A. True
  • Q. Which two statements are true regarding a Teradata Aster Database? (Choose two.)
    A. Data objects may be shared across schemas in the same database; users can join tables from one schema with tables in another schema if they have proper privileges for the schemas/tables

2017-09-20

RubyKaigi 2017


The following are my personal notes, so apologies for the rough and fragmentary sentences. Fewer notes on a talk doesn’t mean its quality was low; I was just concentrating on listening.

Day1

Matz team (only Matz and Nobu)
Daily: debugging, new features, bug making, etc.
Why not Git? Windows is not officially supported. Not enough advantages
Developers’ meetings are held once per month
Ruby is built with configure and make
BASERUBY: pre-installed Ruby; generates source files
MINIRUBY: Ruby made during the build. No dynamic loading. Unable to load extension libraries.
Mimic global variables used in mkmf.rb by trace_var
The following is definitely not a bug
p = 2
p (-1.3).abs = -1.3
Demon Castle parse.y by name
Monstrous lex_state
literal symbol by intern
Refining String#intern returns no-symbol
New features in 2.5?
Such as $. 2.3
Unicode case 2.4
Approved: Array#append, Array#prepend
Rejected: neko ^..^ operator (make range), user-defined operators
Under discussion: method extraction operator (Kernel#method -> Method instance), rightward assignment
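
A quick look at the approved `Array#append` and `Array#prepend`, which shipped in Ruby 2.5 as aliases of `push` and `unshift`. The sample array is arbitrary.

```ruby
queue = [2, 3]

queue.append(4)    # same as queue.push(4)
queue.prepend(1)   # same as queue.unshift(1)

p queue  # => [1, 2, 3, 4]
```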
Wouldn’t you write New Ruby?
Ruby History
to_json
jbuilder is very slow
ActiveModel::Serializers JSON-Schema API Blueprint(apiary), OpenAPI(Swagger), RAML, JSON Hyper-Schema, bare JSON Schema
Why choose OpenAPI? It has a more RESTful definition than API Blueprint
OpenAPI was originally developed as Swagger. Give type into OpenAPI
Use $merge and $ref when porting to OpenAPI
GraphQL is an API query language like SQL; it is WYSIWYG and has only 1 endpoint
Ecosystem is still insufficient
BFF is the abbreviation of Backend for Frontend
5000 warmup iterations increase 5-7%
Code is available in noahgibbs/rails_ruby_bench
Rails version is 4.2.6
PyCall runs in Ruby interpreter
Use pandas in Rails app
Python is a best friend of Ruby from now on!
PyCall should be a temporary way
Red Data Tools project will be a home
Apache Arrow aims to be Arrow Memory for common tools
Red Arrow is the Ruby library for Apache Arrow
Sutou-san (@kou) officially became a member of the Apache Arrow PMC yesterday
Jupyter Notebook also supports Ruby
Python is managed by reference counting
Some types (string, dictionary, etc.) are converted from Python to Ruby primitive types
Ruby Committers vs the World
Method chains read left to right, so why does assignment go to the left? That is the motivation for rightward assignment. Alternative: use a method chain (e.g. assign_to)
Besides Ruby, Emacs Lisp and Streem, Matz likes Swift and Clojure
ActiveSupport is made for web application
Day2
RubyKaigi is the biggest ruby conference!
Talk about talent
Love of languages
The first language Matz used was BASIC
Matz posted about Ruby to Python mailing list
Simula (1968) first object oriented language
Created by Dr. Kristen Nygaard
Lisp (Flavors)
C3 linearization algorithm
Alias method chain vs Module#prepend
CLOS (Common Lisp Object System): method combination enables hooks before/after/around method calls
Gorilla, Guerrilla, Monkey patching has global influence
Scoped monkey patching?
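
A minimal sketch of the Module#prepend side of the "alias method chain vs Module#prepend" comparison above. The `Greeter` class and `Shouting` module are invented for illustration; the point is that `super` reaches the original method without any aliasing.

```ruby
class Greeter
  def greet(name)
    "Hello, #{name}"
  end
end

# Module#prepend puts the module in front of the class in the ancestor
# chain, so `super` calls the original Greeter#greet.
module Shouting
  def greet(name)
    super.upcase + "!"
  end
end

Greeter.prepend(Shouting)

greeting = Greeter.new.greet("Matz")
puts greeting                      # => HELLO, MATZ!
p Greeter.ancestors.first(2)       # => [Shouting, Greeter]
```

With an alias method chain, the same effect needs two extra method names (`greet_with_shouting` and `greet_without_shouting`) living on the class itself.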
Line coverage, Branch coverage, Path coverage
C0, C1, C2 coverage
Coverage is just a measure, not a goal
SimpleCov is a wrapper for coverage.so
Method definition is counted as an execution
Concov detects temporal changes in coverage
The ticket is #13901
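
The note that a method definition is counted as an execution can be checked with the standard `coverage` library, which SimpleCov wraps. This sketch writes a throwaway file (the file name and the `add` method are made up), loads it without ever calling the method, and prints the per-line counts.

```ruby
require "coverage"
require "tempfile"

source = <<~RUBY
  def add(a, b)
    a + b
  end
RUBY

counts = nil
Tempfile.create(["coverage_demo", ".rb"]) do |file|
  file.write(source)
  file.flush

  Coverage.start          # must be started before the file is loaded
  load file.path          # defines add, but never calls it
  result = Coverage.result

  # Look the file up by base name in case the path was normalized.
  _, counts = result.find { |path, _| File.basename(path) == File.basename(file.path) }
end

# The first entry belongs to the `def` line, the second to the method body.
p counts
```

The `def` line is reported as executed even though the body never ran, which is one reason a high line-coverage number alone says little.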
RuboCop is an industry standard solution
RubyMine can detect more errors than RuboCop
Coverage is a lie
YARV compiles code into the bytecode
JetBrains/ruby-type-inference
JIT Type Checking for Dynamic Languages (Ren, 2016)
def foo(x)
  "".bar if x
end

foo(false)
Steep checks types with annotations
Editor specific library & Universal LSP Client plugin
Auto-completion and jump-to-definition are WIP
mtsmfm/language_server-ruby
Syntax check uses ruby -wc filename.rb
Auto-completion uses rcodetools
Day3
Copy on Write(CoW)
memsize_of, dup
CoW page faults cause bad performance
Unicorn is a forking web server that has a parent and many children
Garbage collector affects CoW
GC compaction means moving objects so that the OS does not have to copy so many pages
Some people said “Moving objects is impossible”
Aaron tried ‘Two Finger Compaction’
Disadvantage: slow
Advantage: easy
Object movement uses two fingers: one is the free pointer, the other is the scan pointer
The free pointer looks for free addresses and the scan pointer looks for used addresses, until the two fingers meet
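
The two-finger idea can be sketched on a toy heap in Ruby. This is a simplification under stated assumptions: the heap is an array, `nil` marks a free slot, and the step where a real collector updates references to moved objects is left out entirely.

```ruby
# Toy heap: nil is a free slot, symbols are live objects (made-up data).
def two_finger_compact(heap)
  heap = heap.dup
  free = 0                 # free pointer: walks right looking for free slots
  scan = heap.length - 1   # scan pointer: walks left looking for live objects

  loop do
    free += 1 while free < scan && !heap[free].nil?
    scan -= 1 while scan > free && heap[scan].nil?
    break if free >= scan  # the two fingers met

    # Move the live object into the free slot and leave a hole behind.
    heap[free] = heap[scan]
    heap[scan] = nil
  end

  heap
end

compacted = two_finger_compact([:a, nil, :b, nil, nil, :c, :d])
p compacted  # => [:a, :d, :b, :c, nil, nil, nil]
```

Live objects end up packed at the start of the heap, but their relative order changes, which is part of why the algorithm is easy to write.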
What objects can move?
Everything!?
GC can easily know about references from Ruby
If the reference is from a C extension, it’s difficult
hash_key uses memory address: fix is to cache the hash key
rb_gc_mark dual references: fix is to call rb_gc_mark, or use only Ruby
rb_define_class
string literals
It seems like nothing can move, but most can be fixed
46% can move!
ObjectSpace.dump_all can output as json
tenderlove/heap-utils
/proc/{PID}/smaps
Question your assumptions
Replication with Quorum
Create at least 2 replicas of the data (max 3). If the first response is OK, discard the thread for the other node
Bigdam: Edge locations on the earth + the Central location
Bigdam-pool: Distributed key-value storage, used to build an S3-free data ingestion pipeline
chunk id guarantees the uniqueness
Ruby 2.5 supports block-wide rescue
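
A small example of the block-wide rescue mentioned above: from Ruby 2.5, a `do ... end` block can carry its own `rescue` clause without an inner `begin`/`end`. The input numbers are arbitrary.

```ruby
results = [1, 0, 2].map do |n|
  10 / n
rescue ZeroDivisionError
  nil   # the rescue belongs to the do/end block itself
end

p results  # => [10, nil, 5]
```

This form works only with `do ... end` blocks, not with brace blocks.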

JRuby’s Startup Warmup phase is slower than MRI
Has to load tons of Java classes
Warmup phase has JRuby (Ruby interpreter, Ruby JIT) & JVM (Java interpreter, Java JIT)
Graal is new JIT for all languages on the JVM
jruby/jruby
RTL
In a special case, when GCC optimizes a simple loop, the byte code doesn’t have a loop, but the JVM cannot do this
MJIT status
* Unstable, doesn’t work on Windows, one more year to mature
* No inlining yet(most important optimizations!), Use C inlining, new GCC/LLVM extension
* Will RTL and MJIT be a part of MRI?
References
Slides

This was my first RubyKaigi, and it was really interesting. Aaron’s talk in particular left an impression on me, and of course Matz’s talk did too. Until this conference I simply liked writing Ruby, but now I like both Ruby and its community. (Uh, sounds like a poem.) Matsuda-san announced that the next RubyKaigi will be held in Sendai, Miyagi (May 31st - June 2nd). I will definitely go again next year!


Atomic-Bomb Dome

Peace Memorial Park

Hiroshima Castle