About Me

My photo
Software Engineer at Starburst. Maintainer at Trino. Previously at LINE, Teradata, HPE.

2023-05-04

Migrate Hive tabes to Iceberg in Trino

Trino version 411 introduced 'migrate' procedure in Iceberg connector. This procedure coverts the existing Hive tables with ORC, Parquet & Avro format to Iceberg table. This article explains the details of the procedure. If you execute CREATE TABLE AS SELECT statement to convert Hive to Iceberg, I would recommend trying this procedure. The procedure will be much faster because it doesn't rewrite files. 

The procedure accepts 3 arguments (schema_name, table_name and optional recursive_directory). The possible values for recursive_directory argument are true, false and fail. The default value is fail that throws an exception if the nested directory exists under the table or partition location.  

CALL iceberg.system.migrate(schema_name => 'testdb', table_name => 'customer_orders', recursive_directory => 'true'); 

Let me explain the details of the implementation next. 

  1. Generate Iceberg schema object based on Hive table definition
  2. Iterate over the table or partition location to create Iceberg metadata files
    1. Build Iceberg Metrics 
    2. Build Iceberg DataFile
  3. Update the table definition in the metastore

All logic exists in MigrateProcedure.java if you want to check the code.

Limitation

  • The procedure scan files sequentially. If the target table has a lot of files, it will take a long time to complete. 
  • The PR to migrate Delta Lake tables to Iceberg is in progress https://github.com/trinodb/trino/pull/17131. There's no plan to support other format (e.g. Hudi) at this time.