Write an ORC File
This example shows you how to write records to an ORC file using Data Pipeline's OrcDataWriter.
Optimized Row Columnar (ORC) is a file format that provides a highly efficient way to store Hive data.
Java Code Listing
package com.northconcepts.datapipeline.examples.orc;

import java.io.File;
import java.math.BigDecimal;
import java.math.BigInteger;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DebugReader;
import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.core.RecordList;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.internal.lang.Moment;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;
import com.northconcepts.datapipeline.orc.OrcDataReader;
import com.northconcepts.datapipeline.orc.OrcDataWriter;

public class WriteAnOrcFile {

    private static final File ORC_FILE = new File("example/data/output/WriteAnOrcFile.orc");

    public static void main(String[] args) {
        System.out.println("============================================================");
        System.out.println("Write records to an ORC file");
        System.out.println("============================================================");

        DataReader reader = new MemoryReader(createRecordList());
        reader = new DebugReader(reader);
        OrcDataWriter writer = new OrcDataWriter(ORC_FILE);

        Job.run(reader, writer);

        System.out.println("============================================================");
        System.out.println("Prepared Schema");
        System.out.println("============================================================");

        // Print the schema used to write the ORC file.
        System.out.println(writer.getSchema());

        System.out.println("============================================================");
        System.out.println("Read the ORC file");
        System.out.println("============================================================");

        Job.run(new OrcDataReader(ORC_FILE), new StreamWriter(System.out));
    }

    public static RecordList createRecordList() {
        RecordList recordList = new RecordList();

        Record record1 = new Record();
        record1.setField("BLOB", new byte[] { 2, 4, 6, 8, 10, 12 });
        record1.setField("BOOLEAN", true);
        record1.setField("BYTE", (byte) 97);
        record1.setField("CHAR", 'A');
        record1.setField("DATE", Moment.parseMoment("2014-12-25").getDatePart());
        record1.setField("DATETIME", Moment.parseMoment("2014-12-25 13:41:57").getDate());
        record1.setField("DOUBLE", 2048.1024);
        record1.setField("FLOAT", 4096.32f);
        record1.setField("INT", 8192);
        record1.setField("LONG", 1152921504606846976L);
        record1.setField("SHORT", (short) 32);
        record1.setField("BIG_DECIMAL", new BigDecimal("123.456789"));
        record1.setField("BIG_INTEGER", BigInteger.valueOf(123456L));
        record1.setField("STRING", "A basic numeric constant is considered an integer.");
        record1.setField("TIME", Moment.parseMoment("13:41:57").getTimePart());
        record1.setField("Array-2", new String[] { "J", "A", "V", "A" });
        record1.setField("Array-3", new Double[] { 123.123, 345.345, 456.456, 555.678 });
        record1.setField("RECORD", new Record()
                .setField("RECORD", new Record()
                        .setField("STRING", "A basic numeric constant is considered an integer.")
                        .setField("DOUBLE", 1234.12345D)));

        // Record with null values.
        Record record2 = new Record();
        record2.setFieldNull("BLOB", FieldType.BLOB);
        record2.setFieldNull("BOOLEAN", FieldType.BOOLEAN);
        record2.setFieldNull("BYTE", FieldType.BYTE);
        record2.setFieldNull("CHAR", FieldType.CHAR);
        record2.setFieldNull("DATE", FieldType.DATE);
        record2.setFieldNull("DATETIME", FieldType.DATETIME);
        record2.setFieldNull("DOUBLE", FieldType.DOUBLE);
        record2.setFieldNull("FLOAT", FieldType.FLOAT);
        record2.setFieldNull("INT", FieldType.INT);
        record2.setFieldNull("LONG", FieldType.LONG);
        record2.setFieldNull("SHORT", FieldType.SHORT);
        record2.setFieldNull("BIG_DECIMAL", FieldType.BIG_DECIMAL);
        record2.setFieldNull("BIG_INTEGER", FieldType.BIG_INTEGER);
        record2.setFieldNull("STRING", FieldType.STRING);
        record2.setFieldNull("TIME", FieldType.TIME);
        record2.setFieldNull("Array-2", FieldType.STRING);
        record2.setFieldNull("Array-3", FieldType.DOUBLE);
        record2.setFieldNull("RECORD", FieldType.RECORD);

        recordList.add(record1, record2);
        return recordList;
    }
}
Code Walkthrough
- createRecordList() creates records with and without values and adds them to a RecordList object.
- setField() creates a field with the field name specified in the first parameter and the value specified in the second parameter.
- setFieldNull() creates an empty field with the field name specified in the first parameter and the field type specified in the second parameter.
- MemoryReader is created to read records from the RecordList object returned by createRecordList().
- DebugReader wraps the MemoryReader to print records passing through the stream.
- OrcDataWriter is created to write records to the specified ORC file (i.e. WriteAnOrcFile.orc).
- Records are transferred from the DebugReader to the OrcDataWriter via the Job.run() method.
- writer.getSchema() returns the schema used to write the ORC file. An ORC schema is a struct type description that lists the name and type of each column.
- OrcDataReader is created and passed to the Job.run() method to stream records from the output ORC file to the console via a StreamWriter.

See how to compile and run data pipeline jobs.
OrcDataWriter
Writes records to Apache ORC columnar files. It extends the IntegrationWriter class and can be created using a File or Path object. This class also lets you specify the configuration, schema, and batch size (default 1024) of the output ORC file.
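As a minimal sketch of those options, the snippet below writes the example's record list to a new ORC file. The constructor and Job.run() calls come straight from the main listing; the setBatchSize() setter name is an assumption based on the description above, not confirmed API, and the sketch assumes it sits in the same package as WriteAnOrcFile.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;
import com.northconcepts.datapipeline.orc.OrcDataWriter;

public class OrcWriterSketch {

    public static void main(String[] args) {
        // Reuse the record list built in the main example.
        DataReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());

        OrcDataWriter writer = new OrcDataWriter(new File("example/data/output/OrcWriterSketch.orc"));
        // writer.setBatchSize(2048); // hypothetical setter; the default batch size is 1024

        Job.run(reader, writer);
    }
}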
OrcDataReader
Reads records from Apache ORC columnar files. This class also lets you retrieve the configuration, schema, and columns of a given ORC file.
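For example, reading the file produced above back to the console needs only a reader and a StreamWriter, exactly as in the last step of the main listing:

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class OrcReaderSketch {

    public static void main(String[] args) {
        OrcDataReader reader = new OrcDataReader(new File("example/data/output/WriteAnOrcFile.orc"));

        // Stream every record in the ORC file to standard output.
        Job.run(reader, new StreamWriter(System.out));
    }
}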
Record
The Record class holds persistent data in key-value fields as it flows through the pipeline. Its setField() method creates a new field as a key-value pair, taking a field name and a value as parameters. If you don't want to assign a value to a field upon creation, you can create an empty field with a field name and type using the setFieldNull() method (see the sketch after the FieldType section below).
FieldType
The FieldType enumeration lists all the field data types used in records. UNDEFINED is used for any type not natively supported by Data Pipeline.
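The snippet below shows setField() and setFieldNull() side by side, using only calls that appear in the main listing:

import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.Record;

public class RecordSketch {

    public static void main(String[] args) {
        Record record = new Record();

        // setField() creates a field and assigns its value in one call.
        record.setField("NAME", "Jane");
        record.setField("AGE", 42);

        // setFieldNull() creates an empty field; the FieldType says what
        // type the field would hold if it had a value.
        record.setFieldNull("NICKNAME", FieldType.STRING);

        System.out.println(record);
    }
}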
RecordList
As the name suggests, RecordList stores a list of Record objects in memory. It implements Iterable, so you can work with it much like a Java collection.
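A short sketch, built from the varargs add() and chained setField() calls seen in the main listing:

import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.core.RecordList;

public class RecordListSketch {

    public static void main(String[] args) {
        RecordList recordList = new RecordList();
        recordList.add(
                new Record().setField("ID", 1),
                new Record().setField("ID", 2));

        // RecordList implements Iterable, so a for-each loop works.
        for (Record record : recordList) {
            System.out.println(record);
        }
    }
}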
MemoryReader
Obtains records from an in-memory RecordList. It extends the DataReader class and can be created with an optional RecordList object.
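For instance, the main example's record list can be streamed straight to the console without touching any ORC file (assuming this sketch can call WriteAnOrcFile.createRecordList(), e.g. from the same package):

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;

public class MemoryReaderSketch {

    public static void main(String[] args) {
        // Read the in-memory records and print them via a StreamWriter.
        MemoryReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());
        Job.run(reader, new StreamWriter(System.out));
    }
}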
DebugReader
A proxy that prints records passing through it to a stream in a human-readable format. This helps you track and debug every record you write to an ORC file.
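Wrapping any DataReader is a one-line change, as in the main listing. In this sketch (which again assumes access to WriteAnOrcFile.createRecordList()), each record appears twice on the console: once from the DebugReader proxy and once from the StreamWriter at the end of the pipeline.

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DebugReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;

public class DebugReaderSketch {

    public static void main(String[] args) {
        DataReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());

        // Wrap the reader so every record is logged as it flows downstream.
        reader = new DebugReader(reader);

        Job.run(reader, new StreamWriter(System.out));
    }
}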