Write an ORC File
This example shows you how to write records to an ORC file using Data Pipeline's OrcDataWriter.
Optimized Row Columnar (ORC) is a file format that provides a highly efficient way to store Hive data.
Java Code Listing
package com.northconcepts.datapipeline.examples.orc;

import java.io.File;
import java.math.BigDecimal;
import java.math.BigInteger;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DebugReader;
import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.core.RecordList;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.internal.lang.Moment;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;
import com.northconcepts.datapipeline.orc.OrcDataReader;
import com.northconcepts.datapipeline.orc.OrcDataWriter;

public class WriteAnOrcFile {

    private static final File ORC_FILE = new File("example/data/output/WriteAnOrcFile.orc");

    public static void main(String[] args) {
        System.out.println("============================================================");
        System.out.println("Write records to an ORC file");
        System.out.println("============================================================");

        DataReader reader = new MemoryReader(createRecordList());
        reader = new DebugReader(reader);
        OrcDataWriter writer = new OrcDataWriter(ORC_FILE);

        Job.run(reader, writer);

        System.out.println("============================================================");
        System.out.println("Prepared Schema");
        System.out.println("============================================================");

        // Print the schema used to write the ORC file.
        System.out.println(writer.getSchema());

        System.out.println("============================================================");
        System.out.println("Read the ORC file");
        System.out.println("============================================================");

        Job.run(new OrcDataReader(ORC_FILE), new StreamWriter(System.out));
    }

    public static RecordList createRecordList() {
        RecordList recordList = new RecordList();

        Record record1 = new Record();
        record1.setField("BLOB", new byte[] { 2, 4, 6, 8, 10, 12 });
        record1.setField("BOOLEAN", true);
        record1.setField("BYTE", (byte) 97);
        record1.setField("CHAR", 'A');
        record1.setField("DATE", Moment.parseMoment("2014-12-25").getDatePart());
        record1.setField("DATETIME", Moment.parseMoment("2014-12-25 13:41:57").getDate());
        record1.setField("DOUBLE", 2048.1024);
        record1.setField("FLOAT", 4096.32f);
        record1.setField("INT", 8192);
        record1.setField("LONG", 1152921504606846976L);
        record1.setField("SHORT", (short) 32);
        record1.setField("BIG_DECIMAL", new BigDecimal("123.456789"));
        record1.setField("BIG_INTEGER", BigInteger.valueOf(123456L));
        record1.setField("STRING", "A basic numeric constant is considered an integer.");
        record1.setField("TIME", Moment.parseMoment("13:41:57").getTimePart());
        record1.setField("Array-2", new String[] { "J", "A", "V", "A" });
        record1.setField("Array-3", new Double[] { 123.123, 345.345, 456.456, 555.678 });
        record1.setField("RECORD", new Record()
                .setField("RECORD", new Record()
                        .setField("STRING", "A basic numeric constant is considered an integer.")
                        .setField("DOUBLE", 1234.12345D)));

        // Record with null values.
        Record record2 = new Record();
        record2.setFieldNull("BLOB", FieldType.BLOB);
        record2.setFieldNull("BOOLEAN", FieldType.BOOLEAN);
        record2.setFieldNull("BYTE", FieldType.BYTE);
        record2.setFieldNull("CHAR", FieldType.CHAR);
        record2.setFieldNull("DATE", FieldType.DATE);
        record2.setFieldNull("DATETIME", FieldType.DATETIME);
        record2.setFieldNull("DOUBLE", FieldType.DOUBLE);
        record2.setFieldNull("FLOAT", FieldType.FLOAT);
        record2.setFieldNull("INT", FieldType.INT);
        record2.setFieldNull("LONG", FieldType.LONG);
        record2.setFieldNull("SHORT", FieldType.SHORT);
        record2.setFieldNull("BIG_DECIMAL", FieldType.BIG_DECIMAL);
        record2.setFieldNull("BIG_INTEGER", FieldType.BIG_INTEGER);
        record2.setFieldNull("STRING", FieldType.STRING);
        record2.setFieldNull("TIME", FieldType.TIME);
        record2.setFieldNull("Array-2", FieldType.STRING);
        record2.setFieldNull("Array-3", FieldType.DOUBLE);
        record2.setFieldNull("RECORD", FieldType.RECORD);

        recordList.add(record1, record2);
        return recordList;
    }
}
Code Walkthrough
- createRecordList() creates records with and without values and adds them to a RecordList object.
- setField() creates a field with the field name specified in the first parameter and the value specified in the second parameter.
- setFieldNull() creates an empty field with the field name specified in the first parameter and the field type specified in the second parameter.
- MemoryReader is created to read records from the RecordList object returned by createRecordList().
- DebugReader wraps the MemoryReader to print records passing through the stream.
- OrcDataWriter is created to write records to the specified ORC file (i.e. WriteAnOrcFile.orc).
- Records are transferred from the DebugReader to the OrcDataWriter via the Job.run() method.
- writer.getSchema() returns the schema used to write the ORC file. An ORC schema is a struct type description that lists the name and type of each column.
- OrcDataReader is created and passed to the Job.run() method to stream records from the output ORC file to the console via a StreamWriter.

See how to compile and run data pipeline jobs.
OrcDataWriter
Writes records to Apache ORC columnar files. It extends the IntegrationWriter class and can be created using a File or Path object. This class also lets you specify the configuration, schema, and batch size (default 1024) of the output ORC file.
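As a minimal sketch of those options, the snippet below writes the example's record list to a new ORC file. The constructor and Job.run() calls come straight from the main listing; the setBatchSize() setter name is an assumption based on the description above, not confirmed API, and the sketch assumes it sits in the same package as WriteAnOrcFile.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;
import com.northconcepts.datapipeline.orc.OrcDataWriter;

public class OrcWriterSketch {

    public static void main(String[] args) {
        // Reuse the record list built in the main example.
        DataReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());

        OrcDataWriter writer = new OrcDataWriter(new File("example/data/output/OrcWriterSketch.orc"));
        // writer.setBatchSize(2048); // hypothetical setter; the default batch size is 1024

        Job.run(reader, writer);
    }
}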
OrcDataReader
Reads records from Apache ORC columnar files. This class also lets you retrieve the configuration, schema, and columns of a given ORC file.
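For example, reading the file produced above back to the console needs only a reader and a StreamWriter, exactly as in the last step of the main listing:

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class OrcReaderSketch {

    public static void main(String[] args) {
        OrcDataReader reader = new OrcDataReader(new File("example/data/output/WriteAnOrcFile.orc"));

        // Stream every record in the ORC file to standard output.
        Job.run(reader, new StreamWriter(System.out));
    }
}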
Record
The Record class holds persistent data in key-value fields as it flows through the pipeline. Its setField() method creates a new field as a key-value pair, taking a field name and a value as parameters. If you don't want to assign a value to a field upon creation, you can create an empty field with a field name and type using the setFieldNull() method (see the sketch after the FieldType section below).
FieldType
The FieldType enumeration lists all the field data types used in records. UNDEFINED is used for any type not natively supported by Data Pipeline.
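The snippet below shows setField() and setFieldNull() side by side, using only calls that appear in the main listing:

import com.northconcepts.datapipeline.core.FieldType;
import com.northconcepts.datapipeline.core.Record;

public class RecordSketch {

    public static void main(String[] args) {
        Record record = new Record();

        // setField() creates a field and assigns its value in one call.
        record.setField("NAME", "Jane");
        record.setField("AGE", 42);

        // setFieldNull() creates an empty field; the FieldType says what
        // type the field would hold if it had a value.
        record.setFieldNull("NICKNAME", FieldType.STRING);

        System.out.println(record);
    }
}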
RecordList
As the name suggests, RecordList stores a list of Record objects in memory. It implements Iterable, so you can work with it much like a Java collection.
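A short sketch, built from the varargs add() and chained setField() calls seen in the main listing:

import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.core.RecordList;

public class RecordListSketch {

    public static void main(String[] args) {
        RecordList recordList = new RecordList();
        recordList.add(
                new Record().setField("ID", 1),
                new Record().setField("ID", 2));

        // RecordList implements Iterable, so a for-each loop works.
        for (Record record : recordList) {
            System.out.println(record);
        }
    }
}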
MemoryReader
Obtains records from an in-memory RecordList. It extends the DataReader class and can be created with an optional RecordList object.
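For instance, the main example's record list can be streamed straight to the console without touching any ORC file (assuming this sketch can call WriteAnOrcFile.createRecordList(), e.g. from the same package):

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;

public class MemoryReaderSketch {

    public static void main(String[] args) {
        // Read the in-memory records and print them via a StreamWriter.
        MemoryReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());
        Job.run(reader, new StreamWriter(System.out));
    }
}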
DebugReader
A proxy that prints records passing through it to a stream in a human-readable format. This helps you track and debug every record you write to an ORC file.
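Wrapping any DataReader is a one-line change, as in the main listing. In this sketch (which again assumes access to WriteAnOrcFile.createRecordList()), each record appears twice on the console: once from the DebugReader proxy and once from the StreamWriter at the end of the pipeline.

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DebugReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.memory.MemoryReader;

public class DebugReaderSketch {

    public static void main(String[] args) {
        DataReader reader = new MemoryReader(WriteAnOrcFile.createRecordList());

        // Wrap the reader so every record is logged as it flows downstream.
        reader = new DebugReader(reader);

        Job.run(reader, new StreamWriter(System.out));
    }
}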