Read an ORC File

Updated: Aug 22, 2023

This example shows how to read ORC (Optimized Row Columnar) files, a columnar storage format designed for processing large datasets. It reads every record in an ORC file, writes the records to the console, and then prints the file's schema. The same approach applies wherever structured data is stored as ORC, such as analytics and data warehousing workloads.


Java Code Listing

package com.northconcepts.datapipeline.examples.orc;

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFile {

    public static void main(String[] args) {
        // Read records from the ORC file and write each one to the console
        OrcDataReader reader = new OrcDataReader(new File("example/data/input/input_orc_file.orc"));
        Job.run(reader, new StreamWriter(System.out));

        System.out.println("============================================================");
        System.out.println("ORC Schema");
        System.out.println("============================================================");
        
        // Print the ORC schema (a struct<...> type description) of the file
        System.out.println(reader.getSchema());
    }

}

 

Code Walkthrough

  1. An OrcDataReader is created for the input file input_orc_file.orc.
  2. Job.run() transfers the records from the reader to a StreamWriter wrapping System.out, which prints each record to the console. See how to compile and run data pipeline jobs.
  3. The schema of the ORC file is retrieved with reader.getSchema() and printed to the console.

 

Console Output

-----------------------------------------------
0 - Record (MODIFIED) (has child records) {
    0:[boolean1]:BOOLEAN=[false]:Boolean
    1:[byte1]:BYTE=[1]:Byte
    2:[short1]:SHORT=[1024]:Short
    3:[int1]:INT=[65536]:Integer
    4:[long1]:LONG=[9223372036854775807]:Long
    5:[float1]:FLOAT=[1.0]:Float
    6:[double1]:DOUBLE=[-15.0]:Double
    7:[bytes1]:BLOB=[[0...4]]:[B
    8:[string1]:STRING=[hi]:String
    9:[middle]:RECORD=[
        Record (MODIFIED) (is child record) (has child records) {
            0:[list]:ARRAY of RECORD=[[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[bye]:String
                }, 
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[2]:Integer
                    1:[string1]:STRING=[sigh]:String
                }]]:ArrayValue
        }]:Record
    10:[list]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[3]:Integer
            1:[string1]:STRING=[good]:String
        }, 
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[4]:Integer
            1:[string1]:STRING=[bad]:String
        }]]:ArrayValue
    11:[map]:STRING=[null]
}

-----------------------------------------------
1 - Record (MODIFIED) (has child records) {
    0:[boolean1]:BOOLEAN=[true]:Boolean
    1:[byte1]:BYTE=[100]:Byte
    2:[short1]:SHORT=[2048]:Short
    3:[int1]:INT=[65536]:Integer
    4:[long1]:LONG=[9223372036854775807]:Long
    5:[float1]:FLOAT=[2.0]:Float
    6:[double1]:DOUBLE=[-5.0]:Double
    7:[bytes1]:BLOB=[[]]:[B
    8:[string1]:STRING=[bye]:String
    9:[middle]:RECORD=[
        Record (MODIFIED) (is child record) (has child records) {
            0:[list]:ARRAY of RECORD=[[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[bye]:String
                }, 
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[2]:Integer
                    1:[string1]:STRING=[sigh]:String
                }]]:ArrayValue
        }]:Record
    10:[list]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[100000000]:Integer
            1:[string1]:STRING=[cat]:String
        }, 
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[-100000]:Integer
            1:[string1]:STRING=[in]:String
        }, 
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[1234]:Integer
            1:[string1]:STRING=[hat]:String
        }]]:ArrayValue
    11:[map]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) (has child records) {
            0:[key]:STRING=[chani]:String
            1:[value]:RECORD=[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[5]:Integer
                    1:[string1]:STRING=[chani]:String
                }]:Record
        }, 
        Record (MODIFIED) (is child record) (has child records) {
            0:[key]:STRING=[mauddib]:String
            1:[value]:RECORD=[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[mauddib]:String
                }]:Record
        }]]:ArrayValue
}

-----------------------------------------------
2 records
============================================================
ORC Schema
============================================================
struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:string>>>
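
The schema above is printed as a single line, which makes the nested struct, array, and map types hard to scan. The following is a minimal sketch, using only the Java standard library, of one way to split such a schema string into its top-level fields. The class OrcSchemaFields and the method topLevelFields are hypothetical names introduced for illustration and are not part of DataPipeline. The sketch assumes the schema text has the struct<name:type,...> shape shown in the output, and the sample string in main is a shortened version of that output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DataPipeline: splits an ORC schema string
// of the form struct<name:type,...> into its top-level fields.
class OrcSchemaFields {

    // Returns field name -> ORC type in declaration order.
    // Nested struct/array/map types are kept whole as the field's type string.
    static Map<String, String> topLevelFields(String schema) {
        String prefix = "struct<";
        if (!schema.startsWith(prefix) || !schema.endsWith(">")) {
            throw new IllegalArgumentException("Expected struct<...> but got: " + schema);
        }
        String body = schema.substring(prefix.length(), schema.length() - 1);

        Map<String, String> fields = new LinkedHashMap<>();
        int depth = 0;   // current nesting level of angle brackets
        int start = 0;   // index where the current field begins
        for (int i = 0; i <= body.length(); i++) {
            // Treat the end of the body as a final separator
            char c = i < body.length() ? body.charAt(i) : ',';
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                depth--;
            } else if (c == ',' && depth == 0) {
                String field = body.substring(start, i);
                start = i + 1;
                if (field.isEmpty()) {
                    continue; // nothing between separators, e.g. struct<>
                }
                int colon = field.indexOf(':');
                if (colon < 0) {
                    throw new IllegalArgumentException("Field without a type: " + field);
                }
                fields.put(field.substring(0, colon), field.substring(colon + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened version of the schema printed by the example
        String schema = "struct<int1:int,"
                + "middle:struct<list:array<struct<int1:int,string1:string>>>,"
                + "map:map<string,struct<int1:int,string1:string>>>";
        topLevelFields(schema).forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```

Run against the sample string, this prints one line per top-level field (int1, middle, and map), each followed by its full ORC type. Commas are only treated as field separators at bracket depth zero, which is why the comma inside map<string,...> does not split the map field.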