Read an ORC File
Updated: Aug 22, 2023
This example shows how to read ORC (Optimized Row Columnar) files using OrcDataReader. ORC is a columnar storage file format designed for processing large datasets, so reading it directly is useful in big data analytics, data warehousing, and other data-intensive applications that keep structured data in ORC format. The example streams every record in the file to the console and then prints the file's ORC schema.
Java Code Listing
package com.northconcepts.datapipeline.examples.orc;

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFile {

    public static void main(String[] args) {
        OrcDataReader reader = new OrcDataReader(new File("example/data/input/input_orc_file.orc"));
        Job.run(reader, new StreamWriter(System.out));

        System.out.println("============================================================");
        System.out.println("ORC Schema");
        System.out.println("============================================================");

        System.out.println(reader.getSchema());
    }
}
Code Walkthrough
- An OrcDataReader is created for the input file input_orc_file.orc.
- Job.run() is used to transfer the data from reader to StreamWriter(System.out). See how to compile and run data pipeline jobs.
- The schema of the ORC file is accessed via reader.getSchema() and printed on the console.
Console Output
-----------------------------------------------
0 - Record (MODIFIED) (has child records) {
    0:[boolean1]:BOOLEAN=[false]:Boolean
    1:[byte1]:BYTE=[1]:Byte
    2:[short1]:SHORT=[1024]:Short
    3:[int1]:INT=[65536]:Integer
    4:[long1]:LONG=[9223372036854775807]:Long
    5:[float1]:FLOAT=[1.0]:Float
    6:[double1]:DOUBLE=[-15.0]:Double
    7:[bytes1]:BLOB=[[0...4]]:[B
    8:[string1]:STRING=[hi]:String
    9:[middle]:RECORD=[
        Record (MODIFIED) (is child record) (has child records) {
            0:[list]:ARRAY of RECORD=[[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[bye]:String
                },
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[2]:Integer
                    1:[string1]:STRING=[sigh]:String
                }]]:ArrayValue
        }]:Record
    10:[list]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[3]:Integer
            1:[string1]:STRING=[good]:String
        },
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[4]:Integer
            1:[string1]:STRING=[bad]:String
        }]]:ArrayValue
    11:[map]:STRING=[null]
}
-----------------------------------------------
1 - Record (MODIFIED) (has child records) {
    0:[boolean1]:BOOLEAN=[true]:Boolean
    1:[byte1]:BYTE=[100]:Byte
    2:[short1]:SHORT=[2048]:Short
    3:[int1]:INT=[65536]:Integer
    4:[long1]:LONG=[9223372036854775807]:Long
    5:[float1]:FLOAT=[2.0]:Float
    6:[double1]:DOUBLE=[-5.0]:Double
    7:[bytes1]:BLOB=[[]]:[B
    8:[string1]:STRING=[bye]:String
    9:[middle]:RECORD=[
        Record (MODIFIED) (is child record) (has child records) {
            0:[list]:ARRAY of RECORD=[[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[bye]:String
                },
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[2]:Integer
                    1:[string1]:STRING=[sigh]:String
                }]]:ArrayValue
        }]:Record
    10:[list]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[100000000]:Integer
            1:[string1]:STRING=[cat]:String
        },
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[-100000]:Integer
            1:[string1]:STRING=[in]:String
        },
        Record (MODIFIED) (is child record) {
            0:[int1]:INT=[1234]:Integer
            1:[string1]:STRING=[hat]:String
        }]]:ArrayValue
    11:[map]:ARRAY of RECORD=[[
        Record (MODIFIED) (is child record) (has child records) {
            0:[key]:STRING=[chani]:String
            1:[value]:RECORD=[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[5]:Integer
                    1:[string1]:STRING=[chani]:String
                }]:Record
        },
        Record (MODIFIED) (is child record) (has child records) {
            0:[key]:STRING=[mauddib]:String
            1:[value]:RECORD=[
                Record (MODIFIED) (is child record) {
                    0:[int1]:INT=[1]:Integer
                    1:[string1]:STRING=[mauddib]:String
                }]:Record
        }]]:ArrayValue
}
-----------------------------------------------
2 records
============================================================
ORC Schema
============================================================
struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:string>>>
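Reading the Schema Output

The schema string printed at the end of the console output is dense, so it can help to break it into its top-level fields and compare them with the fields of each record. The following is a minimal sketch in plain Java, not part of the example or of the library: the class OrcSchemaFields and its method topLevelFields are hypothetical names introduced here, and the code assumes the schema is a single struct<...> string with plain, unquoted field names like the one shown above. For general-purpose parsing, the Apache ORC library's own TypeDescription class is likely the better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the example above.
public class OrcSchemaFields {

    /**
     * Splits "struct<name:type,...>" into top-level name -> type pairs.
     * Commas inside nested <...> types are ignored, so nested types stay intact.
     * Assumes plain field names (no quoting or escaping).
     */
    static Map<String, String> topLevelFields(String schema) {
        String prefix = "struct<";
        if (!schema.startsWith(prefix) || !schema.endsWith(">")) {
            throw new IllegalArgumentException("Expected struct<...>: " + schema);
        }
        String body = schema.substring(prefix.length(), schema.length() - 1);

        Map<String, String> fields = new LinkedHashMap<>();
        int depth = 0; // current nesting level of '<' ... '>'
        int start = 0; // index where the current field begins
        for (int i = 0; i <= body.length(); i++) {
            // Treat the end of the body as a final separator to flush the last field.
            char c = i < body.length() ? body.charAt(i) : ',';
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                depth--;
            } else if (c == ',' && depth == 0) {
                String field = body.substring(start, i);
                if (!field.isEmpty()) {
                    int colon = field.indexOf(':');
                    if (colon < 0) {
                        throw new IllegalArgumentException("Missing ':' in field: " + field);
                    }
                    fields.put(field.substring(0, colon), field.substring(colon + 1));
                }
                start = i + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Schema string copied from the console output of the example.
        String schema = "struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,"
                + "long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,"
                + "middle:struct<list:array<struct<int1:int,string1:string>>>,"
                + "list:array<struct<int1:int,string1:string>>,"
                + "map:map<string,struct<int1:int,string1:string>>>";

        // Print one "name -> type" line per top-level field.
        topLevelFields(schema).forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```

Run against the schema above, this lists twelve top-level fields, matching fields 0 through 11 of each record in the console output. Comparing the two also shows how the ORC types surface in the printed records, for example tinyint as BYTE, smallint as SHORT, bigint as LONG, and binary as BLOB.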