Read an ORC File from Amazon S3
Updated: Jun 7, 2023
This example shows how to read an ORC file by streaming it directly from Amazon S3 storage.
ORC (Optimized Row Columnar) is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It was originally created for Apache Hive, but it can also be used to store data outside of Hive.
Check out the ORC Examples for more use-cases.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFileFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        Path path = new Path("s3a://bucketName/input.orc");

        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        DataReader reader = new OrcDataReader(path)
                .setConfig(configuration);

        DataWriter writer = new StreamWriter(System.out);

        Job.run(reader, writer);
    }
}
Code Walkthrough
- First, a new Path object is initialized with the location of the input.orc file stored in the Amazon S3 bucket.
- A Configuration object is created so that the connection parameters can be set. The configuration.set() method is called for parameters such as "fs.s3a.access.key" and "fs.s3a.secret.key", using the ACCESS_KEY and SECRET_KEY values declared earlier. The "fs.s3a.impl" parameter is set to the S3AFileSystem class name.
- A DataReader is then created through the OrcDataReader constructor, using the path initialized above.
- The configuration object created above is passed to the reader through setConfig().
- A DataWriter is created through StreamWriter with System.out as its parameter so that the output is displayed on the console.
- Finally, Job.run() is called with the reader and writer to transfer the records from the ORC file to the console.
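The listing hard-codes ACCESS_KEY and SECRET_KEY as placeholder strings. As a minimal sketch of one alternative, the helper below collects the same three "fs.s3a.*" properties into a map, reading the credentials from environment variables. The class name S3aSettings, the method s3aProperties(), and the environment variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions made for illustration; they are not part of DataPipeline or of the original example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of DataPipeline or Hadoop.
class S3aSettings {

    // Collects the three properties that the listing passes to configuration.set().
    static Map<String, String> s3aProperties(String accessKey, String secretKey) {
        if (accessKey == null || accessKey.isEmpty()
                || secretKey == null || secretKey.isEmpty()) {
            throw new IllegalArgumentException(
                    "S3 access key and secret key must both be provided");
        }
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("fs.s3a.access.key", accessKey);
        properties.put("fs.s3a.secret.key", secretKey);
        properties.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        return properties;
    }

    public static void main(String[] args) {
        // Assumed environment variable names; adjust to match your own setup.
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");

        if (accessKey == null || secretKey == null) {
            System.err.println("Credentials are not set in the environment.");
            return;
        }

        // Print only the property names so that secrets never reach the console.
        for (String name : s3aProperties(accessKey, secretKey).keySet()) {
            System.out.println(name);
        }
    }
}
```

In the original example, each entry of the returned map could then be applied with configuration.set(name, value) before the configuration is handed to setConfig(), which keeps the credentials out of the source code.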
Console Output
The data present in the ORC file from the Amazon S3 bucket will be displayed on the console.