Read an ORC File from Amazon S3
Updated: Jun 7, 2023
In this example, you will learn how to read an ORC file by streaming it directly from Amazon S3.
ORC (Optimized Row Columnar) is an open-source, column-oriented data file format that offers efficient data storage and retrieval. Originally released for Apache Hive, it can also store other types of data.
Check out the ORC Examples for more use-cases.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFileFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        Path path = new Path("s3a://bucketName/input.orc");

        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        DataReader reader = new OrcDataReader(path)
                .setConfig(configuration);
        DataWriter writer = new StreamWriter(System.out);

        Job.run(reader, writer);
    }
}
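The listing hard-codes the object location as s3a://bucketName/input.orc. As a minimal sketch, assuming you would rather assemble that location from a separate bucket name and object key, the helper below builds the same string. The class S3aUri and its method of() are hypothetical names used for illustration; they are not part of DataPipeline or Hadoop, and the bucket check is deliberately simple rather than a full implementation of the S3 naming rules.

```java
import java.util.Objects;

/** Hypothetical helper (not part of DataPipeline or Hadoop) that assembles an s3a:// URI. */
final class S3aUri {

    private S3aUri() {
    }

    /** Returns "s3a://<bucket>/<key>", dropping any leading slashes from the key. */
    static String of(String bucket, String key) {
        Objects.requireNonNull(bucket, "bucket");
        Objects.requireNonNull(key, "key");
        // Simplified check (an assumption of this sketch): non-empty and no path separators.
        if (bucket.isEmpty() || bucket.contains("/")) {
            throw new IllegalArgumentException("Invalid bucket name: " + bucket);
        }
        String normalizedKey = key.replaceFirst("^/+", "");
        if (normalizedKey.isEmpty()) {
            throw new IllegalArgumentException("Object key must not be empty");
        }
        return "s3a://" + bucket + "/" + normalizedKey;
    }

    public static void main(String[] args) {
        // Prints: s3a://bucketName/input.orc
        System.out.println(S3aUri.of("bucketName", "input.orc"));
    }
}
```

In the listing, the result could then be passed to the Path constructor, for example new Path(S3aUri.of("bucketName", "input.orc")).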
Code Walkthrough
- At the start of execution, a new Path object is initialized with the location of the input.orc file stored in the Amazon S3 bucket.
- A Configuration object is created so that multiple configuration parameters can be set.
- The configuration.set() method is called with parameters such as "fs.s3a.access.key" and "fs.s3a.secret.key", using the ACCESS_KEY and SECRET_KEY values defined earlier.
- A DataReader is then created through the OrcDataReader constructor with the path initialized above.
- The configuration object created above is passed to setConfig().
- A DataWriter is created through StreamWriter with System.out as its parameter so that the output is displayed on the console.
- Job.run() is called with the reader and the writer to transfer the data from the ORC file to the console.
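The configuration steps above can also be sketched as a small helper that collects the three S3A properties in one place and reads the credentials from the environment instead of hard-coded constants. This is only an illustration under stated assumptions: the class S3aSettings and its methods are hypothetical, and the sketch assumes the credentials are exported under the conventional AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The property names and the file system class are the ones used in the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that gathers the S3A properties set in the listing. */
final class S3aSettings {

    private S3aSettings() {
    }

    /** Builds the three properties from explicit credentials. */
    static Map<String, String> of(String accessKey, String secretKey) {
        if (accessKey == null || accessKey.isEmpty()
                || secretKey == null || secretKey.isEmpty()) {
            throw new IllegalArgumentException("Both the access key and the secret key are required");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("fs.s3a.access.key", accessKey);
        settings.put("fs.s3a.secret.key", secretKey);
        settings.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        return settings;
    }

    /** Reads the credentials from an environment map (variable names are an assumption). */
    static Map<String, String> fromEnvironment(Map<String, String> environment) {
        return of(environment.get("AWS_ACCESS_KEY_ID"),
                environment.get("AWS_SECRET_ACCESS_KEY"));
    }

    public static void main(String[] args) {
        // Prints only the property names so that secrets never reach the console.
        Map<String, String> settings = of("YOUR ACCESS KEY", "YOUR SECRET KEY");
        settings.keySet().forEach(System.out::println);
    }
}
```

With the Hadoop Configuration object from the listing, the map could be applied in a single call such as S3aSettings.fromEnvironment(System.getenv()).forEach(configuration::set), which replaces the three separate configuration.set() calls.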
Console Output
The data present in the ORC file from the Amazon S3 bucket will be displayed on the console.
