Read an ORC File from Amazon S3

Updated: Jun 7, 2023

In this example, you will learn how to read an ORC file by streaming it directly from Amazon S3.

ORC (Optimized Row Columnar) is an open-source, column-oriented data file format that offers efficient data storage and retrieval. It was originally developed for Apache Hive and is also supported by other data processing tools.
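To make "column-oriented" concrete, here is a minimal sketch that holds the same three records in a row layout and in a column layout. The class name, field names, and sample values are hypothetical and exist only for illustration; real ORC files add stripes, indexes, and compression, which this sketch ignores.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not how ORC is implemented.
final class ColumnLayoutSketch {

    // Row-oriented layout: each record keeps all of its fields together.
    static final Object[][] ROWS = {
        { 1, "alice", 30.5 },
        { 2, "bob", 12.0 },
        { 3, "carol", 7.25 },
    };

    // Column-oriented layout: each field is stored as its own array.
    static final int[] IDS = { 1, 2, 3 };
    static final String[] NAMES = { "alice", "bob", "carol" };
    static final double[] AMOUNTS = { 30.5, 12.0, 7.25 };

    // Summing one field from rows has to visit every complete record.
    static double sumAmountsFromRows() {
        double total = 0;
        for (Object[] row : ROWS) {
            total += (Double) row[2];
        }
        return total;
    }

    // Summing one field from columns only touches that field's array.
    static double sumAmountsFromColumns() {
        return Arrays.stream(AMOUNTS).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumAmountsFromRows());    // prints 49.75
        System.out.println(sumAmountsFromColumns()); // prints 49.75
    }
}
```

Both methods return the same total, but the column version never reads the id or name values, which is the property that makes columnar formats efficient for queries that use only a few fields.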

Check out the ORC Examples for more use cases.

Java Code Listing

package com.northconcepts.datapipeline.examples.amazons3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFileFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
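        // Location of the ORC file; the s3a:// scheme selects Hadoop's S3A file system
        Path path = new Path("s3a://bucketName/input.orc");

        // Hadoop settings: S3 credentials and the file system implementation class
        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        // Read records from the ORC file using the configuration above
        DataReader reader = new OrcDataReader(path)
                .setConfig(configuration);
        // Write each record to the console
        DataWriter writer = new StreamWriter(System.out);

        // Transfer all records from the reader to the writer
        Job.run(reader, writer);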
    }
}


Code Walkthrough

  1. A Path object is created that points to the input.orc file in the Amazon S3 bucket, using the s3a:// scheme.
  2. A Configuration object is created to hold the Hadoop file system settings.
  3. configuration.set() is called to supply "fs.s3a.access.key" and "fs.s3a.secret.key" from the ACCESS_KEY and SECRET_KEY constants, and to set "fs.s3a.impl" to the S3AFileSystem class.
  4. A DataReader is created using the OrcDataReader constructor with the path from step 1.
  5. setConfig() is called on the reader with the configuration object from step 2.
  6. A DataWriter is created using StreamWriter with System.out as its argument, so that the output is displayed on the console.
  7. Job.run() transfers the records from the reader to the writer.
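
The path and configuration steps above can be sketched with the JDK alone. This is a minimal sketch, not part of DataPipeline or Hadoop: the class S3aSettingsSketch and its two methods are hypothetical names, and reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption made here to avoid hardcoding credentials, not something the listing does.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not a DataPipeline or Hadoop class.
final class S3aSettingsSketch {

    // Splits an s3a:// location into { bucket, object key }.
    static String[] bucketAndKey(String location) {
        URI uri = URI.create(location);
        if (!"s3a".equals(uri.getScheme()) || uri.getAuthority() == null) {
            throw new IllegalArgumentException("Expected s3a://bucket/key but got: " + location);
        }
        String path = uri.getPath();
        String key = path.startsWith("/") ? path.substring(1) : path;
        return new String[] { uri.getAuthority(), key };
    }

    // Builds the same three settings the listing passes to configuration.set(),
    // taking the credentials from the supplied environment map (an assumption).
    static Map<String, String> settingsFrom(Map<String, String> env) {
        String accessKey = env.get("AWS_ACCESS_KEY_ID");
        String secretKey = env.get("AWS_SECRET_ACCESS_KEY");
        if (accessKey == null || secretKey == null) {
            throw new IllegalStateException(
                    "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must both be set");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("fs.s3a.access.key", accessKey);
        settings.put("fs.s3a.secret.key", secretKey);
        settings.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        return settings;
    }

    public static void main(String[] args) {
        String[] parts = bucketAndKey("s3a://bucketName/input.orc");
        System.out.println("bucket=" + parts[0] + ", key=" + parts[1]);
        // prints "bucket=bucketName, key=input.orc"
    }
}
```

With the Hadoop classes on the classpath, each entry returned by settingsFrom(System.getenv()) could then be applied to the Configuration object with configuration.set(name, value), in place of the ACCESS_KEY and SECRET_KEY constants.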

Console Output

The records read from the ORC file in the Amazon S3 bucket will be displayed on the console.
