Read an ORC File from Amazon S3 Using a Temporary File

Updated: Jun 16, 2023

This example shows how to read an ORC file from Amazon S3 by first downloading it to a temporary local file.

ORC (Optimized Row Columnar) is an open-source, column-oriented data file format that offers efficient data storage and retrieval. It was originally developed for Apache Hive and is also supported by other data processing tools.

See other ORC Examples for similar use-cases.

Java Code Listing

package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFileFromAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    private static File orcFile;

    public static void main(String[] args) throws Throwable {
        downloadS3File();
        try {
            DataReader reader = new OrcDataReader(orcFile);
            DataWriter writer = new StreamWriter(System.out);

            Job.run(reader, writer);
        } finally {
            orcFile.delete();
        }
    }

    private static void downloadS3File() throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            orcFile = File.createTempFile("output", ".orc");
            orcFile.deleteOnExit();

            // try-with-resources closes both streams (flushing the buffered output) even if the copy fails
            try (InputStream in = s3.readFile("bucket", "input.orc");
                    OutputStream out = new BufferedOutputStream(new FileOutputStream(orcFile))) {
                byte[] buffer = new byte[1024];
                int lengthRead;
                // read() returns -1 at end of stream
                while ((lengthRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, lengthRead);
                }
            }
        } finally {
            s3.close();
        }
    }
}


 

Code Walkthrough

  1. The downloadS3File() method creates an instance of AmazonS3FileSystem and establishes a connection to the Amazon S3 service using the access key and secret key.
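  2. A temporary file is created using File.createTempFile() with the prefix "output" and the suffix ".orc"; the JVM inserts a generated string between the two, so the actual file name is not exactly "output.orc". The call to deleteOnExit() registers the file for deletion when the JVM exits, as a safety net.
  3. Using the readFile() method, the "input.orc" file in the specified S3 bucket is opened as an InputStream.
  4. An OutputStream is used to write the data from the InputStream into the temporary orcFile.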
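  5. Back in main(), an OrcDataReader reads the temporary file and a StreamWriter writes to System.out; Job.run() transfers the records from the reader to the writer, so the ORC file data is displayed on the console.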
  6. Finally, the temporary file is deleted using .delete() within a finally block.
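
The temporary-file steps above (create, copy, clean up) can also be sketched with the JDK's own Files.copy() in place of a hand-written buffer loop. This is a minimal illustration, not part of the DataPipeline API: the class name TempFileCopySketch and the helper copyToTempFile() are hypothetical, and a ByteArrayInputStream stands in for the stream that s3.readFile() would return, so the sketch runs without AWS credentials.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, for illustration only.
public class TempFileCopySketch {

    // Copies all bytes from 'in' into a newly created temporary file and returns it.
    // The caller remains responsible for closing 'in' and deleting the file.
    static File copyToTempFile(InputStream in, String prefix, String suffix) throws IOException {
        File tempFile = File.createTempFile(prefix, suffix);
        tempFile.deleteOnExit(); // safety net in case the caller never deletes it

        // createTempFile() has already created an empty file, so REPLACE_EXISTING
        // is required; without it Files.copy() throws FileAlreadyExistsException.
        Files.copy(in, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the example above this stream would come from S3.
        byte[] sample = "stand-in bytes".getBytes(StandardCharsets.UTF_8);

        File tempFile;
        try (InputStream in = new ByteArrayInputStream(sample)) {
            tempFile = copyToTempFile(in, "output", ".orc");
        }

        try {
            // The generated name starts with "output" and ends with ".orc".
            System.out.println(tempFile.getName() + " holds " + tempFile.length() + " bytes");
        } finally {
            tempFile.delete();
        }
    }
}
```

Files.copy() reads the stream to its end, which avoids the end-of-stream check and per-chunk flushing of a manual loop. The explicit delete() in the finally block still matters, because deleteOnExit() only takes effect when the JVM shuts down normally.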

 

Console Output

The data in the ORC file read from the Amazon S3 bucket is printed to the console.
