Read an ORC File from Amazon S3 Using a Temporary File
Updated: Jun 16, 2023
This example shows how to read an ORC file by first downloading it from Amazon S3 to a temporary file.
ORC (Optimized Row Columnar) is an open-source, column-oriented data file format that offers efficient data storage and retrieval. Originally developed for Apache Hive, it is also used by other data-processing tools.
See other ORC Examples for similar use-cases.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataReader;

public class ReadAnOrcFileFromAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    private static File orcFile;

    public static void main(String[] args) throws Throwable {
        downloadS3File();

        try {
            // Read the downloaded ORC file and print its records to the console
            DataReader reader = new OrcDataReader(orcFile);
            DataWriter writer = new StreamWriter(System.out);

            Job.run(reader, writer);
        } finally {
            // Remove the temporary file whether or not the job succeeded
            orcFile.delete();
        }
    }

    private static void downloadS3File() throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            orcFile = File.createTempFile("output", ".orc");
            orcFile.deleteOnExit();

            // try-with-resources closes both streams so the file can be read and deleted afterwards
            try (InputStream in = s3.readFile("bucket", "input.orc");
                    OutputStream out = new BufferedOutputStream(new FileOutputStream(orcFile))) {
                byte[] buffer = new byte[1024];
                int lengthRead;
                while ((lengthRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, lengthRead);
                }
            }
        } finally {
            s3.close();
        }
    }
}
Code Walkthrough
- The downloadS3File() method creates an instance of AmazonS3FileSystem and establishes a connection to the Amazon S3 service using the access key and secret key.
- A temporary file whose name starts with "output" and ends with ".orc" is created using createTempFile().
- Using the readFile() method, the "input.orc" file in the specified S3 bucket is opened as an InputStream.
- An OutputStream is used to write the data from the InputStream into the temporary orcFile.
- A job is run so that the ORC file data is displayed on the console using a DataReader and DataWriter.
- Finally, the temporary file is deleted using .delete() within a finally block.
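The download-to-temporary-file step can also be tried on its own, without S3 credentials. The following is a minimal sketch using only the Java standard library; it is not part of DataPipeline. The class name TempFileDownloadSketch and the helper copyToTempFile() are hypothetical names chosen for illustration, and a ByteArrayInputStream stands in for the stream that readFile() would return. It uses Files.copy() in place of the manual buffer loop from the listing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileDownloadSketch {

    // Hypothetical helper: copies the stream into a new temporary file and returns its path.
    // The caller is responsible for deleting the file when it is no longer needed.
    public static Path copyToTempFile(InputStream in, String prefix, String suffix) throws IOException {
        Path tempFile = Files.createTempFile(prefix, suffix);
        try {
            // REPLACE_EXISTING is needed because createTempFile has already created an empty file
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
            return tempFile;
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails
            Files.deleteIfExists(tempFile);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in bytes instead of a real ORC file downloaded from S3
        byte[] fakeDownload = "not a real ORC file".getBytes(StandardCharsets.UTF_8);

        Path tempFile;
        try (InputStream in = new ByteArrayInputStream(fakeDownload)) {
            tempFile = copyToTempFile(in, "output", ".orc");
        }

        try {
            // The generated name is not exactly "output.orc"; extra characters are added to make it unique
            System.out.println("Temp file: " + tempFile.getFileName());
            System.out.println("Bytes written: " + Files.size(tempFile));
        } finally {
            // Same cleanup idea as the finally block in the listing
            Files.deleteIfExists(tempFile);
        }
    }
}
```

In the real example, the returned path could be converted with toFile() and passed to OrcDataReader in place of orcFile.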
Console Output
The records in the ORC file downloaded from the Amazon S3 bucket are printed to the console.