Read Parquet from Amazon S3 using a Temporary File
Updated: Jun 7, 2023
In this example, you will learn how to read a Parquet file by downloading it from Amazon S3 Storage using a temporary file.
Parquet is a column-oriented data file format that offers efficient data storage and retrieval.
See the Parquet Examples for more use-cases.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetFromAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    private static File parquetFile;

    public static void main(String[] args) throws Throwable {
        downloadS3File();

        try {
            DataReader reader = new ParquetDataReader(parquetFile);
            DataWriter writer = new StreamWriter(System.out);

            Job.run(reader, writer);
        } finally {
            parquetFile.delete();
        }
    }

    private static void downloadS3File() throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            parquetFile = File.createTempFile("output", ".parquet");
            parquetFile.deleteOnExit();

            // try-with-resources closes both streams, even if the copy fails
            try (InputStream in = s3.readFile("bucket", "input.parquet");
                 OutputStream out = new BufferedOutputStream(new FileOutputStream(parquetFile))) {
                byte[] buffer = new byte[1024];
                int lengthRead;
                while ((lengthRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, lengthRead);
                    out.flush();
                }
            }
        } finally {
            s3.close();
        }
    }
}
Code Walkthrough
- Execution begins by invoking the downloadS3File() method, which performs the following steps:
  - A new AmazonS3FileSystem object, s3, is created.
  - The S3 access credentials ACCESS_KEY and SECRET_KEY are passed to the setBasicAWSCredentials() method, and the connection to S3 is then opened using the open() method.
  - The prefix "output" and the suffix ".parquet" are passed to the createTempFile() method, which creates a temporary file whose generated name starts with output and ends with .parquet.
  - deleteOnExit() is invoked so that the temporary file is deleted automatically when the program exits.
  - An InputStream is created to read the file from S3 by invoking the readFile() method with two arguments: the bucket name "bucket" and the file name "input.parquet".
  - An OutputStream is created by wrapping a FileOutputStream for the temporary file object parquetFile in a BufferedOutputStream, which allows buffered writes.
  - A buffer of 1024 bytes is created and filled in a while loop by calling the read() method on the input stream in.
  - The bytes read into the buffer are written to the output stream out. The loop ends when lengthRead is no longer greater than 0, which happens when read() returns -1 at the end of the stream.
  - The flush() method is invoked after each write so that the buffered bytes are pushed to the temporary file.
  - In the finally block, the close() method is invoked to disconnect from S3.
- In the try block of main(), a DataReader object is created by passing parquetFile to the ParquetDataReader constructor.
- Similarly, a DataWriter object is created by passing System.out to the StreamWriter constructor so that the data from parquetFile shows up on the console.
- Job.run() transfers the records from the reader to the writer.
- In the finally block, after the job finishes writing the output to the console, the delete() method removes the temporary parquetFile.
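The download step described above can be isolated into a small helper that uses only the JDK. This is a minimal sketch, not part of the DataPipeline API: the class name TempFileDownload, the method copyToTempFile(), and the 8192-byte buffer size are assumptions made for illustration, and a ByteArrayInputStream stands in for the stream returned by readFile(). It shows the same read/write loop, with the end of the stream detected by the -1 return value and both streams closed by try-with-resources.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TempFileDownload {

    // Hypothetical helper (not a DataPipeline method): copies any
    // InputStream into a newly created temporary file and returns it.
    static File copyToTempFile(InputStream source, String prefix, String suffix) throws IOException {
        File tempFile = File.createTempFile(prefix, suffix);
        tempFile.deleteOnExit(); // safety net in case the caller never deletes it

        // Both streams are closed automatically, even if the copy fails.
        try (InputStream in = source;
             OutputStream out = new BufferedOutputStream(new FileOutputStream(tempFile))) {
            byte[] buffer = new byte[8192];
            int lengthRead;
            // read() returns -1 once the end of the stream is reached
            while ((lengthRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, lengthRead);
            }
        } // closing the BufferedOutputStream flushes any remaining bytes

        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the S3 input stream, so the sketch runs without credentials.
        byte[] sample = "sample bytes".getBytes(StandardCharsets.UTF_8);
        File file = copyToTempFile(new ByteArrayInputStream(sample), "output", ".parquet");
        try {
            System.out.println(file.getName() + ": " + file.length() + " bytes written");
        } finally {
            file.delete();
        }
    }
}
```

Closing the output stream before the file is handed to a reader also matters on platforms where an open file cannot be deleted. Where no custom loop is needed, the JDK's Files.copy(in, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING) performs the same copy in one call.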
Console Output
The records in the Parquet file downloaded from the Amazon S3 bucket are printed to the console.