Read Parquet from Amazon S3 using a Temporary File
Updated: Jun 7, 2023
In this example, you will learn how to read a Parquet file by downloading it from Amazon S3 Storage using a temporary file.
Parquet is a column-oriented data file format that offers efficient data storage and retrieval.
See the Parquet Examples for more use-cases.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetFromAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    private static File parquetFile;

    public static void main(String[] args) throws Throwable {
        downloadS3File();

        try {
            DataReader reader = new ParquetDataReader(parquetFile);
            DataWriter writer = new StreamWriter(System.out);

            Job.run(reader, writer);
        } finally {
            parquetFile.delete();
        }
    }

    private static void downloadS3File() throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();

        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            parquetFile = File.createTempFile("output", ".parquet");
            parquetFile.deleteOnExit();

            InputStream in = s3.readFile("bucket", "input.parquet");
            OutputStream out = new BufferedOutputStream(new FileOutputStream(parquetFile));

            byte[] buffer = new byte[1024];
            int lengthRead;
            while ((lengthRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, lengthRead);
                out.flush();
            }
        } finally {
            s3.close();
        }
    }
}
Code Walkthrough
- Execution begins by invoking the downloadS3File() method, which performs the following steps:
  - A new AmazonS3FileSystem object, s3, is created.
  - The S3 access credentials ACCESS_KEY and SECRET_KEY are passed to setBasicAWSCredentials(), and the connection to S3 is then opened with open().
  - The prefix "output" and the suffix ".parquet" are passed to File.createTempFile(), which creates a uniquely named temporary file (its name starts with "output" and ends with ".parquet") in the default temporary-file directory.
  - deleteOnExit() is invoked so that the temporary file is deleted automatically when the JVM terminates normally.
  - An InputStream is obtained by calling readFile() with two arguments: the bucket name "bucket" and the file name "input.parquet".
  - An OutputStream is created by wrapping a FileOutputStream for the temporary file parquetFile in a BufferedOutputStream, which buffers the writes.
  - A 1024-byte buffer is filled from in by read() inside a while loop, and the lengthRead bytes read on each pass are written to out. The loop ends when read() no longer returns a positive count (it returns -1 at the end of the stream).
  - flush() is invoked after each write so that the buffered bytes are pushed to the temporary file.
  - In the finally block, close() is invoked to disconnect from S3.
- In the try block of main(), a DataReader is created by passing parquetFile to the ParquetDataReader constructor.
- Similarly, a DataWriter is created by passing System.out to the StreamWriter constructor so that the data from parquetFile is written to the console.
- Job.run() then transfers the records from the reader to the writer.
- In the finally block, once the records have been written to the console, delete() removes the temporary parquetFile.
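The download step described above is an ordinary stream-to-file copy, so it can also be written with the JDK's java.nio.file API, which closes the streams for you. The following is a minimal sketch under stated assumptions, not part of the DataPipeline library: the class TempFileCopy and the method copyToTempFile() are hypothetical names, and a ByteArrayInputStream stands in for the S3 stream so the example runs without AWS credentials.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of DataPipeline.
class TempFileCopy {

    // Copies everything from 'in' into a new temporary file and returns its path.
    // The input stream is closed here; the caller is responsible for deleting the file.
    static Path copyToTempFile(InputStream in, String prefix, String suffix) throws IOException {
        // Creates an empty, uniquely named file such as <prefix><random><suffix>.
        Path tempFile = Files.createTempFile(prefix, suffix);
        try (InputStream source = in) {
            // Files.copy reads until end of stream. REPLACE_EXISTING is required
            // because createTempFile has already created the (empty) target file.
            Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException | RuntimeException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tempFile);
            throw e;
        }
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream returned by the S3 file system (assumption for this demo).
        byte[] sample = {1, 2, 3, 4, 5};
        Path tempFile = copyToTempFile(new ByteArrayInputStream(sample), "output", ".parquet");
        try {
            System.out.println(tempFile.getFileName() + " holds " + Files.size(tempFile) + " bytes");
        } finally {
            // Explicit cleanup, mirroring the finally block in the listing.
            Files.deleteIfExists(tempFile);
        }
    }
}
```

In the listing's setting, the stream returned by s3.readFile("bucket", "input.parquet") would presumably take the place of the ByteArrayInputStream, and tempFile.toFile() could be passed to the ParquetDataReader constructor. Closing the streams before deleting the file also matters on platforms that refuse to delete a file that is still open.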
Console Output
The records read from the Parquet file in the Amazon S3 bucket are printed to the console.
