Read Parquet from Amazon S3 using a Temporary File

Updated: Jun 7, 2023

This example shows how to read a Parquet file stored in Amazon S3 by first downloading it to a temporary file on the local disk.

Parquet is a column-oriented data file format that offers efficient data storage and retrieval.
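To make "column-oriented" concrete, here is a minimal in-memory sketch in Java. It is only an illustration of the idea and does not reproduce Parquet's actual on-disk format; the class name ColumnLayoutSketch, its fields, and the sample values are all hypothetical.

```java
import java.util.Arrays;

class ColumnLayoutSketch {

    // Row-oriented layout: each record keeps all of its fields together.
    static final Object[][] ROWS = {
        {1, "Alice", 30.5},
        {2, "Bob", 12.0},
        {3, "Carol", 7.25},
    };

    // Column-oriented layout: each field is stored as its own contiguous array.
    static final int[] IDS = {1, 2, 3};
    static final String[] NAMES = {"Alice", "Bob", "Carol"};
    static final double[] AMOUNTS = {30.5, 12.0, 7.25};

    // Summing one field in the row layout walks through every whole record.
    static double sumAmountsFromRows() {
        double total = 0;
        for (Object[] row : ROWS) {
            total += (Double) row[2];
        }
        return total;
    }

    // In the column layout only the one relevant array is touched.
    static double sumAmountsFromColumns() {
        return Arrays.stream(AMOUNTS).sum();
    }
}
```

Both methods return the same total. The difference is how much unrelated data has to be visited to compute it, which is the property that makes columnar formats attractive for analytical reads.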

See the Parquet Examples for more use-cases.

Java Code Listing

package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetFromAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    private static File parquetFile;

    public static void main(String[] args) throws Throwable {
        downloadS3File();
        try {

            DataReader reader = new ParquetDataReader(parquetFile);
            DataWriter writer = new StreamWriter(System.out);

            Job.run(reader, writer);
        } finally {
            parquetFile.delete();
        }
    }

    private static void downloadS3File() throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            parquetFile = File.createTempFile("output", ".parquet");
            parquetFile.deleteOnExit();

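            // try-with-resources closes both streams, even if the copy fails
            try (InputStream in = s3.readFile("bucket", "input.parquet");
                    OutputStream out = new BufferedOutputStream(new FileOutputStream(parquetFile))) {
                byte[] buffer = new byte[1024];
                int lengthRead;
                // read() returns -1 at the end of the stream, which ends the loop
                while ((lengthRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, lengthRead);
                }
                // write any bytes still buffered to the temporary file
                out.flush();
            }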
        } finally {
            s3.close();
        }
    }
}


Code Walkthrough

  1. Execution begins by invoking downloadS3File(), which performs the following steps:
    • A new AmazonS3FileSystem object, s3, is created.
    • The S3 credentials ACCESS_KEY and SECRET_KEY are passed to setBasicAWSCredentials(), and the connection to S3 is then opened with open().
    • The prefix "output" and the suffix ".parquet" are passed to File.createTempFile(), which creates an empty temporary file whose name starts with "output", ends with ".parquet", and contains generated characters in between.
    • deleteOnExit() registers the temporary file for deletion when the JVM terminates normally, as a fallback in case the explicit delete() call is never reached.
    • An InputStream is obtained by calling readFile() with two arguments: the bucket name "bucket" and the file name "input.parquet".
    • An OutputStream is created by wrapping a FileOutputStream for the temporary file parquetFile in a BufferedOutputStream, so that writes are buffered.
    • A 1024-byte buffer is created, and a while loop fills it from the input stream in using read().
    • On each pass, the lengthRead bytes just read are written to the output stream out. The loop ends when read() has no more bytes to return (it returns -1 at the end of the stream).
    • flush() is invoked so that any bytes still held in the BufferedOutputStream are written to the temporary file.
    • In the finally block, close() is invoked to disconnect from S3, whether or not the download succeeded.
  2. Back in main(), inside the try block, a DataReader is created by passing parquetFile to the ParquetDataReader constructor.
  3. Similarly, a DataWriter is created by passing System.out to the StreamWriter constructor, so that the records read from parquetFile are printed to the console. Job.run() then transfers the records from the reader to the writer.
  4. In the finally block, once the job has finished (or failed), delete() removes the temporary parquetFile.
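
The download steps above copy the stream by hand. As a hedged alternative, the JDK's Files.copy(InputStream, Path, CopyOption...) can perform the same copy, and try-with-resources closes the input stream even if the copy fails. The sketch below is not part of the DataPipeline example: TempFileCopySketch and copyToTempFile are hypothetical names, and a ByteArrayInputStream stands in for the stream returned by s3.readFile().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileCopySketch {

    // Copies everything from 'source' into a new temporary file and returns its path.
    // The caller is responsible for deleting the file when it is no longer needed.
    static Path copyToTempFile(InputStream source, String prefix, String suffix) throws IOException {
        Path tempFile = Files.createTempFile(prefix, suffix);
        try (InputStream in = source) {
            // REPLACE_EXISTING is required because createTempFile has already
            // created an empty file at this path.
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tempFile);
            throw e;
        }
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: InputStream in = s3.readFile("bucket", "input.parquet");
        byte[] fakeContent = "not real parquet bytes".getBytes(StandardCharsets.UTF_8);
        Path tempFile = copyToTempFile(new ByteArrayInputStream(fakeContent), "output", ".parquet");
        try {
            System.out.println("Copied " + Files.size(tempFile) + " bytes to " + tempFile.getFileName());
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}
```

The printed file name starts with "output" and ends with ".parquet", with generated characters in between. To hand the result to a constructor that expects a java.io.File, as ParquetDataReader does in the listing above, call tempFile.toFile().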

Console Output

The records in the Parquet file read from the Amazon S3 bucket are printed to the console.
