Read Parquet from Amazon S3

Updated: Sep 12, 2023

This example shows how to read records from a Parquet file stored in Amazon S3. The same approach applies to data warehousing, analytics, and data science applications. Data Pipeline also supports writing Parquet files to Amazon S3.
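Writing uses the same reader-to-writer Job.run() pattern with the roles reversed. The following is a minimal sketch, not code from the library's documentation: it assumes the CSVReader(File) and ParquetDataWriter(File) constructors and the setFieldNamesInFirstRow() option, and it writes to a local file, leaving the upload to S3 aside.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

public class WriteParquetSketch {

    public static void main(String[] args) throws Throwable {
        // Source records; any DataReader will do (a CSV file is assumed here).
        DataReader reader = new CSVReader(new File("input.csv"))
                .setFieldNamesInFirstRow(true);

        // Target Parquet file on the local file system (hypothetical path).
        DataWriter writer = new ParquetDataWriter(new File("output.parquet"));

        Job.run(reader, writer);
    }
}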

Parquet is an open-source, column-oriented data file format built for efficient data storage and retrieval.


Java Code Listing

package com.northconcepts.datapipeline.examples.amazons3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        // Target Parquet file in the S3 bucket, addressed through the s3a:// scheme.
        Path path = new Path("s3a://bucketName/input.parquet");

        // Hadoop configuration carrying the S3A credentials and file system binding.
        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        // Read records from the Parquet file in S3 and print them to the console.
        DataReader reader = new ParquetDataReader(HadoopInputFile.fromPath(path, configuration));
        DataWriter writer = new StreamWriter(System.out);

        Job.run(reader, writer);
    }
}
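
Note that the s3a:// scheme is implemented by Hadoop's S3A connector, so the hadoop-aws artifact (and the AWS SDK it pulls in) must be on the classpath alongside the Data Pipeline and Parquet libraries.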


Code Walkthrough

  1. First, the AWS credentials are supplied through the ACCESS_KEY and SECRET_KEY fields. They are hard-coded here for simplicity; the sketch after this list shows how to read them from the environment instead.
  2. A Path is created to identify the target Parquet file (i.e. s3a://bucketName/input.parquet in this example) in the Amazon S3 bucket.
  3. A Configuration is created to hold the S3A configuration parameters.
  4. Each configuration.set() call binds the name given in the first parameter to the value given in the second parameter.
  5. A HadoopInputFile is created to read the file at the path specified in the first parameter, using the configuration specified in the second parameter.
  6. A ParquetDataReader is created to read records from the HadoopInputFile object.
  7. Data is transferred from the ParquetDataReader to the console via the Job.run() method. See how to compile and run data pipeline jobs.
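
Hard-coding credentials is fine for a demo but not for production. The following minimal sketch (using the same imports as the listing above) reads them from the environment instead; the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY follow the standard AWS convention, and the rest of the program is unchanged.

    public static void main(String[] args) throws Throwable {
        // Pull credentials from the environment rather than compiling them in.
        String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
        String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
        if (accessKey == null || secretKey == null) {
            throw new IllegalStateException("AWS credentials are not set in the environment");
        }

        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", accessKey);
        configuration.set("fs.s3a.secret.key", secretKey);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        Path path = new Path("s3a://bucketName/input.parquet");
        DataReader reader = new ParquetDataReader(HadoopInputFile.fromPath(path, configuration));
        Job.run(reader, new StreamWriter(System.out));
    }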


Console Output

The records read from the specified Parquet file are printed to the console.
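
To capture the records somewhere other than the console, swap the StreamWriter for another DataWriter. The snippet below is a sketch that assumes Data Pipeline's CSVWriter class and its Writer-based constructor; the reader is unchanged and output.csv is a hypothetical target.

        // Same reader as above; only the writer changes.
        DataReader reader = new ParquetDataReader(HadoopInputFile.fromPath(path, configuration));
        DataWriter writer = new CSVWriter(new java.io.FileWriter("output.csv"));

        Job.run(reader, writer);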
