Read Parquet from Amazon S3
Updated: Sep 12, 2023
This example shows how to read records from a Parquet file stored in Amazon S3. You can use this pattern in data warehousing, analytics, and data science applications. Data Pipeline also supports writing Parquet files to Amazon S3.
Parquet is an open-source, column-oriented data file format built for efficient data storage and retrieval.
Java Code Listing
package com.northconcepts.datapipeline.examples.amazons3;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        Path path = new Path("s3a://bucketName/input.parquet");

        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        DataReader reader = new ParquetDataReader(HadoopInputFile.fromPath(path, configuration));
        DataWriter writer = new StreamWriter(System.out);

        Job.run(reader, writer);
    }

}
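Note that reading from S3 requires Hadoop's S3A connector in addition to the DataPipeline library: the org.apache.hadoop.fs.s3a.S3AFileSystem class ships in the hadoop-aws artifact (which pulls in the AWS SDK), and HadoopInputFile comes from parquet-hadoop. The exact versions to pair with your DataPipeline release may vary.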
Code Walkthrough
- First, the access credentials are specified using the ACCESS_KEY and SECRET_KEY fields.
- A Path is created to identify the target Parquet file in the Amazon S3 bucket (s3a://bucketName/input.parquet in this example).
- A Configuration is created to hold the connection settings. Each configuration.set() call binds the property name given in the first parameter to the value given in the second parameter (an alternative, provider-based credential setup is sketched after this list).
- A HadoopInputFile is created to read the file at the path given in the first parameter, using the configuration given in the second parameter.
- A ParquetDataReader is created to read records from the HadoopInputFile object.
- Records are transferred from the ParquetDataReader to the console via the Job.run() method. See how to compile and run data pipeline jobs.
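Hard-coding access keys is convenient for a quick test, but in most deployments you would let the S3A connector discover credentials itself. Below is a minimal sketch, assuming the AWS SDK v1 is on the classpath (it is a dependency of hadoop-aws); the fs.s3a.* property names are standard Hadoop S3A settings rather than part of the DataPipeline API.

// Delegate credential lookup to the AWS default provider chain
// (environment variables, EC2 instance profiles, shared config files, etc.)
// instead of embedding keys in source code.
Configuration configuration = new Configuration();
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
configuration.set("fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain");

DataReader reader = new ParquetDataReader(
        HadoopInputFile.fromPath(new Path("s3a://bucketName/input.parquet"), configuration));

The rest of the job (the StreamWriter and the Job.run() call) is unchanged from the listing above.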
Console Output
The records read from the specified Parquet file are printed to the console.