Write Parquet to Amazon S3 Using a Temporary File
Updated: Jun 1, 2023
This example demonstrates how to read a local CSV file, convert it to a temporary Parquet file, and then upload that file to an Amazon S3 bucket.
CSV Input
Account,LastName,FirstName,Balance,CreditLimit,AccountCreated,Rating
101,Reeves,Keanu,9315.45,10000,17-01-1998,A
312,Butler,Gerard,90,1000,06-08-2003,B
101,Hewitt,Jennifer Love,0,17000,25-05-1985,B
312,Pinkett-Smith,Jada,49654.87,100000,05-12-2006,A
317,Murray,Bill,789.65,5000,05-02-2007,C
317,Murray,Bill,1,5000,05-02-2007,D
Java Code
/*
 * Copyright (c) 2006-2022 North Concepts Inc.  All rights reserved.
 * Proprietary and Confidential.  Use is subject to license terms.
 *
 * https://northconcepts.com/data-pipeline/licensing/
 */
package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

public class WriteParquetToAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        // create the temporary Parquet file and ensure it is cleaned up on JVM exit
        File parquetFile = File.createTempFile("credit-balance", ".parquet");
        parquetFile.deleteOnExit();
        try {
            // read the local CSV, treating the first row as field names
            DataReader reader = new CSVReader(new File("example/data/input/credit-balance.csv"))
                    .setFieldNamesInFirstRow(true);
            ParquetDataWriter writer = new ParquetDataWriter(parquetFile);

            // convert the CSV records to Parquet in the temporary file
            Job.run(reader, writer);

            uploadFileToS3(parquetFile);
        } finally {
            parquetFile.delete();
        }
    }

    private static void uploadFileToS3(File parquetFile) throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            OutputStream out = s3.writeMultipartFile("datapipeline-test-01", "output/credit-balance.parquet");
            InputStream in = new BufferedInputStream(new FileInputStream(parquetFile));

            // copy the temporary file to S3 in 1 KB chunks
            byte[] buffer = new byte[1024];
            int lengthRead;
            while ((lengthRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, lengthRead);
            }

            // close both streams; closing the output stream completes the multipart upload
            in.close();
            out.close();
        } finally {
            s3.close();
        }
    }

}
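Because closing the output stream is what finalizes the multipart upload, it is worth guaranteeing that close even when the copy fails part-way. Below is a minimal sketch of the same uploadFileToS3 method using try-with-resources and the JDK's java.nio.file.Files.copy in place of the manual buffer loop; everything else is unchanged from the example above.

    private static void uploadFileToS3(File parquetFile) throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            // try-with-resources closes the stream (and completes the upload)
            // even if the copy throws an exception
            try (OutputStream out = s3.writeMultipartFile("datapipeline-test-01", "output/credit-balance.parquet")) {
                // Files.copy performs the buffered read/write loop for us
                java.nio.file.Files.copy(parquetFile.toPath(), out);
            }
        } finally {
            s3.close();
        }
    }

In real deployments, the hard-coded ACCESS_KEY and SECRET_KEY placeholders would normally come from System.getenv or another credentials source rather than being committed to source code.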
Code Walkthrough
- A temporary Parquet file is created and set to delete itself on exit.
- A CSVReader is created to read the local file credit-balance.csv, taking field names from its first row.
- A ParquetDataWriter is created to write to the temporary file.
- The reader and writer above are run inside a Job, writing the CSV records to the temporary Parquet file (a verification sketch follows this list).
- The temporary file is uploaded to S3 as a multipart upload.
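To confirm the conversion before uploading, the temporary Parquet file can be read back and printed to the console. Below is a minimal sketch, assuming DataPipeline's ParquetDataReader and StreamWriter.newSystemOutWriter() (classes from the same library, used in other DataPipeline examples); it would go right after Job.run(reader, writer) in main.

    import com.northconcepts.datapipeline.core.StreamWriter;
    import com.northconcepts.datapipeline.parquet.ParquetDataReader;

    // read the records back from the temporary Parquet file and print them
    Job.run(new ParquetDataReader(parquetFile), StreamWriter.newSystemOutWriter());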