Write Parquet to Amazon S3 Using a Temporary File

Updated: Jun 1, 2023

This example demonstrates how to read a local CSV file, convert it to a temporary Parquet file, and then upload that temporary file to an Amazon S3 bucket.

CSV Input

101,Hewitt,Jennifer Love,0,17000,25-05-1985,B

Java Code

/*
 * Copyright (c) 2006-2022 North Concepts Inc.  All rights reserved.
 * Proprietary and Confidential.  Use is subject to license terms.
 * https://northconcepts.com/data-pipeline/licensing/
 */
package com.northconcepts.datapipeline.examples.amazons3;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

public class WriteParquetToAmazonS3UsingATemporaryFile {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {

        File parquetFile = File.createTempFile("credit-balance", ".parquet");
        parquetFile.deleteOnExit();

        try {
            DataReader reader = new CSVReader(new File("example/data/input/credit-balance.csv"));
            ParquetDataWriter writer = new ParquetDataWriter(parquetFile);

            Job.run(reader, writer);

            uploadFileToS3(parquetFile);

        } finally {
            parquetFile.delete();
        }
    }

    private static void uploadFileToS3(File parquetFile) throws Throwable {
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        try {
            s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
            s3.open();

            OutputStream out = s3.writeMultipartFile("datapipeline-test-01", "output/credit-balance.parquet");
            InputStream in = new BufferedInputStream(new FileInputStream(parquetFile));

            byte[] buffer = new byte[1024];
            int lengthRead;
            while ((lengthRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, lengthRead);
                out.flush();
            }

            in.close();
            out.close();
        } finally {
            s3.close();
        }
    }

}
Code Walkthrough

  1. A temporary Parquet file is created and set to delete itself on exit.
  2. A CSVReader is created to read from the local file credit-balance.csv.
  3. A ParquetDataWriter is then created, taking the temporary file as its output.
  4. The reader from step 2 and the writer from step 3 are run inside a Job, writing the CSV data to the temporary Parquet file.
  5. The uploadFileToS3 method streams the temporary file to the S3 bucket through a buffered input stream and a multipart upload output stream.
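The temp-file and buffered copy-loop pattern from the walkthrough can be sketched without any S3 or Data Pipeline dependency. The class below (its name and sample bytes are illustrative, not part of the original example) uses only java.io: it creates a self-deleting temporary file, writes some stand-in bytes to it, and then runs the same read/write loop the example uses, targeting an in-memory stream in place of s3.writeMultipartFile(...):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class TempFileCopySketch {

    public static void main(String[] args) throws Exception {
        // Step 1: create a temporary file that cleans itself up on JVM exit.
        File tempFile = File.createTempFile("credit-balance", ".parquet");
        tempFile.deleteOnExit();

        // Stand-in for the Parquet bytes the Job would have written.
        byte[] content = "101,Hewitt,Jennifer Love".getBytes(StandardCharsets.UTF_8);
        try (OutputStream fileOut = new FileOutputStream(tempFile)) {
            fileOut.write(content);
        }

        // Step 5: the same buffered copy loop as the example, with an
        // in-memory destination standing in for the S3 multipart stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new BufferedInputStream(new FileInputStream(tempFile))) {
            byte[] buffer = new byte[1024];
            int lengthRead;
            while ((lengthRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, lengthRead);
            }
        }

        System.out.println(out.size() == content.length ? "copy ok" : "copy failed");
    }
}
```

Because the copy loop only touches InputStream and OutputStream, swapping the in-memory destination for the stream returned by s3.writeMultipartFile leaves the loop unchanged.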