Write an ORC File to Amazon S3

Updated: Aug 22, 2023

This example writes records to an ORC (Optimized Row Columnar) file stored in Amazon S3. ORC is a columnar format that offers compact storage and fast analytical queries, which makes it a good fit for large-scale data processing, data warehousing, and analytics. With DataPipeline you get the benefits of ORC's columnar layout along with the scalability and reliability of Amazon S3 for storing your data.

 

Java Code Listing

package com.northconcepts.datapipeline.examples.amazons3;

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.orc.OrcDataWriter;

public class WriteAnOrcFileToAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        Path path = new Path("s3a://bucketName/output.orc");

        Configuration configuration = new Configuration();
        configuration.set("fs.s3a.access.key", ACCESS_KEY);
        configuration.set("fs.s3a.secret.key", SECRET_KEY);
        configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        DataReader reader = new CSVReader(new File("example/data/input/credit-balance.csv"))
                    .setFieldNamesInFirstRow(true);

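        OrcDataWriter writer = new OrcDataWriter(path);
        // Hand the S3 settings to the writer; without this the configuration above is never used.
        // Assumes OrcDataWriter exposes setConfiguration(Configuration), per the class description below.
        writer.setConfiguration(configuration);

        Job.run(reader, writer);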
    }
}

 

Code Walkthrough

  1. First, the Amazon S3 access settings (access key, secret key, and S3A file system implementation) and the target object path are specified.
  2. A CSVReader is created for the input file credit-balance.csv, with field names taken from the first row.
  3. An OrcDataWriter is created to write records to the ORC file output.orc at that path.
  4. Records are transferred from the reader to the OrcDataWriter via the Job.run() method. See how to compile and run data pipeline jobs.
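
Step 1 hard-codes the access and secret keys in the class. As a hedged alternative, the sketch below collects the same three fs.s3a.* settings from an environment-style map, so the keys can stay out of source control. The class S3aSettings, its fromEnvironment method, and the choice of the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions made for illustration; none of them are part of DataPipeline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of DataPipeline): builds the fs.s3a.* settings used in the listing. */
final class S3aSettings {

    /** Builds the settings from an environment-style map, failing fast if a key is missing. */
    static Map<String, String> fromEnvironment(Map<String, String> env) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("fs.s3a.access.key", require(env, "AWS_ACCESS_KEY_ID"));
        settings.put("fs.s3a.secret.key", require(env, "AWS_SECRET_ACCESS_KEY"));
        settings.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        return settings;
    }

    /** Returns the named variable, or throws if it is absent or blank. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }
}
```

Calling S3aSettings.fromEnvironment(System.getenv()) returns a map whose entries can each be copied into the Hadoop Configuration with configuration.set(key, value), in place of the ACCESS_KEY and SECRET_KEY constants.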

 

OrcDataWriter

Writes records to Apache ORC columnar files. It extends the IntegrationWriter class and can be created from a File or Path object. With this class you can also specify the configuration, schema, and batch size (default 1024) of the output ORC file.
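
To make the batch size setting more concrete, here is a small standard-library sketch of fixed-size batching. It assumes that the batch size is the maximum number of records grouped together before they are written as one unit; that reading is an assumption, not something the description above confirms. RowBatcher and its members are hypothetical names, and the code does not call DataPipeline or ORC.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of fixed-size batching; not DataPipeline or ORC code. */
final class RowBatcher {

    /** Mirrors the default batch size named in the description above. */
    static final int DEFAULT_BATCH_SIZE = 1024;

    /** Splits rows into consecutive batches of at most batchSize rows each. */
    static <T> List<List<T>> batches(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // Copy the slice so each batch is independent of the source list.
            result.add(new ArrayList<>(rows.subList(start, end)));
        }
        return result;
    }
}
```

Under this assumption, 2,500 input records at the default size would be grouped into two full batches of 1,024 and a final batch of 452. A larger batch size means fewer, larger groups at the cost of more memory held per batch.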

 
