Read from Amazon S3

Updated: Jun 29, 2023

This example shows how to use Data Pipeline to read files from an Amazon S3 bucket, Amazon Web Services' (AWS) widely used cloud object storage service. It provides a convenient and efficient way to access and process data stored in S3.

Real-life use cases for this example vary across domains and industries. In data analytics and business intelligence, for instance, the library can extract data from S3 buckets for further analysis and reporting. Data engineers can use it to build pipelines that read and process large volumes of data stored in S3.


Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.InputStream;
import java.io.InputStreamReader;

import com.northconcepts.datapipeline.amazons3.AmazonS3FileSystem;
import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.NullWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;

public class ReadFromAmazonS3 {

    private static final String ACCESS_KEY = "YOUR ACCESS KEY";
    private static final String SECRET_KEY = "YOUR SECRET KEY";

    public static void main(String[] args) throws Throwable {
        // Connect to Amazon S3 using basic access/secret key credentials
        AmazonS3FileSystem s3 = new AmazonS3FileSystem();
        s3.setBasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
        s3.open();
        try {
            // Open a stream to the target file in the given bucket
            InputStream inputStream = s3.readFile("datapipeline-test-01", "output/trades.csv");

            // Parse the stream as CSV and discard the records with a NullWriter;
            // swap in StreamWriter.newSystemOutWriter() to print them instead
            DataReader reader = new CSVReader(new InputStreamReader(inputStream));
//            DataWriter writer = StreamWriter.newSystemOutWriter();
            DataWriter writer = new NullWriter();
            Job.run(reader, writer);

            // Report how many records were transferred
            System.out.println("Records read: " + writer.getRecordCount());
        } finally {
            s3.close();
        }
    }

}


Code Walkthrough

  1. Execution begins by initializing an AmazonS3FileSystem instance with the basic credentials ACCESS_KEY and SECRET_KEY.
  2. A connection to the Amazon S3 file system is established with the open() method.
  3. The readFile() method is invoked with the bucket "datapipeline-test-01" and the file path "output/trades.csv", returning an InputStream over the file's contents.
  4. A CSVReader (a DataReader for CSV input) wraps that stream and is passed to the Job.run() method, which transfers all records to a NullWriter. The writer can be of any desired type, depending on where the records should go (see the sketch after this list).
  5. Finally, the count of records read from the file is printed to the console.
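The NullWriter in step 4 discards records while still counting them. To inspect the records instead, it can be swapped for the StreamWriter shown commented out in the listing above. A minimal sketch, assuming the s3 file system from the listing is already open and points at the same bucket and file:

            // Read the same S3 file, but print each record to the console
            InputStream inputStream = s3.readFile("datapipeline-test-01", "output/trades.csv");

            DataReader reader = new CSVReader(new InputStreamReader(inputStream));
            DataWriter writer = StreamWriter.newSystemOutWriter();  // writes records to System.out
            Job.run(reader, writer);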


Console Output

The number of records read from the file is printed to the console.
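For example, if the CSV file contained 1,000 rows (a hypothetical count), the program would print:

Records read: 1000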
