Read a Parquet File

Updated: Jun 12, 2022

This example shows you how to read data from a Parquet file using Data Pipeline.

The demo code uses ParquetDataReader to read records from an input Parquet file and to retrieve the file's schema.

Parquet is an open source, column-oriented data file format built for efficient data storage and retrieval.

This example can easily be modified to write to a Parquet file using Data Pipeline's ParquetDataWriter.
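For instance, replacing the StreamWriter in the listing below with a ParquetDataWriter copies the records into a new Parquet file. This is a minimal sketch of that variation; the output path and class name are illustrative, not part of the original example:

package com.northconcepts.datapipeline.examples.parquet;

import java.io.File;

import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

public class WriteAParquetFile {

    public static void main(String[] args) {
        // Read records from the input Parquet file
        ParquetDataReader reader = new ParquetDataReader(new File("example/data/input/read_parquet_file.parquet"));

        // Write the records to an output Parquet file instead of the console
        // (the output path here is illustrative)
        ParquetDataWriter writer = new ParquetDataWriter(new File("example/data/output/output.parquet"));

        Job.run(reader, writer);
    }
}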

Java Code Listing

package com.northconcepts.datapipeline.examples.parquet;

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadAParquetFile {

    public static void main(String[] args) {
        // Create a reader for the input Parquet file
        ParquetDataReader reader = new ParquetDataReader(new File("example/data/input/read_parquet_file.parquet"));
        // Transfer all records from the reader to the console
        Job.run(reader, new StreamWriter(System.out));

        System.out.println("============================================================");
        System.out.println("Parquet Schema");
        System.out.println("============================================================");
        
        // Print the schema that was used to read the Parquet file
        System.out.println(reader.getSchema());
    }
}

Code Walkthrough

  1. A ParquetDataReader is created to stream records from the input Parquet file read_parquet_file.parquet.
  2. Records are transferred from the ParquetDataReader to the console via the Job.run() method. See how to compile and run Data Pipeline jobs.
  3. The reader.getSchema() method returns the schema used to read the specified Parquet file. The schema contains a group of fields called a message. Each field in the message has three attributes: a repetition, a type, and a name (see the sketch after this list).
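If you want those three attributes individually rather than the printed message block, the final println in the listing above could be replaced with a short loop. This is a hedged sketch: it assumes getSchema() returns Parquet's org.apache.parquet.schema.MessageType (whose toString() produces the "message schema { ... }" block shown in the console output below), and the class name is illustrative:

package com.northconcepts.datapipeline.examples.parquet;

import java.io.File;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class PrintParquetSchemaFields {

    public static void main(String[] args) {
        ParquetDataReader reader = new ParquetDataReader(new File("example/data/input/read_parquet_file.parquet"));
        Job.run(reader, new StreamWriter(System.out));

        // Assumption: getSchema() returns org.apache.parquet.schema.MessageType
        MessageType schema = reader.getSchema();
        for (Type field : schema.getFields()) {
            // Print each field's repetition, primitive type, and name
            // (every field in this particular file is a primitive type)
            System.out.println(field.getRepetition() + " "
                    + field.asPrimitiveType().getPrimitiveTypeName() + " "
                    + field.getName());
        }
    }
}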

ParquetDataReader

ParquetDataReader reads records from Apache Parquet columnar files. Using this class, you can also obtain the configuration, schema, filter settings, and other metadata of a given Parquet file.
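When you need per-record control instead of streaming everything with Job.run(), a DataReader can also be driven manually with its open(), read(), and close() methods. A minimal sketch of that pattern applied to this example; the class name is illustrative:

package com.northconcepts.datapipeline.examples.parquet;

import java.io.File;

import com.northconcepts.datapipeline.core.Record;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;

public class ReadParquetRecordByRecord {

    public static void main(String[] args) {
        ParquetDataReader reader = new ParquetDataReader(new File("example/data/input/read_parquet_file.parquet"));
        reader.open();
        try {
            Record record;
            // read() returns null once the end of the file is reached
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        } finally {
            reader.close();
        }
    }
}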

Console Output

0 - Record (MODIFIED) {
    0:[id]:INT=[4]:Integer
    1:[bool_col]:BOOLEAN=[true]:Boolean
    2:[tinyint_col]:INT=[0]:Integer
    3:[smallint_col]:INT=[0]:Integer
    4:[int_col]:INT=[0]:Integer
    5:[bigint_col]:LONG=[0]:Long
    6:[float_col]:FLOAT=[0.0]:Float
    7:[double_col]:DOUBLE=[0.0]:Double
    8:[date_string_col]:STRING=[03/01/09]:String
    9:[string_col]:STRING=[0]:String
    10:[timestamp_col]:DATETIME=[Sun Mar 01 12:00:00 EAT 2009]:Date
}

-----------------------------------------------
1 - Record (MODIFIED) {
    0:[id]:INT=[5]:Integer
    1:[bool_col]:BOOLEAN=[false]:Boolean
    2:[tinyint_col]:INT=[1]:Integer
    3:[smallint_col]:INT=[1]:Integer
    4:[int_col]:INT=[1]:Integer
    5:[bigint_col]:LONG=[10]:Long
    6:[float_col]:FLOAT=[1.1]:Float
    7:[double_col]:DOUBLE=[10.1]:Double
    8:[date_string_col]:STRING=[03/01/09]:String
    9:[string_col]:STRING=[1]:String
    10:[timestamp_col]:DATETIME=[Sun Mar 01 12:01:00 EAT 2009]:Date
}

-----------------------------------------------
2 - Record (MODIFIED) {
    0:[id]:INT=[6]:Integer
    1:[bool_col]:BOOLEAN=[true]:Boolean
    2:[tinyint_col]:INT=[0]:Integer
    3:[smallint_col]:INT=[0]:Integer
    4:[int_col]:INT=[0]:Integer
    5:[bigint_col]:LONG=[0]:Long
    6:[float_col]:FLOAT=[0.0]:Float
    7:[double_col]:DOUBLE=[0.0]:Double
    8:[date_string_col]:STRING=[04/01/09]:String
    9:[string_col]:STRING=[0]:String
    10:[timestamp_col]:DATETIME=[Wed Apr 01 12:00:00 EAT 2009]:Date
}

-----------------------------------------------
3 - Record (MODIFIED) {
    0:[id]:INT=[7]:Integer
    1:[bool_col]:BOOLEAN=[false]:Boolean
    2:[tinyint_col]:INT=[1]:Integer
    3:[smallint_col]:INT=[1]:Integer
    4:[int_col]:INT=[1]:Integer
    5:[bigint_col]:LONG=[10]:Long
    6:[float_col]:FLOAT=[1.1]:Float
    7:[double_col]:DOUBLE=[10.1]:Double
    8:[date_string_col]:STRING=[04/01/09]:String
    9:[string_col]:STRING=[1]:String
    10:[timestamp_col]:DATETIME=[Wed Apr 01 12:01:00 EAT 2009]:Date
}

-----------------------------------------------
4 - Record (MODIFIED) {
    0:[id]:INT=[2]:Integer
    1:[bool_col]:BOOLEAN=[true]:Boolean
    2:[tinyint_col]:INT=[0]:Integer
    3:[smallint_col]:INT=[0]:Integer
    4:[int_col]:INT=[0]:Integer
    5:[bigint_col]:LONG=[0]:Long
    6:[float_col]:FLOAT=[0.0]:Float
    7:[double_col]:DOUBLE=[0.0]:Double
    8:[date_string_col]:STRING=[02/01/09]:String
    9:[string_col]:STRING=[0]:String
    10:[timestamp_col]:DATETIME=[Sun Feb 01 12:00:00 EAT 2009]:Date
}

-----------------------------------------------
5 - Record (MODIFIED) {
    0:[id]:INT=[3]:Integer
    1:[bool_col]:BOOLEAN=[false]:Boolean
    2:[tinyint_col]:INT=[1]:Integer
    3:[smallint_col]:INT=[1]:Integer
    4:[int_col]:INT=[1]:Integer
    5:[bigint_col]:LONG=[10]:Long
    6:[float_col]:FLOAT=[1.1]:Float
    7:[double_col]:DOUBLE=[10.1]:Double
    8:[date_string_col]:STRING=[02/01/09]:String
    9:[string_col]:STRING=[1]:String
    10:[timestamp_col]:DATETIME=[Sun Feb 01 12:01:00 EAT 2009]:Date
}

-----------------------------------------------
6 - Record (MODIFIED) {
    0:[id]:INT=[0]:Integer
    1:[bool_col]:BOOLEAN=[true]:Boolean
    2:[tinyint_col]:INT=[0]:Integer
    3:[smallint_col]:INT=[0]:Integer
    4:[int_col]:INT=[0]:Integer
    5:[bigint_col]:LONG=[0]:Long
    6:[float_col]:FLOAT=[0.0]:Float
    7:[double_col]:DOUBLE=[0.0]:Double
    8:[date_string_col]:STRING=[01/01/09]:String
    9:[string_col]:STRING=[0]:String
    10:[timestamp_col]:DATETIME=[Thu Jan 01 12:00:00 EAT 2009]:Date
}

-----------------------------------------------
7 - Record (MODIFIED) {
    0:[id]:INT=[1]:Integer
    1:[bool_col]:BOOLEAN=[false]:Boolean
    2:[tinyint_col]:INT=[1]:Integer
    3:[smallint_col]:INT=[1]:Integer
    4:[int_col]:INT=[1]:Integer
    5:[bigint_col]:LONG=[10]:Long
    6:[float_col]:FLOAT=[1.1]:Float
    7:[double_col]:DOUBLE=[10.1]:Double
    8:[date_string_col]:STRING=[01/01/09]:String
    9:[string_col]:STRING=[1]:String
    10:[timestamp_col]:DATETIME=[Thu Jan 01 12:01:00 EAT 2009]:Date
}

-----------------------------------------------
8 records
04:02:34,184 DEBUG [main] datapipeline:661 - job::Success
============================================================
Parquet Schema
============================================================
message schema {
  optional int32 id;
  optional boolean bool_col;
  optional int32 tinyint_col;
  optional int32 smallint_col;
  optional int32 int_col;
  optional int64 bigint_col;
  optional float float_col;
  optional double double_col;
  optional binary date_string_col;
  optional binary string_col;
  optional int96 timestamp_col;
}