Convert Unstructured JSON to Parquet

Updated: Aug 22, 2023

This example shows how to convert unstructured JSON data into the Parquet format. Parquet is a columnar file format, so converting JSON to Parquet can improve query performance, reduce storage space, and make the data easier to use with data analytics platforms. This is useful in big data processing, data warehousing, and business intelligence applications.

As the title mentions, this example targets JSON whose objects do not all share the same structure; in the input below, the third object omits several keys that the first two contain. A separate example covers JSON files whose objects all have the same structure.

 

Input JSON file

[
  {
    "id": "358464",
    "t": "MSFT",
    "e": "NASDAQ",
    "l": "31.67",
    "l_cur": "31.67",
    "s": "2",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "-0.17",
    "cp": "-0.53",
    "ccol": "chr",
    "el": "31.69",
    "el_cur": "31.69",
    "elt": "Aug 1, 7:54PM EDT",
    "ec": "+0.02",
    "ecp": "0.06",
    "eccol": "chg",
    "div": "0.23",
    "yld": "2.90"
  },
  {
    "id": "419344",
    "t": "ORCL",
    "e": "NYSE",
    "l": "32.75",
    "l_cur": "32.75",
    "s": "2",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "+0.40",
    "cp": "1.24",
    "ccol": "chg",
    "el": "32.70",
    "el_cur": "32.70",
    "elt": "Aug 1, 7:15PM EDT",
    "ec": "-0.05",
    "ecp": "-0.15",
    "eccol": "chr",
    "div": "",
    "yld": "1.47"
  },
  {
    "id": "4112",
    "t": "ADBE",
    "e": "NASDAQ",
    "l": "47.70",
    "l_cur": "47.70",
    "s": "0",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "+0.42",
    "cp": "0.89",
    "ccol": "chg"
  }
]
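
The third object above stops at "ccol" and omits the keys from "el" through "yld", which is why this input does not map cleanly onto a single fixed structure. As a minimal sketch in plain Java (independent of the DataPipeline library; the class and method names are hypothetical), the keys shared by every object can be found by intersecting the key sets. Only a few of the keys are listed, for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of DataPipeline.
class SharedKeysSketch {

    // Returns the keys present in every object (the intersection of all key sets).
    // Assumes the list contains at least one key set.
    static Set<String> sharedKeys(List<Set<String>> keySets) {
        Set<String> shared = new LinkedHashSet<>(keySets.get(0));
        for (Set<String> keys : keySets) {
            shared.retainAll(keys);
        }
        return shared;
    }

    public static void main(String[] args) {
        // Abbreviated key sets: only some of the keys from the input above.
        Set<String> msft = Set.of("t", "e", "l", "c", "el", "div", "yld");
        Set<String> orcl = Set.of("t", "e", "l", "c", "el", "div", "yld");
        Set<String> adbe = Set.of("t", "e", "l", "c");

        Set<String> shared = sharedKeys(List.of(msft, orcl, adbe));
        System.out.println(shared.containsAll(Set.of("t", "e", "l", "c"))); // prints "true"
        System.out.println(shared.contains("el"));                          // prints "false"
    }
}
```

The four fields read in the code listing (t, e, l, c) are among the keys that all three objects share.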

 

Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.json.JsonReader;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

import java.io.File;

public class ConvertUnstructuredJsonToParquet {
    private static final File PARQUET_FILE = new File("example/data/output/WriteAParquetFileFromUnstructuredJson.parquet");

    public static void main(String[] args) {
        // Read selected values from each object in the top-level JSON array
        DataReader reader = new JsonReader(new File("example/data/input/finance.json"))
                .addField("symbol", "//array/object/t")
                .addField("exchange", "//array/object/e")
                .addField("price", "//array/object/l")
                .addField("change", "//array/object/c")
                // Each object in the array ends one record
                .addRecordBreak("//array/object");

        ParquetDataWriter writer = new ParquetDataWriter(PARQUET_FILE);

        // Transfer all records from the JSON reader to the Parquet file
        Job.run(reader, writer);

        // Read the Parquet file back and print its records to the console
        System.out.println("=======================Reading Parquet File============================================");
        reader = new ParquetDataReader(PARQUET_FILE);
        Job.run(reader, new StreamWriter(System.out));
    }
}

 

Code Walkthrough

  1. A new JsonReader is created for the input file finance.json.
  2. The fields to be read from the JSON file are declared via the addField() method.
  3. Since the JSON file is an array of objects, each field's path must first traverse the array: addField("exchange", "//array/object/e"). For each object in the array, this call reads the value stored under the key e and assigns it to the field exchange.
  4. JsonReader.addRecordBreak("//array/object") marks where each record ends, so every object in the array becomes one record.
  5. ParquetDataWriter is created to write records to the specified Parquet file WriteAParquetFileFromUnstructuredJson.parquet.
  6. Job.run(reader, writer) is used to transfer the data from the JSON file to Parquet. See how to compile and run data pipeline jobs.
  7. ParquetDataReader is created to read data from WriteAParquetFileFromUnstructuredJson.parquet.
  8. The records are then printed to the console by Job.run() with a StreamWriter, whose constructor takes System.out as a parameter.
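
The field mapping in steps 2 to 4 can be pictured with plain Java collections. The sketch below is only an illustration of the idea, not how JsonReader is implemented: FieldMappingSketch and toRecords are hypothetical names, and the input objects are abbreviated copies of the first and third objects from finance.json. Each source object produces one flat record that keeps only the mapped keys under their new names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of DataPipeline.
class FieldMappingSketch {

    // Output field name -> key in the source object, mirroring the addField() calls.
    static final Map<String, String> FIELDS = new LinkedHashMap<>();
    static {
        FIELDS.put("symbol", "t");
        FIELDS.put("exchange", "e");
        FIELDS.put("price", "l");
        FIELDS.put("change", "c");
    }

    // Builds one flat record per source object, keeping only the mapped keys.
    static List<Map<String, String>> toRecords(List<Map<String, String>> objects) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Map<String, String> object : objects) {
            Map<String, String> record = new LinkedHashMap<>();
            for (Map.Entry<String, String> field : FIELDS.entrySet()) {
                // A key missing from the object becomes null; that is this
                // sketch's choice, not a statement about the library.
                record.put(field.getKey(), object.get(field.getValue()));
            }
            records.add(record); // end of object: plays the role of the record break
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, String>> objects = List.of(
                Map.of("id", "358464", "t", "MSFT", "e", "NASDAQ", "l", "31.67", "c", "-0.17"),
                Map.of("id", "4112", "t", "ADBE", "e", "NASDAQ", "l", "47.70", "c", "+0.42"));
        for (Map<String, String> record : toRecords(objects)) {
            System.out.println(record);
        }
        // prints:
        // {symbol=MSFT, exchange=NASDAQ, price=31.67, change=-0.17}
        // {symbol=ADBE, exchange=NASDAQ, price=47.70, change=+0.42}
    }
}
```

Unmapped keys such as id are dropped, which matches the four-field records shown in the console output.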

 

Console Output

=======================Reading Parquet File============================================
-----------------------------------------------
0 - Record (MODIFIED) {
    0:[symbol]:STRING=[MSFT]:String
    1:[exchange]:STRING=[NASDAQ]:String
    2:[price]:STRING=[31.67]:String
    3:[change]:STRING=[-0.17]:String
}

-----------------------------------------------
1 - Record (MODIFIED) {
    0:[symbol]:STRING=[ORCL]:String
    1:[exchange]:STRING=[NYSE]:String
    2:[price]:STRING=[32.75]:String
    3:[change]:STRING=[+0.40]:String
}

-----------------------------------------------
2 - Record (MODIFIED) {
    0:[symbol]:STRING=[ADBE]:String
    1:[exchange]:STRING=[NASDAQ]:String
    2:[price]:STRING=[47.70]:String
    3:[change]:STRING=[+0.42]:String
}

-----------------------------------------------
3 records