Convert Unstructured JSON to Parquet
Updated: Aug 22, 2023
This example shows how to convert unstructured JSON data into the Parquet format. By converting JSON to Parquet, you can benefit from improved query performance, reduced storage space, and enhanced compatibility with data analytics platforms. This can be used in big data processing, data warehousing, and business intelligence applications.
As the title suggests, this example focuses on complex JSON files whose objects do not all share the same keys. A separate example covers JSON files whose objects all have the same structure.
Input JSON file
[
  {
    "id": "358464",
    "t": "MSFT",
    "e": "NASDAQ",
    "l": "31.67",
    "l_cur": "31.67",
    "s": "2",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "-0.17",
    "cp": "-0.53",
    "ccol": "chr",
    "el": "31.69",
    "el_cur": "31.69",
    "elt": "Aug 1, 7:54PM EDT",
    "ec": "+0.02",
    "ecp": "0.06",
    "eccol": "chg",
    "div": "0.23",
    "yld": "2.90"
  },
  {
    "id": "419344",
    "t": "ORCL",
    "e": "NYSE",
    "l": "32.75",
    "l_cur": "32.75",
    "s": "2",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "+0.40",
    "cp": "1.24",
    "ccol": "chg",
    "el": "32.70",
    "el_cur": "32.70",
    "elt": "Aug 1, 7:15PM EDT",
    "ec": "-0.05",
    "ecp": "-0.15",
    "eccol": "chr",
    "div": "",
    "yld": "1.47"
  },
  {
    "id": "4112",
    "t": "ADBE",
    "e": "NASDAQ",
    "l": "47.70",
    "l_cur": "47.70",
    "s": "0",
    "ltt": "4:00PM EDT",
    "lt": "Aug 1, 4:00PM EDT",
    "c": "+0.42",
    "cp": "0.89",
    "ccol": "chg"
  }
]
Java Code Listing
package com.northconcepts.datapipeline.examples.cookbook;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.json.JsonReader;
import com.northconcepts.datapipeline.parquet.ParquetDataReader;
import com.northconcepts.datapipeline.parquet.ParquetDataWriter;

import java.io.File;

public class ConvertUnstructuredJsonToParquet {

    private static final File PARQUET_FILE = new File("example/data/output/WriteAParquetFileFromUnstructuredJson.parquet");

    public static void main(String[] args) {
        DataReader reader = new JsonReader(new File("example/data/input/finance.json"))
                .addField("symbol", "//array/object/t")
                .addField("exchange", "//array/object/e")
                .addField("price", "//array/object/l")
                .addField("change", "//array/object/c")
                .addRecordBreak("//array/object");

        ParquetDataWriter writer = new ParquetDataWriter(PARQUET_FILE);

        Job.run(reader, writer);

        System.out.println("=======================Reading Parquet File============================================");

        reader = new ParquetDataReader(PARQUET_FILE);
        Job.run(reader, new StreamWriter(System.out));
    }
}
Code Walkthrough
- A new JsonReader is created to read the input file finance.json.
- The fields to be read from the JSON file are declared via the addField() method.
- Since the JSON file is an array of objects, each field's path must first traverse the array: addField("exchange", "//array/object/e"). This call visits each object in the array, looks up the key named e, and stores its value in the field exchange.
- The JsonReader.addRecordBreak method is invoked with //array/object to demarcate records, so each object in the array becomes one record.
- ParquetDataWriter is created to write records to the specified Parquet file, WriteAParquetFileFromUnstructuredJson.parquet.
- Job.run(reader, writer) is used to transfer the data from the JSON file to Parquet. See how to compile and run data pipeline jobs.
- ParquetDataReader is created to read data from WriteAParquetFileFromUnstructuredJson.parquet.
- The records are then printed to the console via Job.run() and a StreamWriter, which takes System.out as a parameter to its constructor.
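To make the field mapping more concrete, here is a minimal sketch in plain Java of what the addField() and addRecordBreak() calls achieve conceptually: one output record per source object, keeping only the renamed fields. It does not use the DataPipeline API and is not how JsonReader is implemented; the class FieldMappingSketch and its methods fieldMapping(), sampleInput(), and project() are hypothetical names introduced only for this illustration, and the input is two objects from finance.json trimmed to a few keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain Java collections standing in for the JSON objects and output records.
public class FieldMappingSketch {

    // Target field name -> source key, mirroring the four addField() calls in the listing.
    static Map<String, String> fieldMapping() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("symbol", "t");
        mapping.put("exchange", "e");
        mapping.put("price", "l");
        mapping.put("change", "c");
        return mapping;
    }

    // Two objects from the input file, trimmed to a handful of keys.
    static List<Map<String, String>> sampleInput() {
        List<Map<String, String>> objects = new ArrayList<>();
        objects.add(Map.of("id", "358464", "t", "MSFT", "e", "NASDAQ", "l", "31.67", "c", "-0.17"));
        objects.add(Map.of("id", "419344", "t", "ORCL", "e", "NYSE", "l", "32.75", "c", "+0.40"));
        return objects;
    }

    // Emits one record per source object (the role of the record break) with only the mapped fields.
    static List<Map<String, String>> project(List<Map<String, String>> objects) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Map<String, String> object : objects) {
            Map<String, String> record = new LinkedHashMap<>();
            for (Map.Entry<String, String> field : fieldMapping().entrySet()) {
                record.put(field.getKey(), object.get(field.getValue()));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map<String, String> record : project(sampleInput())) {
            System.out.println(record);
        }
    }
}
```

Running main prints one map per record, with unmapped keys such as id dropped. The values stay strings, which matches the STRING field types in the console output of the real example.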
Console Output
=======================Reading Parquet File============================================
-----------------------------------------------
0 - Record (MODIFIED) {
    0:[symbol]:STRING=[MSFT]:String
    1:[exchange]:STRING=[NASDAQ]:String
    2:[price]:STRING=[31.67]:String
    3:[change]:STRING=[-0.17]:String
}
-----------------------------------------------
1 - Record (MODIFIED) {
    0:[symbol]:STRING=[ORCL]:String
    1:[exchange]:STRING=[NYSE]:String
    2:[price]:STRING=[32.75]:String
    3:[change]:STRING=[+0.40]:String
}
-----------------------------------------------
2 - Record (MODIFIED) {
    0:[symbol]:STRING=[ADBE]:String
    1:[exchange]:STRING=[NASDAQ]:String
    2:[price]:STRING=[47.70]:String
    3:[change]:STRING=[+0.42]:String
}
-----------------------------------------------
3 records