Store Dataset on Disk
Updated: Jan 12, 2023
In this example you will learn how to store your dataset on disk using MvStoreDataset. MvStoreDataset caches the dataset's records on disk using MVStore.
This example can be easily modified to show how to Store Dataset In Memory.
Input CSV file
Account,LastName,FirstName,Balance,CreditLimit,AccountCreated,Rating
101,Reeves,Keanu,9315.45,10000.00,1/17/1998,A
312,Butler,Gerard,90.00,1000.00,8/6/2003,B
868,Hewitt,Jennifer Love,0,17000.00,5/25/1985,B
761,Pinkett-Smith,Jada,49654.87,100000.00,12/5/2006,A
317,Murray,Bill,789.65,5000.00,2/5/2007,C
Java code listing
package com.northconcepts.datapipeline.foundations.examples.pipeline;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MvStoreDataset;

import java.io.File;
import java.nio.file.Files;

public class StoreDatasetOnDisk {

    private static final File DATABASE_FOLDER = new File("example/data/output", "mvstore-dataset");

    public static void main(String[] args) throws Throwable {
        DataReader reader = new CSVReader(new File("example/data/input/credit-balance-01.csv"))
                .setFieldNamesInFirstRow(true);

        Pipeline pipeline = new Pipeline().setInputAsDataReader(reader);

        File databaseFile = new File(DATABASE_FOLDER, "StoreDatasetOnDisk.mvstore");
        if (databaseFile.exists()) {
            Files.delete(databaseFile.toPath());
        }

        Dataset dataset = MvStoreDataset.createDataset(databaseFile, pipeline);
        dataset.setColumnStatsReaderThreads(1);
        dataset.load();
        dataset.waitForColumnStatsToLoad();
        dataset.close();

        dataset = MvStoreDataset.openDataset(databaseFile);
        System.out.println(dataset.getRecord(0));
    }
}
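The listing removes a leftover database file before creating the dataset. As a minimal sketch using only the JDK, the same cleanup can be written with Files.deleteIfExists, which combines the existence check and the delete in one call. The FreshDatabaseFile class and its prepare method are hypothetical names introduced for illustration. Whether MvStoreDataset creates a missing output folder itself is not stated here, so the sketch creates the folder defensively.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of DataPipeline.
public class FreshDatabaseFile {

    /**
     * Returns a File inside the given folder that does not exist on disk.
     * The folder is created if it is missing, and a file left over from an
     * earlier run is deleted.
     */
    public static File prepare(File folder, String fileName) {
        try {
            // No error is raised if the folder is already there.
            Files.createDirectories(folder.toPath());
            Path target = folder.toPath().resolve(fileName);
            // Same effect as the exists() check followed by Files.delete().
            Files.deleteIfExists(target);
            return target.toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The demo uses the system temp folder rather than example/data/output.
        File folder = new File(System.getProperty("java.io.tmpdir"), "mvstore-dataset-demo");
        File databaseFile = prepare(folder, "StoreDatasetOnDisk.mvstore");
        System.out.println(databaseFile + " exists: " + databaseFile.exists());
    }
}
```

The returned File could then be passed to MvStoreDataset.createDataset(databaseFile, pipeline) as in the listing.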
Code walkthrough
- DATABASE_FOLDER is the folder where the dataset will be stored.
- A CSVReader is created using the file path of the input file credit-balance-01.csv.
- The CSVReader.setFieldNamesInFirstRow(true) method is invoked to specify that the values in the first row should be used as field names.
- An instance of Pipeline is created, which receives the reader as input (new Pipeline().setInputAsDataReader(reader)).
- databaseFile contains the path to the storage location and the file name of the dataset to be stored.
- databaseFile.exists() checks whether the dataset file already exists in the storage location; if it does, the code that follows deletes the file.
- MvStoreDataset.createDataset() creates a persistent dataset on disk in the specified database file. The file will remain on disk even after the dataset is closed and the JVM exits.
- The method accepts two parameters: databaseFile, which is the file to store the dataset's contents, and pipeline, which is the source of the dataset's data.
- dataset.setColumnStatsReaderThreads(1) specifies the number of threads to use to process column stats (default 2).
- dataset.load() starts the asynchronous loading of records from the pipeline into this dataset. This method returns immediately and does not wait for loading to complete.
- dataset.waitForColumnStatsToLoad() waits for all the column stats to load before proceeding.
- dataset.close() terminates any asynchronous data loading and column stats calculation.
- To read the dataset back from disk, MvStoreDataset.openDataset(databaseFile) is used.
- Finally, dataset.getRecord(0), which is the first record in the dataset, is printed.
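The walkthrough notes that load() returns immediately, so the code has to wait before it closes the dataset. The general "start in the background, then wait" pattern can be sketched in plain Java with CompletableFuture. This is only an illustration of the pattern, not how MvStoreDataset is implemented: the BackgroundLoader class and its load, waitForLoad, and getLoaded methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for an asynchronously loaded dataset.
public class BackgroundLoader {

    private final List<String> source;
    private final List<String> loaded = new CopyOnWriteArrayList<>();
    private CompletableFuture<Void> task;

    public BackgroundLoader(List<String> source) {
        this.source = source;
    }

    /** Starts copying records on another thread and returns immediately. */
    public void load() {
        task = CompletableFuture.runAsync(() -> loaded.addAll(source));
    }

    /** Blocks until the background copy has finished; call load() first. */
    public void waitForLoad() {
        task.join();
    }

    public List<String> getLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        BackgroundLoader loader = new BackgroundLoader(Arrays.asList("101", "312", "868"));
        loader.load();         // returns before the copy is guaranteed to be done
        loader.waitForLoad();  // after this call, every record is available
        System.out.println(loader.getLoaded().size());
    }
}
```

Reading the loaded records before waitForLoad() returns could observe a partial result, which is why the listing waits before it calls close().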
Console output
Record (MODIFIED) {
    0:[Account]:STRING=[101]:String
    1:[LastName]:STRING=[Reeves]:String
    2:[FirstName]:STRING=[Keanu]:String
    3:[Balance]:STRING=[9315.45]:String
    4:[CreditLimit]:STRING=[10000.00]:String
    5:[AccountCreated]:STRING=[1/17/1998]:String
    6:[Rating]:STRING=[A]:String
}
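Every field in the record above takes its name from the CSV header row and holds a STRING value. The pairing of header names with row values can be illustrated in plain Java. This is a simplified sketch, not how CSVReader works internally: the FirstRowAsFieldNames class and its toRecord method are hypothetical, and the naive comma split does not handle quoted fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of DataPipeline.
public class FirstRowAsFieldNames {

    /** Pairs each header name with the value at the same position in the data line. */
    public static Map<String, String> toRecord(String headerLine, String dataLine) {
        String[] names = headerLine.split(",");
        // The -1 limit keeps trailing empty values instead of dropping them.
        String[] values = dataLine.split(",", -1);

        // LinkedHashMap preserves the column order of the header.
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            record.put(names[i], i < values.length ? values[i] : null);
        }
        return record;
    }

    public static void main(String[] args) {
        String header = "Account,LastName,FirstName,Balance,CreditLimit,AccountCreated,Rating";
        String firstRow = "101,Reeves,Keanu,9315.45,10000.00,1/17/1998,A";
        System.out.println(toRecord(header, firstRow));
    }
}
```

All values stay as text in this sketch, which matches the STRING field types shown in the console output.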