Store Dataset on Disk

Updated: Jan 12, 2023

In this example, you will learn how to store your dataset on disk using MvStoreDataset. MvStoreDataset caches the dataset's records on disk using MVStore.

This example can be easily modified to show how to Store Dataset In Memory.

Input CSV file

Account,LastName,FirstName,Balance,CreditLimit,AccountCreated,Rating
101,Reeves,Keanu,9315.45,10000.00,1/17/1998,A
312,Butler,Gerard,90.00,1000.00,8/6/2003,B
868,Hewitt,Jennifer Love,0,17000.00,5/25/1985,B
761,Pinkett-Smith,Jada,49654.87,100000.00,12/5/2006,A
317,Murray,Bill,789.65,5000.00,2/5/2007,C

Java code listing

package com.northconcepts.datapipeline.foundations.examples.pipeline;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.foundations.pipeline.Pipeline;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset;
import com.northconcepts.datapipeline.foundations.pipeline.dataset.MvStoreDataset;

import java.io.File;
import java.nio.file.Files;

public class StoreDatasetOnDisk {
    private static final File DATABASE_FOLDER = new File("example/data/output", "mvstore-dataset");

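    public static void main(String[] args) throws Throwable {
        // Read the input CSV, using the first row as field names
        DataReader reader = new CSVReader(new File("example/data/input/credit-balance-01.csv"))
                .setFieldNamesInFirstRow(true);

        // The pipeline supplies the dataset's records
        Pipeline pipeline = new Pipeline().setInputAsDataReader(reader);

        File databaseFile = new File(DATABASE_FOLDER, "StoreDatasetOnDisk.mvstore");

        // Remove any database file left behind by a previous run
        if (databaseFile.exists()) {
            Files.delete(databaseFile.toPath());
        }

        // Create a dataset backed by the MVStore file on disk
        Dataset dataset = MvStoreDataset.createDataset(databaseFile, pipeline);

        dataset.setColumnStatsReaderThreads(1);
        dataset.load();                      // asynchronous; returns immediately
        dataset.waitForColumnStatsToLoad();  // block until column stats are ready

        dataset.close();

        // Reopen the persisted dataset from disk and print its first record
        dataset = MvStoreDataset.openDataset(databaseFile);

        System.out.println(dataset.getRecord(0));
    }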

}
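
The exists-then-delete check in the listing can also be collapsed into a single call using only the Java standard library. The following is a minimal sketch, not part of the original example: the class name DeleteIfExistsSketch, the helper resetFile, and the temporary file standing in for StoreDatasetOnDisk.mvstore are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteIfExistsSketch {

    // Hypothetical helper: removes a stale file if present.
    // Returns true only when a file was actually deleted.
    static boolean resetFile(Path file) throws IOException {
        return Files.deleteIfExists(file);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the database file, created in the system temp directory.
        Path stale = Files.createTempFile("StoreDatasetOnDisk", ".mvstore");

        System.out.println(resetFile(stale)); // prints "true": the file existed and was deleted
        System.out.println(resetFile(stale)); // prints "false": nothing left to delete
    }
}
```

Files.deleteIfExists does not throw when the file is missing, which is why no separate exists() check is needed in this variant.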

Code walkthrough

DATABASE_FOLDER is the folder where the dataset will be stored.
A CSVReader is created using the file path of the input file credit-balance-01.csv.
The CSVReader.setFieldNamesInFirstRow(true) method is invoked so that the values in the first row are used as field names.
An instance of Pipeline is created and receives the reader as its input (new Pipeline().setInputAsDataReader(reader)).
databaseFile contains the path to the storage location and the name of the dataset to be stored.
databaseFile.exists() checks whether a database file from a previous run already exists in the storage location. If it does, Files.delete() removes it.
MvStoreDataset.createDataset() creates a persistent dataset on disk in the specified database file. The file will remain on disk even after the dataset is closed and the JVM exits.
The method accepts two parameters: databaseFile, the file in which the dataset's contents are stored, and pipeline, the source of the dataset's data.
dataset.setColumnStatsReaderThreads(1) specifies the number of threads to use to process column stats (default 2).
dataset.load() starts the asynchronous loading of records from the pipeline into this dataset. This method returns immediately and does not wait for loading to complete.
dataset.waitForColumnStatsToLoad() blocks until all column stats have finished loading.
dataset.close() terminates any asynchronous data loading and column stats calculation.
To read the dataset back from disk, MvStoreDataset.openDataset(databaseFile) is used.
Finally, dataset.getRecord(0), the first record in the dataset, is printed.
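
The load, wait, close sequence described above follows a common asynchronous pattern. The sketch below illustrates that pattern with plain Java concurrency classes only. It is an analogy under stated assumptions, not how MvStoreDataset is implemented: ToyDataset, waitForLoad, and the in-memory record list are hypothetical names invented for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncLoadSketch {

    // Hypothetical stand-in for a dataset; NOT the Data Pipeline API.
    static class ToyDataset implements AutoCloseable {
        private final ExecutorService executor = Executors.newFixedThreadPool(1);
        private final List<String> records = new CopyOnWriteArrayList<>();
        private CompletableFuture<Void> loading = CompletableFuture.completedFuture(null);

        // Starts loading on a background thread and returns immediately.
        void load(List<String> source) {
            loading = CompletableFuture.runAsync(() -> records.addAll(source), executor);
        }

        // Blocks the caller until the background load has finished.
        void waitForLoad() {
            loading.join();
        }

        String getRecord(int index) {
            return records.get(index);
        }

        // Shuts down the background thread; any unfinished work is abandoned.
        @Override
        public void close() {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        try (ToyDataset dataset = new ToyDataset()) {
            dataset.load(List.of("101,Reeves,Keanu", "312,Butler,Gerard"));
            dataset.waitForLoad();                    // without this, the read below could race the load
            System.out.println(dataset.getRecord(0)); // prints "101,Reeves,Keanu"
        }
    }
}
```

The point of the sketch is the ordering: because the load call returns before the work is done, the caller has to wait explicitly before reading results or closing the resource.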

Console output

Record (MODIFIED) {
    0:[Account]:STRING=[101]:String
    1:[LastName]:STRING=[Reeves]:String
    2:[FirstName]:STRING=[Keanu]:String
    3:[Balance]:STRING=[9315.45]:String
    4:[CreditLimit]:STRING=[10000.00]:String
    5:[AccountCreated]:STRING=[1/17/1998]:String
    6:[Rating]:STRING=[A]:String
}
