Dataset (Data Pipeline JavaDoc)

java.lang.Object
- com.northconcepts.datapipeline.foundations.core.Bean
- - com.northconcepts.datapipeline.foundations.core.FoundationObject
  - - com.northconcepts.datapipeline.foundations.pipeline.dataset.Dataset

All Implemented Interfaces:

DataExceptionContributor, JsonSerializable, RecordSerializable, XmlSerializable, Closeable, Serializable, AutoCloseable, Iterable<Record>

Direct Known Subclasses:

LocalFileDataset, MemoryDataset, MvStoreDataset
```
public abstract class Dataset
extends FoundationObject
implements Iterable<Record>, Closeable
```
The base class for caching records produced by a Pipeline or DataMappingPipeline. This class handles asynchronous loading as well as column metadata and statistics.

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`Dataset.ColumnsDataReader`

Field Summary
- Fields inherited from class com.northconcepts.datapipeline.foundations.core.FoundationObject
  internalId, internalName, log, TIMESTAMP_FORMAT
- Fields inherited from interface com.northconcepts.datapipeline.core.RecordSerializable
  SERIALIZED_CLASS_NAME, TYPE
- Fields inherited from interface com.northconcepts.datapipeline.core.XmlSerializable
  XML_SERIALIZED_CLASS_NAME

Constructor Summary

Constructors
Constructor and Description
`Dataset(AbstractPipeline pipeline)`

Method Summary

Modifier and Type	Method and Description
`protected Column`	`addField(Record record, Field field, int fieldIndex)`
`protected void`	`afterColumnStatsLoaded()` Called during the data loading process after all the column stats have been loaded.
`protected abstract void`	`afterLoad()` Called at the end of the data loading process after all the records and column stats have been loaded.
`protected void`	`afterRecordsLoaded()` Called during the data loading process after all the records have been loaded.
`protected abstract void`	`beforeLoad()` Called at the start of the data loading process, but before any records or column stats have been loaded.
`void`	`cancelLoad()` Gracefully terminates the asynchronous data loading and column stats calculation, waiting up to 10 seconds in total.
`void`	`close()`
`DataReader`	`createColumnsDataReader()`
`DataReader`	`createDataReader()` Reads all data cached in this Dataset.
`DataReader`	`createDataReader(long offset, int count)` Reads a subset of data cached in this Dataset.
`protected abstract DataWriter`	`createDataWriter()` Writes records to this dataset's cache after clearing it.
`protected void`	`finalize()`
`void`	`forEach(Consumer<? super Record> consumer)` Performs the given action for each record cached in this dataset.
`DataWriterPipelineOutput`	`fromRecord(Record source)` Loads this instance's state from a record and returns `this` (for fluent API call chaining).
`abstract Column`	`getColumn(int index)`
`abstract Column`	`getColumn(String name)`
`abstract long`	`getColumnCount()`
`abstract List<String>`	`getColumnNames()`
`abstract List<Column>`	`getColumns()`
`Throwable`	`getColumnStatsException()`
`int`	`getColumnStatsReaderThreads()` The number of threads to use to process column stats (default 2).
`Throwable`	`getDataLoadException()`
`Job`	`getJob()`
`Executor`	`getJobExecutor()` Returns the `Executor` used to run the internal `Job`.
`Long`	`getMaxColumnStatsRecords()` The number of records to use when calculating column stats or `null` for all records (default `null`).
`Integer`	`getMaxRecordsToLoad()` The maximum number of records to load (the `maxRecordsToLoad` parameter) passed to the last call to `load(Integer)` or `load(Integer, JobCallback)`.
`protected abstract Column`	`getOrCreateColumn(String name, int index)`
`AbstractPipeline`	`getPipeline()`
`abstract Record`	`getRecord(long index)`
`abstract long`	`getRecordCount()`
`RecordList`	`getRecordList(long offset, int count)` Gets a subset of the records cached in this dataset.
`boolean`	`isCollectUniqueValues()` Indicates if unique values in the dataset should be collected (default `false`).
`boolean`	`isColumnStatsLoaded()` Returns `true` if all the column stats have been loaded.
`boolean`	`isDataLoaded()` Returns `true` if all the records have been loaded and all the column stats have been loaded.
`boolean`	`isDataLoading()` Returns `true` if the records or column stats are currently being loaded.
`boolean`	`isDetectBigNumberValues()` Indicates if big decimals and big integers should be analyzed to determine their precision and scale (default `true`).
`boolean`	`isDetectBooleanValues()` Indicates if boolean values should be looked for in strings and undefined types (default `true`).
`boolean`	`isDetectNumericValues()` Indicates if numeric values should be looked for in strings and undefined types (default `true`).
`boolean`	`isDetectTemporalValues()` Indicates if date/time patterns should be looked for in strings and undefined types (default `true`).
`boolean`	`isDetectUuidValues()` Indicates if UUID values should be looked for in strings and undefined types (default `true`).
`boolean`	`isInferStringTypes()` Indicates if string and undefined types should be analyzed to determine if they represent a numeric, boolean, or date/time value (default `true`).
`boolean`	`isRecordsLoaded()` Returns `true` if all the records have been loaded.
`Iterator<Record>`	`iterator()`
`Dataset`	`load()` Starts the asynchronous loading of records from the pipeline into this dataset.
`Dataset`	`load(Integer maxRecords)` Starts the asynchronous loading of records from the pipeline into this dataset.
`Dataset`	`load(Integer maxRecordsToLoad, JobCallback<DataReader,DataWriter> callback)` Starts the asynchronous loading of records from the pipeline into this dataset.
`Dataset`	`setCollectUniqueValues(boolean collectUniqueValues)` Indicates if unique values in the dataset should be collected (default `false`).
`protected Dataset`	`setColumnStatsLoaded(boolean columnStatsLoaded)`
`Dataset`	`setColumnStatsReaderThreads(int columnStatsReaderThreads)` The number of threads to use to process column stats (default 2).
`Dataset`	`setDetectBigNumberValues(boolean detectBigNumberValues)` Indicates if big decimals and big integers should be analyzed to determine their precision and scale (default `true`).
`Dataset`	`setDetectBooleanValues(boolean detectBooleanValues)` Indicates if boolean values should be looked for in strings and undefined types (default `true`).
`Dataset`	`setDetectNumericValues(boolean detectNumericValues)` Indicates if numeric values should be looked for in strings and undefined types (default `true`).
`Dataset`	`setDetectTemporalValues(boolean detectTemporalValues)` Indicates if date/time patterns should be looked for in strings and undefined types (default `true`).
`Dataset`	`setDetectUuidValues(boolean detectUuidValues)` Indicates if UUID values should be looked for in strings and undefined types (default `true`).
`Dataset`	`setInferStringTypes(boolean inferStringTypes)` Indicates if string and undefined types should be analyzed to determine if they represent a numeric, boolean, or date/time value (default `true`).
`Dataset`	`setJobExecutor(Executor jobExecutor)` Sets the `Executor` used to run the internal `Job`.
`Dataset`	`setMaxColumnStatsRecords(Long maxColumnStatsRecords)` The number of records to use when calculating column stats or `null` for all records (default `null`).
`Dataset`	`setPipeline(AbstractPipeline pipeline)`
`protected Dataset`	`setRecordsLoaded(boolean recordsLoaded)`
`Stream<Record>`	`stream()` Returns a Stream over records cached in this dataset.
`Record`	`toRecord()`
`protected void`	`updateColumns(Record record, DataWriter writer)`
`Dataset`	`waitForColumnStatsToLoad()`
`Dataset`	`waitForColumnStatsToLoad(long minRecords, long maxWaitTimeMillis)`
`Dataset`	`waitForRecordsToLoad()`
`Dataset`	`waitForRecordsToLoad(long minRecords, long maxWaitTimeMillis)`
`Dataset`	`waitUntilJobFinished()`

Methods inherited from class com.northconcepts.datapipeline.foundations.core.FoundationObject
addExceptionProperties, assertValid, assertValid, clone, exception, exception, exception, getInternalId, getInternalName, resetInternalId

Methods inherited from class com.northconcepts.datapipeline.foundations.core.Bean
toString

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.lang.Iterable
spliterator

Methods inherited from interface com.northconcepts.datapipeline.core.RecordSerializable
fromArrayValue, fromArrayValue, fromArrayValue, fromJson, fromJson, newInstanceFromRecord, newInstanceFromRecord, toArrayValue, toJson, toJson, toJson, toRecord

Methods inherited from interface com.northconcepts.datapipeline.core.JsonSerializable
newInstanceFromJson

Methods inherited from interface com.northconcepts.datapipeline.core.XmlSerializable
forEachGrandChildElement, fromXml, fromXml, fromXmlElement, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getAttribute, getChildElement, getChildElements, newInstanceFromXml, newInstanceFromXml, setAttribute, toXml, toXml, toXml, toXml, toXml, toXmlElement, writeXml, writeXml

Methods inherited from interface com.northconcepts.datapipeline.core.DataExceptionContributor
addExceptionProperties, contributeExceptionProperties, contributeExceptionProperties

- Constructor Detail
  - Dataset
```
public Dataset(AbstractPipeline pipeline)
```
- Method Detail
  - close
```
public void close()
```
    Specified by:
    
    close in interface Closeable
    
    Specified by:
    
    close in interface AutoCloseable
  - finalize
```
protected void finalize()
                 throws Throwable
```
    Overrides:
    
    finalize in class Object
    
    Throws:
    
    Throwable
  - getPipeline
```
public AbstractPipeline getPipeline()
```
  - setPipeline
```
public Dataset setPipeline(AbstractPipeline pipeline)
```
  - getRecordCount
```
public abstract long getRecordCount()
```
  - getRecord
```
public abstract Record getRecord(long index)
```
  - createDataReader
```
public DataReader createDataReader()
```
    Reads all data cached in this Dataset.
  - createDataReader
```
public DataReader createDataReader(long offset,
                                   int count)
```
    Reads a subset of data cached in this Dataset.
  - getRecordList
```
public RecordList getRecordList(long offset,
                                int count)
```
    Gets a subset of the records cached in this dataset.
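    The (offset, count) pair accepted by createDataReader(long, int) and getRecordList(long, int) lends itself to paging through a cached dataset. The following is a minimal, library-independent sketch of how such windows might be computed from a record count. The PageWindows and Window names are hypothetical helpers, not part of Data Pipeline, and the sketch assumes that offsets are zero-based, which this page does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Data Pipeline: splits a record count into
// (offset, count) windows of the shape taken by getRecordList(long, int).
final class PageWindows {

    // One window: index of its first record and the number of records in it.
    static final class Window {
        final long offset;
        final int count;

        Window(long offset, int count) {
            this.offset = offset;
            this.count = count;
        }

        @Override
        public String toString() {
            return "[" + offset + "," + count + "]";
        }
    }

    // Returns consecutive windows covering records 0..recordCount-1 (assumed
    // zero-based); the final window may hold fewer than pageSize records.
    static List<Window> windows(long recordCount, int pageSize) {
        if (recordCount < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("recordCount must be >= 0 and pageSize > 0");
        }
        List<Window> result = new ArrayList<>();
        for (long offset = 0; offset < recordCount; offset += pageSize) {
            int count = (int) Math.min(pageSize, recordCount - offset);
            result.add(new Window(offset, count));
        }
        return result;
    }

    public static void main(String[] args) {
        // Ten records in pages of four: two full windows and one short one.
        System.out.println(windows(10, 4));
    }
}
```

    With a real Dataset, the record count would come from getRecordCount(), and each window's offset and count would be passed to getRecordList(offset, count) or createDataReader(offset, count).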
  - iterator
```
public Iterator<Record> iterator()
```
    Specified by:
    
    iterator in interface Iterable<Record>
  - forEach
```
public void forEach(Consumer<? super Record> consumer)
```
    Performs the given action for each record cached in this dataset.
    
    Specified by:
    
    forEach in interface Iterable<Record>
  - stream
```
public Stream<Record> stream()
```
    Returns a Stream over records cached in this dataset.
  - getColumnCount
```
public abstract long getColumnCount()
```
  - getColumnNames
```
public abstract List<String> getColumnNames()
```
  - getColumn
```
public abstract Column getColumn(int index)
```
  - getColumn
```
public abstract Column getColumn(String name)
```
  - getOrCreateColumn
```
protected abstract Column getOrCreateColumn(String name,
                                            int index)
```
  - getColumns
```
public abstract List<Column> getColumns()
```
  - getMaxColumnStatsRecords
```
public Long getMaxColumnStatsRecords()
```
    The number of records to use when calculating column stats or null for all records (default null).
  - setMaxColumnStatsRecords
```
public Dataset setMaxColumnStatsRecords(Long maxColumnStatsRecords)
```
    The number of records to use when calculating column stats or null for all records (default null).
  - isInferStringTypes
```
public boolean isInferStringTypes()
```
    Indicates if string and undefined types should be analyzed to determine if they represent a numeric, boolean, or date/time value (default true).
  - setInferStringTypes
```
public Dataset setInferStringTypes(boolean inferStringTypes)
```
    Indicates if string and undefined types should be analyzed to determine if they represent a numeric, boolean, or date/time value (default true).
  - isDetectTemporalValues
```
public boolean isDetectTemporalValues()
```
    Indicates if date/time patterns should be looked for in strings and undefined types (default true).
  - setDetectTemporalValues
```
public Dataset setDetectTemporalValues(boolean detectTemporalValues)
```
    Indicates if date/time patterns should be looked for in strings and undefined types (default true).
  - isDetectNumericValues
```
public boolean isDetectNumericValues()
```
    Indicates if numeric values should be looked for in strings and undefined types (default true).
  - setDetectNumericValues
```
public Dataset setDetectNumericValues(boolean detectNumericValues)
```
    Indicates if numeric values should be looked for in strings and undefined types (default true).
  - isDetectBooleanValues
```
public boolean isDetectBooleanValues()
```
    Indicates if boolean values should be looked for in strings and undefined types (default true).
  - setDetectBooleanValues
```
public Dataset setDetectBooleanValues(boolean detectBooleanValues)
```
    Indicates if boolean values should be looked for in strings and undefined types (default true).
  - isDetectBigNumberValues
```
public boolean isDetectBigNumberValues()
```
    Indicates if big decimals and big integers should be analyzed to determine their precision and scale (default true).
  - setDetectBigNumberValues
```
public Dataset setDetectBigNumberValues(boolean detectBigNumberValues)
```
    Indicates if big decimals and big integers should be analyzed to determine their precision and scale (default true).
  - isDetectUuidValues
```
public boolean isDetectUuidValues()
```
    Indicates if UUID values should be looked for in strings and undefined types (default true).
  - setDetectUuidValues
```
public Dataset setDetectUuidValues(boolean detectUuidValues)
```
    Indicates if UUID values should be looked for in strings and undefined types (default true).
  - isCollectUniqueValues
```
public boolean isCollectUniqueValues()
```
    Indicates if unique values in the dataset should be collected (default false).
  - setCollectUniqueValues
```
public Dataset setCollectUniqueValues(boolean collectUniqueValues)
```
    Indicates if unique values in the dataset should be collected (default false).
  - isDataLoading
```
public boolean isDataLoading()
```
    Returns true if the records or column stats are currently being loaded.
  - isDataLoaded
```
public boolean isDataLoaded()
```
    Returns true if all the records have been loaded and all the column stats have been loaded.
  - isRecordsLoaded
```
public boolean isRecordsLoaded()
```
    Returns true if all the records have been loaded. The column stats might not have been loaded even when this method returns true since they require additional processing.
    
    See Also:
    
    isColumnStatsLoaded()
  - setRecordsLoaded
```
protected Dataset setRecordsLoaded(boolean recordsLoaded)
```
  - isColumnStatsLoaded
```
public boolean isColumnStatsLoaded()
```
    Returns true if all the column stats have been loaded. The records will already have been loaded when this method returns true, since column stats require additional processing. See isRecordsLoaded().
  - setColumnStatsLoaded
```
protected Dataset setColumnStatsLoaded(boolean columnStatsLoaded)
```
  - getDataLoadException
```
public Throwable getDataLoadException()
```
  - getColumnStatsException
```
public Throwable getColumnStatsException()
```
  - getJobExecutor
```
public Executor getJobExecutor()
```
    Returns the Executor used to run the internal Job. The default implementation calls Job.runAsync().
  - setJobExecutor
```
public Dataset setJobExecutor(Executor jobExecutor)
```
    Sets the Executor used to run the internal Job. Passing in null will result in using the default implementation which calls Job.runAsync().
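    The null-means-default rule described for setJobExecutor can be illustrated with a small stand-in. This is only a sketch: ExecutorFallbackSketch is a hypothetical class, and its default executor is injected for demonstration, whereas the real default is described above as calling Job.runAsync().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Executor;

// Hypothetical stand-in, NOT the library's Dataset: shows a fluent setter
// where passing null restores the default executor.
final class ExecutorFallbackSketch {
    private final Executor defaultExecutor;
    private Executor jobExecutor; // null means "use the default"

    ExecutorFallbackSketch(Executor defaultExecutor) {
        this.defaultExecutor = Objects.requireNonNull(defaultExecutor);
    }

    // Returns this so calls can be chained, mirroring the setter style above.
    ExecutorFallbackSketch setJobExecutor(Executor jobExecutor) {
        this.jobExecutor = jobExecutor;
        return this;
    }

    // Runs the job on the custom executor if one is set, else on the default.
    void run(Runnable job) {
        Executor executor = (jobExecutor != null) ? jobExecutor : defaultExecutor;
        executor.execute(job);
    }

    // Runs three jobs and records which executor handled each one.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        // Both executors run the task on the calling thread to keep the demo deterministic.
        Executor fallback = task -> { log.add("default"); task.run(); };
        Executor custom = task -> { log.add("custom"); task.run(); };

        ExecutorFallbackSketch sketch = new ExecutorFallbackSketch(fallback);
        sketch.run(() -> log.add("job 1"));
        sketch.setJobExecutor(custom).run(() -> log.add("job 2"));
        sketch.setJobExecutor(null).run(() -> log.add("job 3"));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```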
  - createColumnsDataReader
```
public DataReader createColumnsDataReader()
```
  - getJob
```
public Job getJob()
```
  - getColumnStatsReaderThreads
```
public int getColumnStatsReaderThreads()
```
    The number of threads to use to process column stats (default 2).
  - setColumnStatsReaderThreads
```
public Dataset setColumnStatsReaderThreads(int columnStatsReaderThreads)
```
    The number of threads to use to process column stats (default 2).
  - getMaxRecordsToLoad
```
public Integer getMaxRecordsToLoad()
```
    The maximum number of records to load (the maxRecordsToLoad parameter) passed to the last call to load(Integer) or load(Integer, JobCallback). This value is null if load() was called last or null was passed to the other load methods.
  - beforeLoad
```
protected abstract void beforeLoad()
```
    Called at the start of the data loading process, but before any records or column stats have been loaded.
  - afterLoad
```
protected abstract void afterLoad()
```
    Called at the end of the data loading process after all the records and column stats have been loaded.
  - afterRecordsLoaded
```
protected void afterRecordsLoaded()
```
    Called during the data loading process after all the records have been loaded. The column stats are unlikely to have been loaded when this method is called since they require additional processing.
  - afterColumnStatsLoaded
```
protected void afterColumnStatsLoaded()
```
    Called during the data loading process after all the column stats have been loaded. The records would have already been loaded when this method is called since column stats require additional processing.
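    Taken together, the four hook descriptions above imply an ordering: beforeLoad, afterRecordsLoaded, afterColumnStatsLoaded, afterLoad. The sketch below models that ordering with a template method. LifecycleDemo and its nested classes are hypothetical stand-ins rather than the library's implementation, and the hooks run synchronously here for clarity, whereas the real loading is asynchronous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the hook ordering implied by the descriptions above.
// It is NOT the library's Dataset and does no real loading.
final class LifecycleDemo {

    abstract static class LifecycleSketch {
        protected abstract void beforeLoad();
        protected abstract void afterLoad();
        protected void afterRecordsLoaded() { }      // optional hook, empty by default
        protected void afterColumnStatsLoaded() { }  // optional hook, empty by default

        // Template method: the assumed call order of the four hooks.
        final void runLoad() {
            beforeLoad();
            // ... records would be written to the cache here ...
            afterRecordsLoaded();
            // ... column stats would be calculated here ...
            afterColumnStatsLoaded();
            afterLoad();
        }
    }

    // Records the name of each hook as it is invoked.
    static final class Recording extends LifecycleSketch {
        final List<String> calls = new ArrayList<>();

        @Override protected void beforeLoad() { calls.add("beforeLoad"); }
        @Override protected void afterRecordsLoaded() { calls.add("afterRecordsLoaded"); }
        @Override protected void afterColumnStatsLoaded() { calls.add("afterColumnStatsLoaded"); }
        @Override protected void afterLoad() { calls.add("afterLoad"); }
    }

    static List<String> demo() {
        Recording recording = new Recording();
        recording.runLoad();
        return recording.calls;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```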
  - load
```
public Dataset load()
```
    Starts the asynchronous loading of records from the pipeline into this dataset. This method returns immediately and does not wait for loading to complete. See waitForRecordsToLoad() and waitForRecordsToLoad(long, long).
  - load
```
public Dataset load(Integer maxRecords)
```
    Starts the asynchronous loading of records from the pipeline into this dataset. This method returns immediately and does not wait for loading to complete. See waitForRecordsToLoad() and waitForRecordsToLoad(long, long).
    
    Parameters:
    
    maxRecords - the maximum records to load or null to load all records.
  - load
```
public Dataset load(Integer maxRecordsToLoad,
                    JobCallback<DataReader,DataWriter> callback)
```
    Starts the asynchronous loading of records from the pipeline into this dataset. This method returns immediately and does not wait for loading to complete. See waitForRecordsToLoad() and waitForRecordsToLoad(long, long).
    
    Parameters:
    
    maxRecordsToLoad - the maximum records to load or null to load all records.
    
    callback - the object to notify as data is being loaded.
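    The load methods above return before any records are necessarily cached, so callers need a separate blocking step. The following self-contained sketch models that start-then-wait pattern using only java.util.concurrent. AsyncLoadSketch is a hypothetical stand-in, its latch-gated record source is an assumption made for determinism, and nothing here reflects how the library schedules its internal Job.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in, NOT the library's Dataset: load() returns immediately
// and waitForRecordsToLoad() blocks until the background work has finished.
final class AsyncLoadSketch {
    private final AtomicLong recordCount = new AtomicLong();
    private final CountDownLatch sourceReady; // released when the fake source may produce records
    private final int totalRecords;
    private volatile CompletableFuture<Void> job;

    AsyncLoadSketch(CountDownLatch sourceReady, int totalRecords) {
        this.sourceReady = sourceReady;
        this.totalRecords = totalRecords;
    }

    // Starts loading on a background thread and returns this at once.
    AsyncLoadSketch load() {
        job = CompletableFuture.runAsync(() -> {
            try {
                sourceReady.await(); // simulate a source that is not ready yet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            for (int i = 0; i < totalRecords; i++) {
                recordCount.incrementAndGet(); // "cache" one record
            }
        });
        return this;
    }

    // Blocks until the background load has completed.
    AsyncLoadSketch waitForRecordsToLoad() {
        if (job == null) {
            throw new IllegalStateException("load() has not been called");
        }
        job.join();
        return this;
    }

    long getRecordCount() {
        return recordCount.get();
    }

    // Returns the record counts seen right after load() and after waiting.
    static long[] demo() {
        CountDownLatch gate = new CountDownLatch(1);
        AsyncLoadSketch sketch = new AsyncLoadSketch(gate, 5).load();
        long before = sketch.getRecordCount(); // load() returned, nothing cached yet
        gate.countDown();                      // let the fake source produce records
        long after = sketch.waitForRecordsToLoad().getRecordCount();
        return new long[] { before, after };
    }

    public static void main(String[] args) {
        long[] counts = demo();
        System.out.println("before wait: " + counts[0] + ", after wait: " + counts[1]);
    }
}
```

    With a real Dataset, the equivalent chain would be a call to load() followed by waitForRecordsToLoad(), both of which are declared above as returning Dataset.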
  - cancelLoad
```
public void cancelLoad()
```
    Gracefully terminates the asynchronous data loading and column stats calculation, waiting up to 10 seconds in total.
  - waitForRecordsToLoad
```
public Dataset waitForRecordsToLoad(long minRecords,
                                    long maxWaitTimeMillis)
```
  - waitForRecordsToLoad
```
public Dataset waitForRecordsToLoad()
```
  - waitUntilJobFinished
```
public Dataset waitUntilJobFinished()
```
  - waitForColumnStatsToLoad
```
public Dataset waitForColumnStatsToLoad(long minRecords,
                                        long maxWaitTimeMillis)
```
  - waitForColumnStatsToLoad
```
public Dataset waitForColumnStatsToLoad()
```
  - createDataWriter
```
protected abstract DataWriter createDataWriter()
```
    Writes records to this dataset's cache after clearing it.
  - updateColumns
```
protected void updateColumns(Record record,
                             DataWriter writer)
```
  - addField
```
protected Column addField(Record record,
                          Field field,
                          int fieldIndex)
```
  - toRecord
```
public Record toRecord()
```
    Specified by:
    
    toRecord in interface RecordSerializable
    
    Overrides:
    
    toRecord in class Bean
  - fromRecord
```
public DataWriterPipelineOutput fromRecord(Record source)
```
    Description copied from interface: RecordSerializable
    
    Loads this instance's state from a record and returns this (for fluent API call chaining). To support chaining, an overriding method should narrow the declared return type to its own class.
    
    Specified by:
    
    fromRecord in interface RecordSerializable
    
    Overrides:
    
    fromRecord in class Bean
    
    Returns:
    
    this instance.
