DataPipeline 8.0 Released

 

In late December, we released DataPipeline version 8.0.0 to general availability. This might be our longest list of new features and changes yet. Let’s dive in.

Support for java.time.Instant

DataPipeline now converts Instant values to DATETIME fields. Previously, Instant values were left untouched and typed as FieldType.UNDEFINED. With this change, they are stored internally as java.util.Date. Here are the related changes:

  1. ArrayValue now has getValueAsInstant(int) to retrieve date-time values as Instant.
  2. Field now has getValueAsInstant() and setValue(Instant value) to convert Instant to and from date-time values.
  3. Record now has getFieldValueAsInstant(String fieldPath, Instant defaultValue) to retrieve date-time values as Instant.
  4. SingleValue now has getValueAsInstant() to retrieve date-time values as Instant.
  5. ValueNode now converts java.time.Instant values to DATETIME values (java.util.Date).
  6. Added BasicFieldTransformer.nullToValue(final Instant value).
  7. SetField now supports java.time.Instant.
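
Because Instant values end up as java.util.Date, it helps to recall how the JDK itself converts between the two types. The sketch below uses only JDK classes and does not call the DataPipeline API; the class name InstantDateRoundTrip is hypothetical. It shows one consequence of any Date-based storage: precision finer than a millisecond does not survive the round trip.

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical demo class; uses only the JDK, not DataPipeline.
class InstantDateRoundTrip {

    // Convert an Instant to the legacy java.util.Date (millisecond precision).
    static Date toDate(Instant instant) {
        return Date.from(instant);
    }

    // Convert a java.util.Date back to an Instant.
    static Instant toInstant(Date date) {
        return date.toInstant();
    }

    public static void main(String[] args) {
        Instant original = Instant.parse("2023-12-28T10:15:30.123456789Z");
        Instant roundTripped = toInstant(toDate(original));

        System.out.println(original);     // 2023-12-28T10:15:30.123456789Z
        System.out.println(roundTripped); // 2023-12-28T10:15:30.123Z
    }
}
```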

AsyncTaskReader

We added AsyncTaskReader as a convenient way to apply work to a DataReader using multiple threads. See Process Incoming Data on Multiple Threads in the user guide to learn more.
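
For readers new to the idea, here is a plain-JDK sketch of applying work to a sequence of items on several threads. It is not AsyncTaskReader and makes no claim about how that class is implemented or how it orders its output; ParallelWork and applyAll are hypothetical names, and the choice to return results in input order is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; a plain-JDK illustration, not AsyncTaskReader itself.
class ParallelWork {

    // Applies `task` to every input on a fixed-size thread pool and
    // returns the results in the same order as the inputs.
    static <T, R> List<R> applyAll(List<T> inputs, Function<T, R> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> job = () -> task.apply(input);
                futures.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("alpha", "beta", "gamma");
        List<String> result = applyAll(rows, String::toUpperCase, 2);
        System.out.println(result); // [ALPHA, BETA, GAMMA]
    }
}
```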

DataPipeline Expression Language (DPEL)

The DataPipeline Expression Language now contains a reference to “this” for scenarios when you need access to the expression context (or current record). Normally, referencing a field that does not exist throws an exception. However, several methods (like recordContainsNonNullValue()) can test for fields using their string names. See the Validate Root Record Using Schema code example to learn more.

 

JDBC Improvements

  1. The JdbcReader class now has an improved algorithm for mapping and reading database values into Java. The new default algorithm is OPINIONATED and will be used when no algorithm is specified. Learn more about the options for JDBC mappings and see examples under JDBC Type Mapping.
  2. JdbcConnection now relies on JdbcValueReader to map SQL types to Java (defaulting to JdbcValueReader.OPINIONATED).
  3. Improved RDBMS metadata classes by adding:
    1. JdbcIndex, JdbcFieldIndex, and JdbcQueryParameter to the com.northconcepts.datapipeline.foundations.jdbc package to model more RDBMS metadata.
    2. Parameters and cardinality to JdbcQuery.
    3. JavaType, className, and signed to JdbcQueryColumn.
    4. ClassName, signed, isNumericSqlType(), isBooleanSqlType(), and isTemporalSqlType() to JdbcTableColumn.
    5. JdbcResultPage for use in generated DAO classes for tables, views, and queries.
    6. Indexes to JdbcTable.

 

Streaming JSON Lines Writer

This release adds support for reading and writing the JSON Lines streaming format. You can see JSON Lines examples and read about the JSON Lines format.
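
If you have not met the format before, JSON Lines stores one JSON value per line, so a file can be written and read record by record. The sketch below is hand-rolled on the JDK purely to show the shape of the output; it does not use DataPipeline's reader or writer classes. JsonLinesSketch and its methods are hypothetical names, and it assumes flat records with non-null string values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo; hand-rolled on the JDK to show the file format only.
class JsonLinesSketch {

    // Escapes the characters that must not appear raw inside a JSON string.
    static String quote(String text) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    // Serializes one flat record of string fields as a single-line JSON object.
    static String toJsonLine(Map<String, String> record) {
        StringBuilder line = new StringBuilder("{");
        for (Map.Entry<String, String> field : record.entrySet()) {
            if (line.length() > 1) {
                line.append(',');
            }
            line.append(quote(field.getKey())).append(':').append(quote(field.getValue()));
        }
        return line.append('}').toString();
    }

    // JSON Lines: one JSON value per line, each terminated by '\n'.
    static String toJsonLines(List<Map<String, String>> records) {
        StringBuilder out = new StringBuilder();
        for (Map<String, String> record : records) {
            out.append(toJsonLine(record)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "1");
        first.put("note", "line one\nline two");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "2");
        second.put("note", "plain");

        List<Map<String, String>> records = new ArrayList<>();
        records.add(first);
        records.add(second);
        System.out.print(toJsonLines(records));
    }
}
```

Note that the newline inside the first record's note is escaped, so each record still occupies exactly one physical line.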

 

Attributes and Tags

This release includes the ability to add arbitrary data in the form of key-value attributes and classifiers in the form of tags to many DataPipeline Foundations objects.  These include SchemaDef, EntityDef, DataMapping, and many more classes.  See the Add Attributes and Tags to Schemas example to learn more.

 

Problem Detection in Declarative Models

As part of the push towards declarative models, you can now detect common problems in specific models.  For example: duplicate fields, missing related entities, invalid array boundaries, and more.  This feature can be used in interactive applications to present model issues to users as they occur.  See the Show Schema Model Problems and Show Data Mapping Model Problems examples to see it in action.

 

Diff Package

This release adds a new package to model the differences between two objects and their children.  The new com.northconcepts.datapipeline.foundations.difference package contains the framework classes and the RecordDiff implementation to compare Records.  This release also includes a second com.northconcepts.datapipeline.foundations.schema.diff package to compare classes in the schema package.  See Compare Schemas Using Diff for an example of diff in action.

 

Date and Time Pattern Detection

In addition to DataPipeline’s ability to detect the date-time patterns in strings, it now has DateTimePatternMatch.fieldType to indicate if the match is a DATETIME, DATE, or TIME. See Detect Date and Time Patterns in Strings for an example of it in use.

 

Number Detection

You can now analyze numbers in strings with a new NumberDetector class. Unlike simply parsing a number, this class lets you determine if the number is real or whole as well as its precision, scale, and sign. See Detect Numbers in Strings for an example of its usage.
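
To make the terms concrete, the sketch below shows what precision, scale, and sign mean for a few numeric strings, using java.math.BigDecimal from the JDK. It does not use NumberDetector, whose API is not shown in this post; NumberTraits and its methods are hypothetical names.

```java
import java.math.BigDecimal;

// Hypothetical demo; uses java.math.BigDecimal, not the NumberDetector API.
class NumberTraits {

    // True when the value has no fractional part (7.00 counts as whole).
    static boolean isWhole(BigDecimal value) {
        return value.signum() == 0 || value.stripTrailingZeros().scale() <= 0;
    }

    // Summarizes a numeric string; throws NumberFormatException if it is not a number.
    static String describe(String text) {
        BigDecimal value = new BigDecimal(text.trim());
        return "precision=" + value.precision()
                + " scale=" + value.scale()
                + " sign=" + value.signum()
                + " whole=" + isWhole(value);
    }

    public static void main(String[] args) {
        System.out.println(describe("-123.450")); // precision=6 scale=3 sign=-1 whole=false
        System.out.println(describe("42"));       // precision=2 scale=0 sign=1 whole=true
        System.out.println(describe("7.00"));     // precision=3 scale=2 sign=1 whole=true
    }
}
```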

 

Dataset Management and Analysis

The Dataset analysis function has been improved with the following changes:

  1. Dataset analysis now includes arrayValueCount, minimumArrayElements, maximumArrayElements, numberDescriptor, bigNumberDescriptor, getBestFitFieldType(), and getInferredFieldType() in each Column.
  2. You can now disable detectTemporalValues, detectNumericValues, detectBooleanValues, detectBigNumberValues, and collectUniqueValues to streamline the amount of analysis work performed.
  3. Dataset now implements Iterable<Record> for easy use in enhanced for loops.
  4. MvStoreDataset now has static convenience methods to create and open datasets on disk: createTempDataset(AbstractPipeline pipeline), createTempDataset(File databaseFolder, AbstractPipeline pipeline), createDataset(File databaseFile, AbstractPipeline pipeline), openDataset(File databaseFile).
  5. You can now set the MvStoreDataset.deleteDatabaseFileOnClose flag for automatic dataset cleanup.

 

Schema Changes

  1. You can now generate Java code from the schema classes using methods like SchemaDef.generateJavaCode(JavaCodeBuilder) and EntityDef.generateJavaCode(JavaCodeBuilder).
  2. An EntityDef can now extend another EntityDef in the same schema (is-a relationship). See the Validate Data using Entity Inheritance example.
  3. The schema package can now better model relational databases:
    1. EntityDef can now have indexes like a database table. See the example of how to Add Indexes to Entities.
    2. SchemaDef can now have EntityRelationshipDefs to connect tables like relationships in an RDBMS. See the Add Relationship Between Entities example.
    3. FieldDef can now model primary keys and arrays. See the Add Primary Keys to Entities and Add Array Fields to Entities examples.
  4. NumericFieldDef now validates numeric precision and scale.
  5. Added RecordFieldDef to model complex, nested structures (has-a and uses-a relationships). See the Add Nested Entities as Fields example.
  6. TemporalFieldDef now uses DateTimeFormatter instead of SimpleDateFormat.
  7. Added lenientPattern (default true) to indicate if date-time parsing should allow inputs that fall outside their allowed ranges (for example February 30, 2017).
  8. allowBlank now defaults to true.
  9. ValidationMessage now includes a stacktrace field.
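
To illustrate what lenient date parsing means in item 7, the sketch below parses February 30, 2017 with the JDK's DateTimeFormatter under two resolver styles. This is plain java.time, not the TemporalFieldDef API; how lenientPattern maps onto these JDK settings is an assumption of this sketch, and LenientParsing is a hypothetical class name.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

// Hypothetical demo; plain java.time, not the TemporalFieldDef API.
class LenientParsing {

    // Parses with the given resolver style; returns null when the input is rejected.
    static LocalDate parse(String text, ResolverStyle style) {
        DateTimeFormatter formatter =
                DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(style);
        try {
            return LocalDate.parse(text, formatter);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String february30 = "2017-02-30";
        System.out.println(parse(february30, ResolverStyle.LENIENT)); // 2017-03-02
        System.out.println(parse(february30, ResolverStyle.STRICT));  // null
    }
}
```

In lenient mode the out-of-range day rolls over into March; in strict mode the input is rejected.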

 

Jira Changes

  1. Added JiraService to ease working with the JiraClient (REST API).
  2. The JiraClient (and JiraService) was expanded to create, update, delete, and transition issues. See the Transition Jira Issue example.
  3. The client now also includes methods to create and retrieve issue comments. See the Jira comment examples.
  4. JiraSearch can now search by a set of Jira IDs. See the Search for Jira Issues by Multiple IDs example.

 

Mailchimp Changes

Mailchimp now includes the archived status in addition to the existing subscribed, unsubscribed, cleaned, and pending statuses. See Mailchimp examples.

 

Parquet Changes

  1. ParquetDataWriter now analyzes the entire dataset to generate its schema when no schema is explicitly provided. See it in the Write a Parquet file using schema from data example.
  2. Added columnStatsReaderThreads to set the number of background threads to use when analyzing values to generate the schema (default is 2).
  3. Added defaultBigNumberPrecision to set the schema for writing BigDecimal and BigInteger values that are null.
  4. Added GenerateSchemaDefFromParquet tool to generate a SchemaDef from a Parquet file or schema.

 

Bugfixes

  1. AsyncMultiReader no longer blocks when repeatedly attempting to read past the last record.
  2. moveField(), moveFieldBefore(), and moveFieldAfter() no longer decrement the target index if it’s less than the source index.
  3. ExcelReader now respects the startingColumn property.
  4. SimpleJsonReader no longer skips fields containing only an empty string.
  5. RetryingReader and RetryingWriter now retry up to maxRetryCount times instead of maxRetryCount-1.
  6. The Java code generator in DataMapping now generates escaped strings for non-printable characters.
  7. ExcelPipelineInput now escapes the sheet name in generated Java code.
  8. FixedWidthPipelineInput now escapes the field names in generated Java code.
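
The retry fix in item 5 is an off-by-one correction, which is easy to picture with a small loop. The sketch below is not the RetryingReader or RetryingWriter code; RetrySketch and getWithRetries are hypothetical names, and reading the fix as "one initial attempt plus up to maxRetryCount retries" is an assumption of this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper; illustrates the counting rule, not RetryingReader's code.
class RetrySketch {

    // Runs `operation` once, then retries up to `maxRetryCount` more times on failure.
    static <T> T getWithRetries(Supplier<T> operation, int maxRetryCount) {
        if (maxRetryCount < 0) {
            throw new IllegalArgumentException("maxRetryCount must be >= 0");
        }
        RuntimeException last = null;
        // attempt 0 is the initial call; attempts 1..maxRetryCount are the retries
        for (int attempt = 0; attempt <= maxRetryCount; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and loop again if retries remain
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        String result = getWithRetries(() -> {
            if (calls.incrementAndGet() <= 2) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 2);
        System.out.println(result + " after " + calls.get() + " calls"); // ok after 3 calls
    }
}
```

Had the loop condition been attempt < maxRetryCount, the same operation would have failed for good after two calls, which is the kind of off-by-one the bugfix describes.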

Other Changes

See the CHANGELOG for the full set of updates in DP 8.0.0.

About The DataPipeline Team

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.
