DataPipeline 10.0 Released

Welcome to the 10.0 release of DataPipeline.

This release adds a range of developer-focused enhancements, including the ability to read data tables from PDF documents. We’ve also added new APIs for accessing nested readers/writers, expanded FieldPath and Record capabilities, and introduced new options for Avro, Parquet, and PDF output.

Core Changes

  1. New Reader/Writer access methods – DataReader.getReader(Class type) and DataWriter.getWriter(Class type) retrieve the first nested reader or writer of the given type, and the matching Job.getReader(Class type) and Job.getWriter(Class type) do the same across a job (see the nested-reader sketch after this list).
  2. Expanded FieldPath API
    • FieldPath.parse(String fieldPathExpression, boolean failOnEmpty) to control whether an empty path expression is rejected during parsing.
    • FieldPath.getValueNode(ValueNode, boolean create, boolean throwException) for targeted node retrieval.
    • FieldPath.setValue(ValueNode, boolean create, boolean throwException, Object value) for value assignment.
    • FieldPath.getNode(Node, boolean create, boolean throwException) for flexible navigation.
  3. Record API improvements (see the field-path sketch after this list)
    • Record.setField(FieldPath, boolean createField, boolean throwException, Object value)
    • Record.setField(FieldPath, Object value)
    • Record.setFieldNull(FieldPath, FieldType type)
  4. Deduplication metrics – RemoveDuplicatesReader now offers getUniqueRecordCount() and getDuplicateRecordCount().
  5. Excel reading optimization – ExcelReader.read() now skips empty rows when the skipEmptyRows option is enabled.
  6. Tests verified through JDK 24.
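
To see the nested-reader lookup in action, here's a minimal sketch that wraps a CSVReader in a TransformingReader, runs the job, and then pulls the nested CSVReader back out. The file name is hypothetical and the job setup is just the usual DataPipeline pattern; only the getReader(CSVReader.class) call is new in 10.0.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.TransformingReader;

public class NestedReaderLookup {

    public static void main(String[] args) {
        // Hypothetical input file; the CSVReader ends up nested inside a TransformingReader.
        DataReader reader = new TransformingReader(
                new CSVReader(new File("orders.csv")).setFieldNamesInFirstRow(true));
        DataWriter writer = StreamWriter.newSystemOutWriter();  // print records to the console

        Job.run(reader, writer);

        // New in 10.0: look up the first nested reader of a given type.
        DataReader nestedCsvReader = reader.getReader(CSVReader.class);
        System.out.println("CSV records read: " + nestedCsvReader.getRecordCount());
    }
}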
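
And here's a small sketch of the new field-path overloads on Record. It assumes that a dotted path such as customer.address.city creates the missing parent fields when createField is true, in line with the FieldPath.getValueNode description above; the values are made up.

import com.northconcepts.datapipeline.core.FieldPath;
import com.northconcepts.datapipeline.core.Record;

public class SetFieldByPath {

    public static void main(String[] args) {
        Record record = new Record();

        // New in 10.0: set a value by path, creating any missing parent fields
        // (createField = true) and throwing if the path cannot be resolved (throwException = true).
        FieldPath cityPath = FieldPath.parse("customer.address.city");
        record.setField(cityPath, true, true, "Toronto");

        // Shorter overload once the field exists.
        record.setField(cityPath, "Ottawa");

        System.out.println(record);
    }
}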


Foundation Changes

  1. Column sample values – Added Column.setSampleValue(String sampleValue) for attaching sample data to schema definitions.


Integration Changes

  1. Avro/Parquet sanitization – Added removeUnsupportedChars property to AvroWriter, AvroPipelineOutput, ParquetDataWriter, and ParquetPipelineOutput to automatically strip unsupported characters.
  2. PDF output options – New pageSize, pageOrientation, and customStyle properties in PdfWriter and PdfPipelineOutput for better layout control.
  3. PDF table extraction (New Feature) – Added support for reading structured data tables from PDFs via PdfDocument, PdfReader, and PdfPipelineInput (see the sketch after this list).
  4. Bug fix – PdfWriter.close() now correctly closes its underlying output stream.
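
To give a feel for the new PDF table extraction, the sketch below streams the rows of a PDF table into a CSV file. The file names are invented, and the PdfReader constructor and package shown are assumptions modelled on the existing PdfWriter; see the JavaDocs and examples for the released API.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.pdf.PdfReader;

public class ReadPdfTableToCsv {

    public static void main(String[] args) {
        // Hypothetical files; constructor assumed to take the source PDF.
        // PdfReader exposes each table row as a record.
        DataReader reader = new PdfReader(new File("invoices.pdf"));
        DataWriter writer = new CSVWriter(new File("invoices.csv"));

        Job.run(reader, writer);
    }
}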


Get started with DataPipeline 10.0 today to take advantage of the new PDF table-reading capability and the many enhancements across the core, foundations, and integration layers.

See the CHANGELOG for the full set of updates in DP 10.0.0.

Also see the JavaDocs and examples for more info.

Happy coding!

About The DataPipeline Team

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.
