DataPipeline 10.0 Released

Welcome to the 10.0 release of DataPipeline.

This release adds a range of developer-focused enhancements, including the ability to read data tables from PDF documents. We’ve also added new APIs for accessing nested readers/writers, expanded FieldPath and Record capabilities, and introduced new options for Avro, Parquet, and PDF output.

Core Changes

  1. New Reader/Writer access methods – DataReader.getReader(Class type) and DataWriter.getWriter(Class type) retrieve the first nested reader or writer of the given type, and the matching Job.getReader(Class type) and Job.getWriter(Class type) do the same across a job (see the nested-reader sketch after this list).
  2. Expanded FieldPath API
    • FieldPath.parse(String fieldPathExpression, boolean failOnEmpty) to control whether an empty path expression is rejected during parsing.
    • FieldPath.getValueNode(ValueNode, boolean create, boolean throwException) for targeted node retrieval.
    • FieldPath.setValue(ValueNode, boolean create, boolean throwException, Object value) for value assignment.
    • FieldPath.getNode(Node, boolean create, boolean throwException) for flexible navigation.
  3. Record API improvements (see the field-path sketch after this list)
    • Record.setField(FieldPath, boolean createField, boolean throwException, Object value)
    • Record.setField(FieldPath, Object value)
    • Record.setFieldNull(FieldPath, FieldType type)
  4. Deduplication metrics – RemoveDuplicatesReader now offers getUniqueRecordCount() and getDuplicateRecordCount().
  5. Excel reading optimization – ExcelReader.read() now skips empty rows when the skipEmptyRows option is enabled.
  6. Tests verified through JDK 24.
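
To see the nested-reader lookup in action, here's a minimal sketch that wraps a CSVReader in a TransformingReader, runs the job, and then pulls the nested CSVReader back out. The file name is hypothetical and the job setup is just the usual DataPipeline pattern; only the getReader(CSVReader.class) call is new in 10.0.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.TransformingReader;

public class NestedReaderLookup {

    public static void main(String[] args) {
        // Hypothetical input file; the CSVReader ends up nested inside a TransformingReader.
        DataReader reader = new TransformingReader(
                new CSVReader(new File("orders.csv")).setFieldNamesInFirstRow(true));
        DataWriter writer = StreamWriter.newSystemOutWriter();  // print records to the console

        Job.run(reader, writer);

        // New in 10.0: look up the first nested reader of a given type.
        DataReader nestedCsvReader = reader.getReader(CSVReader.class);
        System.out.println("CSV records read: " + nestedCsvReader.getRecordCount());
    }
}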
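
And here's a small sketch of the new field-path overloads on Record. It assumes that a dotted path such as customer.address.city creates the missing parent fields when createField is true, in line with the FieldPath.getValueNode description above; the values are made up.

import com.northconcepts.datapipeline.core.FieldPath;
import com.northconcepts.datapipeline.core.Record;

public class SetFieldByPath {

    public static void main(String[] args) {
        Record record = new Record();

        // New in 10.0: set a value by path, creating any missing parent fields
        // (createField = true) and throwing if the path cannot be resolved (throwException = true).
        FieldPath cityPath = FieldPath.parse("customer.address.city");
        record.setField(cityPath, true, true, "Toronto");

        // Shorter overload once the field exists.
        record.setField(cityPath, "Ottawa");

        System.out.println(record);
    }
}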


Foundation Changes

  1. Column sample values – Added Column.setSampleValue(String sampleValue) for attaching sample data to schema definitions.


Integration Changes

  1. Avro/Parquet sanitization – Added removeUnsupportedChars property to AvroWriter, AvroPipelineOutput, ParquetDataWriter, and ParquetPipelineOutput to automatically strip unsupported characters.
  2. PDF output options – New pageSize, pageOrientation, and customStyle properties in PdfWriter and PdfPipelineOutput for better layout control.
  3. PDF table extraction (New Feature) – Added support for reading structured data tables from PDFs via PdfDocument, PdfReader, and PdfPipelineInput (see the sketch after this list).
  4. Bug fix – PdfWriter.close() now correctly closes its underlying output stream.
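
To give a feel for the new PDF table extraction, the sketch below streams the rows of a PDF table into a CSV file. The file names are invented, and the PdfReader constructor and package shown are assumptions modelled on the existing PdfWriter; see the JavaDocs and examples for the released API.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.pdf.PdfReader;

public class ReadPdfTableToCsv {

    public static void main(String[] args) {
        // Hypothetical files; constructor assumed to take the source PDF.
        // PdfReader exposes each table row as a record.
        DataReader reader = new PdfReader(new File("invoices.pdf"));
        DataWriter writer = new CSVWriter(new File("invoices.csv"));

        Job.run(reader, writer);
    }
}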


Get started with DataPipeline 10.0 today to take advantage of the new PDF table-reading capability and the many enhancements across the core, foundations, and integration layers.

See the CHANGELOG for the full set of updates in DP 10.0.0.

Also see the JavaDocs and examples for more info.

Happy coding!

About The DataPipeline Team

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.
