Data Pipeline Features

Multiple Data Formats & Connectors

DataPipeline has a variety of endpoints to read and write data in different formats. These include databases, flat files, APIs, as well as your own custom connectors.

Flexible Schema

DataPipeline does not impose a particular structure on your data. Records can contain tabular data, hierarchical data, or any combination you need.

Record & Field-Level Metadata

Metadata can be added to records and fields as they flow through the pipeline. This can include anything you need, such as owner information, tags, or other business attributes.

Streaming XPath, XML, JSON

DataPipeline has a built-in streaming XPath engine that allows you to query very large XML and JSON files without loading the entire structure into memory.
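
To illustrate the general idea of streaming over XML without building a full tree, here is a minimal sketch that uses the JDK's standard StAX pull parser. It is not DataPipeline's XPath engine or API: the class name `StreamingXmlSketch` and the method `textOf` are hypothetical, and the sketch matches elements by name only rather than evaluating XPath expressions.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of DataPipeline.
final class StreamingXmlSketch {

    /** Collects the text of every element with the given name, reading the input event by event. */
    static List<String> textOf(Reader xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String> values = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // Only the current parse event is inspected; no document tree is built.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // getElementText() requires a text-only element and advances to its end tag.
                    values.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

In this sketch the only values retained are the matched ones; a real pipeline would typically hand each match to the next stage instead of collecting them in a list.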

JDBC Accelerators

Database loading can be sped up using a variety of multi-row write strategies, batch updates, and concurrent connections.

Filter, Validate, Transform

DataPipeline includes a variety of operators to select, clean, and massage data as it flows through your pipelines.

Split, Clone, Merge

Use built-in or custom strategies to divide, duplicate, or merge streams of data.

De-Duplicate

Remove duplicate records by comparing all values in each record or only a subset of fields.
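
The idea of de-duplicating on a subset of fields can be sketched in plain Java as follows. This is an illustration only, not DataPipeline's API: records are modeled as `Map<String, Object>`, and `DedupSketch.dedupe` is a hypothetical helper that keeps the first record seen for each distinct key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of DataPipeline.
final class DedupSketch {

    /** Keeps the first record seen for each distinct combination of the key fields. */
    static List<Map<String, Object>> dedupe(List<Map<String, Object>> records, List<String> keyFields) {
        Set<List<Object>> seen = new HashSet<>();
        List<Map<String, Object>> unique = new ArrayList<>();
        for (Map<String, Object> record : records) {
            // Build the comparison key from the selected fields only.
            List<Object> key = new ArrayList<>();
            for (String field : keyFields) {
                key.add(record.get(field));
            }
            // Set.add returns false when an equal key was already seen.
            if (seen.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }
}
```

Passing every field name as `keyFields` compares whole records; passing a shorter list compares only those fields.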

Lookups / Joins

Enrich and combine data from multiple sources regardless of format or structure.
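
As a rough illustration of a lookup-style enrichment, the following plain-Java sketch merges fields from an in-memory lookup table into each record. It is not DataPipeline's API: `LookupSketch.enrich`, the map-based record model, and the behavior of passing unmatched records through unchanged are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DataPipeline.
final class LookupSketch {

    /** Copies each record and adds the fields of the lookup row whose key matches keyField. */
    static List<Map<String, Object>> enrich(List<Map<String, Object>> records,
                                            String keyField,
                                            Map<Object, Map<String, Object>> lookup) {
        List<Map<String, Object>> enriched = new ArrayList<>();
        for (Map<String, Object> record : records) {
            Map<String, Object> merged = new LinkedHashMap<>(record);
            Object key = record.get(keyField);
            Map<String, Object> match = (key == null) ? null : lookup.get(key);
            if (match != null) {
                merged.putAll(match); // lookup fields overwrite same-named record fields
            }
            enriched.add(merged);     // unmatched records pass through unchanged
        }
        return enriched;
    }
}
```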

Aggregation

Summarize the entire data stream or a subset continuously using sliding window aggregation. Configure the sliding windows using the built-in operators and policies or define your own.
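
A sliding window aggregation can be sketched in a few lines of plain Java. The example below is not DataPipeline's API; it assumes a count-based window over the last N values and computes a running average, with `SlidingWindowAverage` being a hypothetical class name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical class, not part of DataPipeline.
final class SlidingWindowAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlidingWindowAverage(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    /** Adds a value, evicts the oldest one once the window is full, and returns the current average. */
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

Keeping a running sum means each new value costs constant time regardless of window size. A time-based window would evict by timestamp instead of by count.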

Schema Validation

Validate and convert data using a target schema definition and a set of expression-based rules.

Data Mapping

Transform data from source to target using the expression language to map each field.

Decision Tables

Replace complicated decision logic with simple input-outcome grids and make them accessible to non-developers and configurable at runtime.
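
To make the input-outcome grid idea concrete, here is a minimal plain-Java sketch of a first-match decision table. It is not DataPipeline's API: `DecisionTableSketch`, its `when` and `evaluate` methods, and the first-match-wins policy are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical class, not part of DataPipeline.
final class DecisionTableSketch {

    /** One row of the grid: a condition on the input and the outcome it produces. */
    private static final class Row {
        final Predicate<Map<String, Object>> condition;
        final String outcome;

        Row(Predicate<Map<String, Object>> condition, String outcome) {
            this.condition = condition;
            this.outcome = outcome;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    private final String defaultOutcome;

    DecisionTableSketch(String defaultOutcome) {
        this.defaultOutcome = defaultOutcome;
    }

    /** Appends a row; rows are evaluated in the order they were added. */
    DecisionTableSketch when(Predicate<Map<String, Object>> condition, String outcome) {
        rows.add(new Row(condition, outcome));
        return this;
    }

    /** Returns the outcome of the first matching row, or the default if none match. */
    String evaluate(Map<String, Object> input) {
        for (Row row : rows) {
            if (row.condition.test(input)) {
                return row.outcome;
            }
        }
        return defaultOutcome;
    }
}
```

Because the rows are data rather than nested `if` statements, they could be loaded from configuration and changed without recompiling the calling code.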

Decision Trees

Express complicated logic as simple flow charts to decouple decisioning logic from application code and make it accessible to non-technical users.

Pipelines

Create DataPipeline jobs declaratively by defining inputs, outputs, schemas, actions, and data mappings that can be saved to and loaded from JSON and XML.

JDBC Metadata

Use the JDBC metadata classes to analyze your database structure, generate database code, or document your schemas.

Job Management

DataPipeline provides several ways to monitor and manage your jobs. Watch records as they flow by, handle completion and errors, and pause, resume, or cancel jobs as needed.

JMX Monitoring

Use the standard Java Management Extensions (JMX) to monitor and manage your pipelines and event buses.

Advanced Troubleshooting

Exceptions carry key-value pairs alongside their stack traces, including the current record being processed, the endpoints and operators involved, and the current state of the pipeline.

Failure Retry / Back-off

DataPipeline supports several retry strategies and handles various failure scenarios.
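
One common retry strategy is exponential back-off, sketched below in plain Java. This is not DataPipeline's API: `RetrySketch.retry` and its parameters are hypothetical, and the doubling delay is just one possible policy.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of DataPipeline.
final class RetrySketch {

    /**
     * Runs the task, retrying on any exception up to maxAttempts times in total.
     * The pause before each retry doubles, starting at initialDelayMillis.
     */
    static <T> T retry(Callable<T> task, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2; // exponential back-off
            }
        }
    }
}
```

A production version would usually retry only on failures known to be transient and cap the maximum delay.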

Meter and Throttle

Measure the speed and amount of data flowing through the pipeline. Use throttles to reduce data speeds to prevent saturating downstream systems.
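
The relationship between metering and throttling can be sketched as follows. This is an illustration in plain Java, not DataPipeline's API: `MeteredThrottleSketch` is a hypothetical class that counts records and computes how long the caller should pause to stay at or under a target rate. Timestamps are passed in explicitly so the arithmetic is easy to follow.

```java
// Hypothetical class, not part of DataPipeline.
final class MeteredThrottleSketch {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    private final long maxRecordsPerSecond;
    private final long startNanos;
    private long count;

    MeteredThrottleSketch(long maxRecordsPerSecond, long startNanos) {
        if (maxRecordsPerSecond <= 0) {
            throw new IllegalArgumentException("maxRecordsPerSecond must be positive");
        }
        this.maxRecordsPerSecond = maxRecordsPerSecond;
        this.startNanos = startNanos;
    }

    /** Counts one record and returns the pause (in nanoseconds) needed to respect the limit. */
    long recordAndGetDelayNanos(long nowNanos) {
        count++;
        // Earliest time at which `count` records fit within the allowed rate.
        long earliestAllowed = startNanos + count * NANOS_PER_SECOND / maxRecordsPerSecond;
        return Math.max(0L, earliestAllowed - nowNanos);
    }

    /** Meter: average records per second observed since the start. */
    double recordsPerSecond(long nowNanos) {
        long elapsed = nowNanos - startNanos;
        return elapsed <= 0 ? 0.0 : count * (double) NANOS_PER_SECOND / elapsed;
    }
}
```

A caller would typically pass `System.nanoTime()` as the timestamp and sleep for the returned delay, for example with `TimeUnit.NANOSECONDS.sleep(delay)`, before emitting the next record.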

Real-Time Reporting

Receive statistics and feedback from your pipelines as they are running.

Multi-Threaded Pipelines

Use separate threads to process parts of the pipeline in parallel, using all your CPU cores to complete work faster.
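
As a generic illustration of processing records on several threads, the sketch below applies an operation to each record using a fixed thread pool from `java.util.concurrent`. It is not DataPipeline's API: `ParallelStageSketch.mapInParallel` is a hypothetical helper, and submitting one task per record is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of DataPipeline.
final class ParallelStageSketch {

    /** Applies the operation to every record on a fixed pool of worker threads, preserving order. */
    static <I, O> List<O> mapInParallel(List<I> records, Function<I, O> operation, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I record : records) {
                tasks.add(() -> operation.apply(record));
            }
            List<O> results = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing `Runtime.getRuntime().availableProcessors()` as `threads` is a reasonable starting point for CPU-bound work.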

Expression Language

Configure logic to map, transform, and filter at runtime with the built-in dynamic expression language.

Data Lineage

Add metadata to records and fields indicating where they were loaded from for audits, reconciliation, and troubleshooting. These include file names, line and column numbers, database queries, and more.

Event Bus

Deliver notifications asynchronously with the in-memory publish-subscribe event bus.
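
The publish-subscribe pattern behind an in-memory event bus can be sketched in plain Java as shown below. This is not DataPipeline's event bus API: `EventBusSketch`, its topic-string model, and the single delivery thread are assumptions made to keep the example small.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical class, not part of DataPipeline.
final class EventBusSketch implements AutoCloseable {
    private final Map<String, List<Consumer<Object>>> subscribers = new ConcurrentHashMap<>();
    // A single delivery thread keeps events in publish order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Registers a handler for all events published to the topic. */
    void subscribe(String topic, Consumer<Object> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    /** Queues the event for delivery and returns immediately; handlers run on the delivery thread. */
    void publish(String topic, Object event) {
        for (Consumer<Object> handler : subscribers.getOrDefault(topic, List.of())) {
            executor.execute(() -> handler.accept(event));
        }
    }

    /** Stops accepting events and waits briefly for queued deliveries to finish. */
    @Override
    public void close() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because `publish` only enqueues work, a slow subscriber delays other deliveries on the bus thread but does not block the publisher.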

SQL DSL

Build dynamic SQL easily using a fluent API or domain-specific language (DSL).
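
To show what a fluent SQL builder can look like in general, here is a minimal plain-Java sketch. It is not DataPipeline's SQL DSL: `SelectSketch` and its `select`, `from`, `where`, `toSql`, and `parameters` methods are hypothetical names, and the builder supports only simple `AND` conditions with `?` placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical class, not part of DataPipeline.
final class SelectSketch {
    private final List<String> columns = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();
    private final List<Object> parameters = new ArrayList<>();
    private String table;

    private SelectSketch() {
    }

    static SelectSketch select(String... columnNames) {
        SelectSketch query = new SelectSketch();
        query.columns.addAll(Arrays.asList(columnNames));
        return query;
    }

    SelectSketch from(String tableName) {
        this.table = tableName;
        return this;
    }

    /** Adds a condition containing one '?' placeholder and records its bound value. */
    SelectSketch where(String condition, Object value) {
        conditions.add(condition);
        parameters.add(value);
        return this;
    }

    /** Renders the statement; values stay out of the SQL text and are bound separately. */
    String toSql() {
        StringBuilder sql = new StringBuilder("SELECT ")
                .append(String.join(", ", columns))
                .append(" FROM ")
                .append(table);
        if (!conditions.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sql.toString();
    }

    List<Object> parameters() {
        return Collections.unmodifiableList(parameters);
    }
}
```

The rendered string and the parameter list would then be passed to a JDBC `PreparedStatement`, which keeps user-supplied values out of the SQL text.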