Data Pipeline Features
Multiple Data Formats & Connectors
DataPipeline provides a variety of endpoints to read and write data in different formats, including databases, flat files, APIs, and your own custom connectors.
DataPipeline does not impose a particular structure on your data. Records can contain tabular data, hierarchical data, or any combination you need.
Record & Field-Level Metadata
Metadata can be added to records and fields as they flow through the pipeline. This can include anything you need, such as owner information, tags, or other business attributes.
Streaming XPath, XML, JSON
DataPipeline has a built-in streaming XPath engine that allows you to query very large XML and JSON files without loading the entire structure into memory.
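DataPipeline's streaming XPath engine is part of the library itself; as a stand-in, the same streaming idea can be sketched with the JDK's StAX pull parser, which walks an XML document event by event without ever building the full tree in memory. The element name `order` below is purely illustrative.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Minimal streaming read: count matching elements without building a DOM.
// This uses the JDK's StAX pull parser, not DataPipeline's XPath engine.
public class StreamingXmlCount {

    public static int countElements(String xml, String localName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                // Only start-element events are inspected; the rest stream past.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (javax.xml.stream.XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>";
        System.out.println(countElements(xml, "order")); // prints 2
    }
}
```

Because only the current event is held in memory, the same loop handles gigabyte-sized files that a DOM parser could not load.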
Fast Database Loading
Database loading can be sped up using a variety of multi-row write strategies, batch updates, and concurrent connections.
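One common multi-row strategy can be sketched in plain Java (this is illustrative SQL text assembly, not the DataPipeline API): a single INSERT carries many parameter rows, so the database does one round trip instead of one per record.

```java
import java.util.List;
import java.util.stream.Collectors;

// Builds one multi-row INSERT statement: a single round trip instead of N.
// Parameter binding via PreparedStatement is omitted for brevity.
public class MultiRowInsert {

    public static String build(String table, List<String> columns, int rowCount) {
        // One "(?, ?, ...)" group per row, matching the column count.
        String placeholders = "(" + columns.stream().map(c -> "?")
                .collect(Collectors.joining(", ")) + ")";
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
                .append(" (").append(String.join(", ", columns)).append(") VALUES ");
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) sql.append(", ");
            sql.append(placeholders);
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("user", List.of("id", "name"), 3));
        // INSERT INTO user (id, name) VALUES (?, ?), (?, ?), (?, ?)
    }
}
```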
Filter, Validate, Transform
DataPipeline includes a variety of operators to select, clean, and massage data as it flows through your pipelines.
Split, Clone, Merge
Use the built-in or custom strategies to divide, duplicate, or merge streams of data.
De-duplication
Remove repeating data by looking at all values in each record or a subset of its fields.
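Both de-duplication modes can be sketched in plain Java (records modeled as field-to-value maps rather than DataPipeline's own record type): when no key fields are named, the whole record is the duplicate key; otherwise only the chosen fields are compared.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// De-duplicate records on all values, or on a chosen subset of fields.
public class Deduplicator {

    public static <R extends Map<String, ?>> List<R> dedupe(List<R> records,
                                                            String... keyFields) {
        Set<Object> seen = new HashSet<>();
        List<R> out = new ArrayList<>();
        for (R record : records) {
            Object key;
            if (keyFields.length == 0) {
                key = record;                        // compare every field
            } else {
                List<Object> k = new ArrayList<>();  // compare only the key fields
                for (String f : keyFields) k.add(record.get(f));
                key = k;
            }
            if (seen.add(key)) out.add(record);      // first occurrence wins
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = List.of(
                Map.of("id", 1, "city", "Toronto"),
                Map.of("id", 2, "city", "Toronto"),
                Map.of("id", 1, "city", "Toronto"));
        System.out.println(dedupe(records).size());         // 2: full-record duplicates removed
        System.out.println(dedupe(records, "city").size()); // 1: only city compared
    }
}
```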
Lookups / Joins
Enrich and combine data from multiple sources regardless of format or structure.
Sliding Window Aggregation
Summarize the entire data stream, or a subset of it, continuously using sliding window aggregation. Configure the sliding windows using the built-in operators and policies or define your own.
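The core mechanic can be sketched in plain Java (a count-based window with a running sum; DataPipeline's own window operators and policies offer more): each new value updates the aggregate immediately, and the oldest value is evicted once the window is full.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: keeps the last N values and reports a
// continuously updated sum as each new value arrives.
public class SlidingWindowSum {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    public SlidingWindowSum(int capacity) {
        this.capacity = capacity;
    }

    // Add a value, evicting the oldest once the window is full.
    public double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > capacity) {
            sum -= window.removeFirst();
        }
        return sum; // the aggregate over at most the last `capacity` values
    }

    public static void main(String[] args) {
        SlidingWindowSum w = new SlidingWindowSum(3);
        for (double v : new double[] {1, 2, 3, 4}) {
            System.out.println(w.add(v)); // 1.0, 3.0, 6.0, then 9.0 (2+3+4)
        }
    }
}
```

Time-based windows work the same way, except eviction is driven by timestamps instead of a count.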
Validation Rules
Validate and convert data using a target schema definition and a set of expression-based rules.
Data Mapping
Transform data from source to target using the expression language to map each field.
Decision Tables
Replace complicated decision logic with simple input-outcome grids and make them accessible to non-developers and configurable at runtime.
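The idea behind an input-outcome grid can be sketched in plain Java (the `DiscountTable` name and discount rules below are hypothetical, not part of the DataPipeline API): each row pairs a condition with an outcome, the first matching row wins, and the rows themselves could just as easily be loaded from configuration at runtime.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// A decision table as an ordered input->outcome grid: the first row whose
// condition matches the input determines the outcome.
public class DiscountTable {

    private final Map<Predicate<Double>, String> rows = new LinkedHashMap<>();

    public DiscountTable row(Predicate<Double> condition, String outcome) {
        rows.put(condition, outcome);
        return this;
    }

    public String decide(double orderTotal) {
        for (Map.Entry<Predicate<Double>, String> row : rows.entrySet()) {
            if (row.getKey().test(orderTotal)) return row.getValue();
        }
        return "none"; // fall-through outcome when no row matches
    }

    public static void main(String[] args) {
        DiscountTable table = new DiscountTable()
                .row(total -> total >= 1000, "15%")
                .row(total -> total >= 100, "5%");
        System.out.println(table.decide(1500)); // 15%
        System.out.println(table.decide(250));  // 5%
        System.out.println(table.decide(20));   // none
    }
}
```

Because each row is just a condition-outcome pair, the grid can be shown to non-developers as a plain table and edited without touching application code.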
Decision Trees
Express complicated logic as simple flow charts to decouple decision logic from application code and make it accessible to non-technical users.
Declarative Jobs
Create DataPipeline jobs declaratively by defining inputs, outputs, schemas, actions, and data mappings that can be saved to and loaded from JSON and XML.
JDBC Metadata
Use the JDBC metadata classes to analyze your database structure, generate database code, or document your schemas.
Manage & Monitor Jobs
DataPipeline provides several ways to monitor and manage your jobs. Watch records as they flow by, handle completion and errors, and pause, resume, or cancel jobs as needed.
JMX Monitoring
Use the standard Java Management Extensions (JMX) to monitor and manage your pipelines and event buses.
Contextual Debugging
When errors occur, stack traces are accompanied by key-value pairs describing the context: the current record being processed, the endpoints and operators involved, and the current state of the pipeline.
Failure Retry / Back-off
DataPipeline supports several retry strategies and handles various failure scenarios.
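One widely used strategy, exponential back-off, can be sketched in plain Java (the `Retry` helper below is illustrative, not the DataPipeline API): each failed attempt doubles the wait before the next try, and the last failure is rethrown once attempts run out.

```java
import java.util.concurrent.Callable;

// Retry with exponential back-off: wait base, 2*base, 4*base, ... between
// attempts, rethrowing the last failure when attempts are exhausted.
public class Retry {

    // Delay before retrying after the given zero-based attempt: base * 2^attempt.
    public static long backoffDelayMillis(long baseMillis, int attempt) {
        return baseMillis << attempt;
    }

    public static <T> T withRetry(Callable<T> task, int maxAttempts, long baseMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoffDelayMillis(baseMillis, attempt));
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```

The growing delays give a struggling downstream system room to recover instead of hammering it with immediate retries.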
Meter and Throttle
Measure the speed and volume of data flowing through the pipeline. Use throttles to limit throughput and avoid saturating downstream systems.
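A throttle's pacing logic can be sketched in plain Java (the `Throttle` class below is illustrative, not the DataPipeline API): compare how long the work *should* have taken at the target rate against how long it actually took, and sleep for the difference whenever the pipeline is running ahead.

```java
// A simple rate throttle: given a target records-per-second rate, compute how
// long to pause after each record so downstream systems aren't saturated.
public class Throttle {
    private final double maxRecordsPerSecond;
    private long recordCount;

    public Throttle(double maxRecordsPerSecond) {
        this.maxRecordsPerSecond = maxRecordsPerSecond;
    }

    // How far ahead of the target rate we are, in milliseconds (<= 0 means on pace).
    public static long sleepMillis(long recordCount, double maxRecordsPerSecond,
                                   long elapsedMillis) {
        long expectedElapsed = (long) (recordCount / maxRecordsPerSecond * 1000);
        return expectedElapsed - elapsedMillis;
    }

    // Call after each record; sleeps only when running faster than the target.
    public void recordProcessed(long elapsedMillis) throws InterruptedException {
        recordCount++;
        long sleep = sleepMillis(recordCount, maxRecordsPerSecond, elapsedMillis);
        if (sleep > 0) Thread.sleep(sleep);
    }

    public static void main(String[] args) throws InterruptedException {
        Throttle throttle = new Throttle(200); // cap at 200 records/second
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            throttle.recordProcessed(System.currentTimeMillis() - start);
        }
        System.out.println("10 records metered at <= 200/sec");
    }
}
```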
Statistics & Feedback
Receive statistics and feedback from your pipelines as they are running.
Multi-threaded Pipelines
Use separate threads to process parts of the pipeline in parallel, using all your CPU cores to complete work faster.
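The fan-out pattern can be sketched with the JDK's own executor framework (plain Java, not the DataPipeline threading API): records are submitted to a pool sized to the machine's core count, each worker transforms its share, and the results are collected back in input order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Fan records out across a worker pool so each CPU core transforms a share
// of the stream, then collect the results in the original order.
public class ParallelTransform {

    public static List<String> transformAll(List<String> records) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String r : records) {
                futures.add(pool.submit(() -> r.toUpperCase())); // per-record work
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    out.add(f.get()); // waiting in submit order preserves ordering
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(transformAll(List.of("a", "b", "c"))); // [A, B, C]
    }
}
```

`toUpperCase` stands in for whatever per-record work the pipeline stage performs; the heavier that work, the larger the speed-up from running it on every core.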
Dynamic Expression Language
Configure logic to map, transform, and filter at runtime with the built-in dynamic expression language.
Data Lineage
Add metadata to records and fields indicating where they were loaded from for audits, reconciliation, and troubleshooting. These include file names, line and column numbers, database queries, and more.
Event Bus
Deliver notifications asynchronously with the in-memory publish-subscribe event bus.
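The publish-subscribe mechanic can be sketched in plain Java (a minimal bus, not the DataPipeline event bus API): publishers hand events to a single-threaded dispatcher, which delivers them to every subscriber asynchronously.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// A tiny in-memory publish-subscribe bus: publish() returns immediately and
// a dispatcher thread delivers the event to every subscriber.
public class EventBus {
    private final List<Consumer<Object>> subscribers = new CopyOnWriteArrayList<>();
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    public void subscribe(Consumer<Object> subscriber) {
        subscribers.add(subscriber);
    }

    public void publish(Object event) {
        // Asynchronous delivery: the publisher does not wait for subscribers.
        dispatcher.execute(() -> subscribers.forEach(s -> s.accept(event)));
    }

    public void shutdown() {
        dispatcher.shutdown();
        try {
            dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.subscribe(event -> System.out.println("received: " + event));
        bus.publish("job-completed");
        bus.shutdown(); // waits so the async delivery is observed before exit
    }
}
```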
SQL DSL
Build dynamic SQL easily using a fluent API or domain-specific language.
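The fluent style can be sketched in plain Java (the `Select` class below is a hypothetical illustration, not the DataPipeline SQL API): each chained call accumulates part of the statement, and the final SQL string is assembled on demand.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal fluent builder for SELECT statements: calls chain, and toSql()
// assembles the final statement.
public class Select {
    private final String table;
    private final List<String> columns = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();

    public Select(String table) {
        this.table = table;
    }

    public Select column(String name) {
        columns.add(name);
        return this;
    }

    public Select where(String condition) {
        conditions.add(condition); // multiple conditions are ANDed together
        return this;
    }

    public String toSql() {
        String cols = columns.isEmpty() ? "*" : String.join(", ", columns);
        String sql = "SELECT " + cols + " FROM " + table;
        if (!conditions.isEmpty()) {
            sql += " WHERE " + String.join(" AND ", conditions);
        }
        return sql;
    }

    public static void main(String[] args) {
        String sql = new Select("user")
                .column("id").column("name")
                .where("age >= 18")
                .where("country = 'CA'")
                .toSql();
        System.out.println(sql);
        // SELECT id, name FROM user WHERE age >= 18 AND country = 'CA'
    }
}
```

Because each clause is added conditionally at runtime, the same builder serves queries whose filters depend on user input or configuration.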