Data Pipeline Features
Multiple Data Formats & Connectors
DataPipeline provides a variety of endpoints for reading and writing data in different formats, including databases, flat files, and APIs, as well as your own custom connectors.
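For example, a minimal read-write job can be sketched as below, assuming the CSVReader, StreamWriter, and Job classes that appear in the library's typical examples; the file path is illustrative and package names may differ between versions.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;

public class CsvToConsole {
    public static void main(String[] args) {
        // Read a CSV file, treating the first row as field names
        DataReader reader = new CSVReader(new File("example/data/input.csv"))
                .setFieldNamesInFirstRow(true);

        // Write each record to the console
        DataWriter writer = new StreamWriter(System.out);

        // Stream records from the reader to the writer
        Job.run(reader, writer);
    }
}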
Flexible Schema
DataPipeline does not impose a particular structure on your data. Records can contain tabular data, hierarchical data, or any combination you need.
Record & Field-Level Metadata
Metadata can be added to records and fields as they flow through the pipeline. This can include anything you need, such as owner information, tags, or other business attributes.
Streaming XPath, XML, JSON
DataPipeline has a built-in streaming XPath engine that allows you to query very large XML and JSON files without loading the entire structure into memory.
JDBC Accelerators
Database loading can be sped up using a variety of multi-row write strategies, batch updates, and concurrent connections.
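A sketch of loading CSV data into a database table follows; JdbcWriter and Job reflect typical examples, while the batch-size setter and connection details are assumptions used to illustrate the accelerator settings; check your version for the exact options.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.jdbc.JdbcWriter;
import com.northconcepts.datapipeline.job.Job;

public class CsvToDatabase {
    public static void main(String[] args) throws Exception {
        // Illustrative in-memory database connection
        Connection connection = DriverManager.getConnection("jdbc:h2:mem:example", "sa", "");

        DataReader reader = new CSVReader(new File("example/data/orders.csv"))
                .setFieldNamesInFirstRow(true);

        // Write to the "orders" table; the batch-size setter below is an
        // assumed name standing in for the multi-row/batch write strategies
        JdbcWriter writer = new JdbcWriter(connection, "orders");
        writer.setBatchSize(500);

        Job.run(reader, writer);

        connection.close();
    }
}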
Filter, Validate, Transform
DataPipeline includes a variety of operators to select, clean, and massage data as it flows through your pipelines.
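For instance, a filtering step might be wired as in this sketch, which assumes the FilteringReader and FilterExpression classes seen in typical examples and uses the built-in expression language for the condition.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.filter.FilterExpression;
import com.northconcepts.datapipeline.filter.FilteringReader;
import com.northconcepts.datapipeline.job.Job;

public class FilterExample {
    public static void main(String[] args) {
        DataReader reader = new CSVReader(new File("example/data/accounts.csv"))
                .setFieldNamesInFirstRow(true);

        // Keep only records that satisfy the expression
        FilteringReader filteringReader = new FilteringReader(reader);
        filteringReader.add(new FilterExpression("balance > 1000 && country == 'CA'"));

        Job.run(filteringReader, new StreamWriter(System.out));
    }
}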
Split, Clone, Merge
Use the built-in or custom strategies to divide, duplicate, or merge streams of data.
De-Duplicate
Remove repeating data by looking at all values in each record or a subset of its fields.
Lookups / Joins
Enrich and combine data from multiple sources regardless of format or structure.
Aggregation
Summarize the entire data stream or a subset continuously using sliding window aggregation. Configure the sliding windows using the built-in operators and policies or define your own.
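A rough sketch of grouping with running totals is shown below; GroupByReader and its sum/count methods are assumed names based on typical examples, and sliding-window policies would be configured on the same operator.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;

public class AggregationExample {
    public static void main(String[] args) {
        DataReader reader = new CSVReader(new File("example/data/sales.csv"))
                .setFieldNamesInFirstRow(true);

        // Group records by product and emit aggregated summaries
        // (method names are assumed; see the aggregation docs for window policies)
        GroupByReader groupByReader = new GroupByReader(reader, "product");
        groupByReader.sum("amount");
        groupByReader.count("orders");

        Job.run(groupByReader, new StreamWriter(System.out));
    }
}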
Schema Validation
Validate and convert data using a target schema definition and a set of expression-based rules.
Data Mapping
Transform data from source to target using the expression language to map each field.
Decision Tables
Replace complicated decision logic with simple input-outcome grids and make them accessible to non-developers and configurable at runtime.
Decision Trees
Express complicated logic as simple flow charts to decouple decisioning logic from application code and make them accessible to non-technical users.
Pipelines
Create DataPipeline jobs declaratively by defining inputs, outputs, schemas, actions, and data mappings that can be saved and loaded from JSON and XML.
JDBC Metadata
Use the JDBC metadata classes to analyze your database structure, generate database code, or document your schemas.
Job Management
DataPipeline provides several ways to monitor and manage your jobs. Watch records as they flow by, handle completion and errors, and pause, resume, or cancel as needed.
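A sketch of controlling a running job appears below; the pause, resume, and cancel calls mirror the capabilities described above, while the asynchronous run method is an assumed name.

import java.io.File;

import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;

public class JobManagementExample {
    public static void main(String[] args) {
        Job job = new Job(
                new CSVReader(new File("example/data/large-input.csv"))
                        .setFieldNamesInFirstRow(true),
                new StreamWriter(System.out));

        // Start the job on a background thread (method name assumed;
        // some versions expose an asynchronous run variant)
        job.runAsync();

        // ... later, from a management thread
        job.pause();
        job.resume();
        job.cancel();
    }
}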
JMX Monitoring
Use the standard Java Management Extensions (JMX) to monitor and manage your pipelines and event buses.
Advanced Troubleshooting
Exceptions carry key-value pairs alongside their stack traces, including the record being processed, the endpoints and operators involved, and the current state of the pipeline.
Failure Retry / Back-off
DataPipeline supports several retry and back-off strategies for recovering from transient failure scenarios.
Meter and Throttle
Measure the rate and volume of data flowing through the pipeline. Use throttles to slow data rates and avoid saturating downstream systems.
Real-Time Reporting
Receive statistics and feedback from your pipelines as they are running.
Multi-Threaded Pipelines
Use separate threads to process parts of the pipeline in parallel, using all your CPU cores to complete work faster.
Expression Language
Configure logic to map, transform, and filter at runtime with the built-in dynamic expression language.
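For example, a calculated field driven by a runtime expression might look like the sketch below; TransformingReader and SetCalculatedField are assumed from typical examples.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.SetCalculatedField;
import com.northconcepts.datapipeline.transform.TransformingReader;

public class ExpressionExample {
    public static void main(String[] args) {
        DataReader reader = new CSVReader(new File("example/data/orders.csv"))
                .setFieldNamesInFirstRow(true);

        // Add a "total" field whose value is evaluated per record at runtime
        TransformingReader transformingReader = new TransformingReader(reader);
        transformingReader.add(new SetCalculatedField("total", "price * quantity"));

        Job.run(transformingReader, new StreamWriter(System.out));
    }
}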
Data Lineage
Add metadata to records and fields indicating where they were loaded from for audits, reconciliation, and troubleshooting. These include file names, line and column numbers, database queries, and more.
Event Bus
Deliver notifications asynchronously with the in-memory publish-subscribe event bus.
SQL DSL
Build dynamic SQL easily using a fluent API or domain-specific language.