One approach to increase record throughput is to decouple the reading of data from the rest of the pipeline.
If you're working with a particularly slow data source, it might not make sense to stop reading from it while the rest of the pipeline executes.
For this reason, Data Pipeline provides an AsyncReader that eagerly reads data from other readers into a buffer using a separate thread.
Downstream requests for data can be satisfied directly from the buffer without blocking (waiting). Blocking only occurs if the buffer is empty when a downstream request for data is made.
This example eagerly buffers up to 1 MB of data from a CSV file in parallel with the rest of the pipeline.
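Independently of the library, the buffering behaviour described above can be sketched with nothing but java.util.concurrent. This is a minimal illustration under stated assumptions, not Data Pipeline's implementation: the class name EagerBufferSketch and its read() and drain() methods are hypothetical, records are plain strings, and the buffer is bounded by record count rather than by bytes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for an eager, thread-backed reader; not library code. */
public class EagerBufferSketch {

    // End-of-input marker, compared by identity and never returned to callers.
    private static final Object END = new Object();

    private final BlockingQueue<Object> buffer;

    public EagerBufferSketch(Iterator<String> source, int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    buffer.put(source.next()); // waits only while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "eager-reader");
        producer.setDaemon(true);
        producer.start(); // reading begins before anyone asks for data
    }

    /** Returns the next record, or null at end of input. Waits only if the buffer is empty. */
    public String read() throws InterruptedException {
        Object item = buffer.take();
        if (item == END) {
            buffer.put(END); // leave the marker in place so later reads also return null
            return null;
        }
        return (String) item;
    }

    /** Convenience helper: reads everything through the buffer, preserving order. */
    static List<String> drain(List<String> records, int capacity) {
        EagerBufferSketch reader = new EagerBufferSketch(records.iterator(), capacity);
        List<String> seen = new ArrayList<>();
        try {
            for (String r = reader.read(); r != null; r = reader.read()) {
                seen.add(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(List.of("a", "b", "c"), 2)); // prints [a, b, c]
    }
}
```

The bounded queue is what keeps memory use predictable: the producer thread runs ahead of the consumer, but only until the buffer fills.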
Because AsyncReader can read from any other DataReader, you can place it anywhere in a pipeline you feel will benefit from parallel processing.
This example adds a second thread to ensure the filter step happens in parallel with the CSV reading and the rest of the pipeline.
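The same three-way split (reading, filtering, and the rest of the pipeline each on its own thread) can be sketched with plain bounded queues. This is an illustration of the arrangement only, not the library's code: ParallelFilterSketch, readAsync, filterAsync, and run are hypothetical names, records are strings, and an empty Optional is assumed as the end-of-input marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;

/** Hypothetical sketch of reader -> filter -> consumer, one thread per stage. */
public class ParallelFilterSketch {

    interface Task { void run() throws InterruptedException; }

    // Runs a stage on its own daemon thread.
    static void start(Task task) {
        Thread t = new Thread(() -> {
            try {
                task.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }

    // Stage 1: push every source record into a bounded queue, then the end marker.
    static BlockingQueue<Optional<String>> readAsync(List<String> source, int capacity) {
        BlockingQueue<Optional<String>> out = new ArrayBlockingQueue<>(capacity);
        start(() -> {
            for (String record : source) {
                out.put(Optional.of(record));
            }
            out.put(Optional.empty());
        });
        return out;
    }

    // Stage 2: filter on a second thread, forwarding only the records to keep.
    static BlockingQueue<Optional<String>> filterAsync(
            BlockingQueue<Optional<String>> in, Predicate<String> keep, int capacity) {
        BlockingQueue<Optional<String>> out = new ArrayBlockingQueue<>(capacity);
        start(() -> {
            for (Optional<String> r = in.take(); r.isPresent(); r = in.take()) {
                if (keep.test(r.get())) {
                    out.put(r);
                }
            }
            out.put(Optional.empty());
        });
        return out;
    }

    // Stage 3: the calling thread plays the role of the rest of the pipeline.
    static List<String> run(List<String> source, Predicate<String> keep) {
        BlockingQueue<Optional<String>> filtered = filterAsync(readAsync(source, 4), keep, 4);
        List<String> result = new ArrayList<>();
        try {
            for (Optional<String> r = filtered.take(); r.isPresent(); r = filtered.take()) {
                result.add(r.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> balances = List.of("10", "250", "7", "900");
        System.out.println(run(balances, s -> Integer.parseInt(s) >= 100)); // prints [250, 900]
    }
}
```

Each stage has a single thread and a FIFO queue, so record order is preserved; an extra thread only pays off when a stage is slow enough to outweigh the cost of handing records between threads.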