Asynchronous Reading
One approach to increase record throughput is to decouple the reading of data from the rest of the pipeline.
If you're working with a particularly slow data source, it might not make sense to stop reading from it while the rest of the pipeline executes.
For this reason, Data Pipeline provides an AsyncReader that eagerly reads data from other readers into a buffer using a separate thread.
Downstream requests for data can be satisfied directly from the buffer without blocking (waiting). Blocking only occurs if the buffer is empty when a downstream request for data is made.
This example eagerly buffers up to 1 MB of data from a CSV file in parallel to the pipeline.
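To make the buffering behaviour concrete, here is a minimal sketch of the same idea using only the JDK. It is not Data Pipeline's AsyncReader and does not use its API: the class name EagerBuffer is hypothetical, the source is a plain java.util.Iterator rather than a DataReader, and the buffer is bounded by record count rather than by bytes. It also assumes the source never yields null and omits error propagation from the background thread.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration: eagerly drains a source iterator into a bounded
// buffer on a separate thread, so the consumer only waits when the buffer is empty.
class EagerBuffer<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-data marker
    private final BlockingQueue<Object> buffer;
    private Object head;                            // next item taken from the buffer, if any

    EagerBuffer(Iterator<T> source, int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    buffer.put(source.next());      // blocks only while the buffer is full
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "eager-buffer");
        producer.setDaemon(true);                   // do not keep the JVM alive
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (head == null) {
            try {
                head = buffer.take();               // blocks only while the buffer is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                head = END;
            }
        }
        return head != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = (T) head;
        head = null;
        return value;
    }

    public static void main(String[] args) {
        // Buffer at most two records ahead of the consumer.
        Iterator<String> records = new EagerBuffer<>(List.of("a", "b", "c").iterator(), 2);
        while (records.hasNext()) {
            System.out.println(records.next());
        }
    }
}
```

The bounded queue is the design choice that matters here: it lets the reading thread run ahead of the consumer while capping how much memory the buffered records can occupy.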
Since AsyncReader can read from any other DataReader, you can place it anywhere in a pipeline you feel will benefit from parallel processing.
This example adds a second thread to ensure the filter step happens in parallel with the CSV reading and the rest of the pipeline.
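The two-thread arrangement can be sketched in plain Java as well. As before, this is an illustration under stated assumptions rather than Data Pipeline code: async, filter, and run are hypothetical helpers, records are plain integers, and each buffer is bounded by record count. One background thread drains the source, a second runs the filter, and the calling thread stands in for the rest of the pipeline.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;
import java.util.stream.StreamSupport;

class TwoStagePipeline {

    // Hypothetical helper: drains `source` on its own daemon thread into a
    // bounded queue. Optional.empty() marks the end of the data.
    static <T> Iterator<T> async(Iterator<T> source, int capacity) {
        BlockingQueue<Optional<T>> queue = new ArrayBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(Optional.of(source.next()));
                }
                queue.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();

        return new Iterator<T>() {
            private Optional<T> head;

            @Override
            public boolean hasNext() {
                if (head == null) {
                    try {
                        head = queue.take();        // waits only if the buffer is empty
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        head = Optional.empty();
                    }
                }
                return head.isPresent();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                T value = head.get();
                head = null;
                return value;
            }
        };
    }

    // Lazy filter step: pulls from `source` only when the caller asks for more.
    static <T> Iterator<T> filter(Iterator<T> source, Predicate<? super T> keep) {
        return StreamSupport
                .stream(Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED), false)
                .filter(keep)
                .iterator();
    }

    static List<Integer> run(List<Integer> records) {
        Iterator<Integer> reading = async(records.iterator(), 4);                  // thread 1: reading
        Iterator<Integer> filtering = async(filter(reading, n -> n % 2 == 0), 4);  // thread 2: filtering
        List<Integer> out = new ArrayList<>();
        while (filtering.hasNext()) {                                              // caller: rest of pipeline
            out.add(filtering.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4, 5, 6)));
    }
}
```

Because the filter is lazy, it does no work until the second background thread pulls from it, so the filtering runs on that thread rather than on the reading thread or the caller.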