Spring Batch vs Data Pipeline – ETL Job Example


I was reading a blog at Java Code Geeks on how to create a Spring Batch ETL Job.  What struck me about the example was the amount of code required by the framework for such a routine task.  In this blog, you’ll see how to accomplish the same task of summarizing a million stock trades to find the open, close, high, and low prices for each symbol using our Data Pipeline framework.

You can find the Spring Batch ETL job example that this blog is based on here:


1. Read trades from a CSV file

The first step is to load the million trade records from a CSV file.
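Here’s a minimal sketch of what this step might look like.  The file name is a placeholder, and the setFieldNamesInFirstRow(true) call is an assumption about how CSVReader is told to take column names from the header row; check the Data Pipeline documentation for the exact signature.

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;

// Sketch: open the trades CSV as a streaming reader.
// "trades.csv" is a placeholder path; setFieldNamesInFirstRow(true) is assumed
// to tell CSVReader to read the column names from the file's first row.
DataReader reader = new CSVReader(new File("trades.csv"))
    .setFieldNamesInFirstRow(true);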

Don’t worry, you’re not loading all million records into memory at once.  Data Pipeline is a streaming framework like java.io.InputStream.  Each call to DataReader.read() fetches the next record from the file and passes it on to the next step in your pipeline.  This single-piece-flow approach allows you to read huge data streams with little memory overhead.  It also allows you to handle batch and streaming data using a single API.

CSV Reader

The first thing you’ll notice that’s different from the Spring Batch example is that CSVReader knows how to read column names from the stream.  If your input file ever added new fields, like a date, or if the positions of the columns changed, this code would not need to change, unlike the Spring example.

Flexible Records

Another thing you’ll notice is that no custom Java bean is required.  The Record class built into Data Pipeline can handle tabular data (like that found in CSV or Excel files), hierarchical data (like XML and JSON), or any combination of the two.  This saves you from having to create new classes to hold the data at each step in your pipeline.  It also saves you from having to change those classes when your data stream changes.
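For illustration only, here’s roughly how you could pull records off the reader without defining a trade bean.  The getField(...).getValueAsString() accessor chain and the "stock" column name are assumptions, not taken from the original example.

import com.northconcepts.datapipeline.core.Record;

// Sketch: read records one at a time; fields are accessed by name, so no
// custom Java bean is needed. The accessor names are assumptions. In a real
// program the reader would be opened and closed, or driven by a Job as shown
// later in this post.
Record record;
while ((record = reader.read()) != null) {
    String symbol = record.getField("stock").getValueAsString();
    // work with the record's other fields by name as needed
}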

2. Convert text fields

The second step is to convert the incoming price field to a double.
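A sketch of the conversion, assuming Data Pipeline’s TransformingReader and BasicFieldTransformer classes with a stringToDouble() operation; if those names differ in your version, the idea is simply to wrap the reader in a transform that parses the price field.

import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.TransformingReader;

// Sketch: wrap the CSV reader so the text "price" field is parsed into a
// double before it reaches the summarizing step. Class and method names
// here are assumptions.
reader = new TransformingReader(reader)
    .add(new BasicFieldTransformer("price").stringToDouble());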

This step might not be needed for other data sources (like Excel or JDBC).  However, values coming from CSV files always enter the pipeline as text.

3. Summarize price using aggregate operators

The third step applies the Data Pipeline version of a SQL group-by operator to group and summarize records by stock symbol.
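A hedged sketch of this step using the first(), last(), min(), and max() operations described below.  The GroupByReader class name, its constructor, and the (new field, source field) argument order are assumptions; stock and price are placeholders for the CSV column names.

import com.northconcepts.datapipeline.group.GroupByReader;

// Sketch: group by stock symbol and compute open/high/low/close from price.
// GroupByReader and the (new field, source field) argument order are assumptions.
reader = new GroupByReader(reader, "stock")
    .first("open", "price")   // first price seen for each symbol
    .max("high", "price")     // highest price
    .min("low", "price")      // lowest price
    .last("close", "price");  // last price seen for each symbol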

Like SQL, the resulting data stream will contain a unique row for each distinct combination of values in the group-by fields passed to the constructor.  The remaining fields will contain the result of the specified group operation (like sum or count).

For your convenience, Data Pipeline provides methods like first(), last(), min(), and max() to apply common group operations.  However, you can also add your own custom group operators and apply advanced features, like sliding windows, to the stream.

You can read more about data aggregation in the user guide.

4. Write results to a new CSV file

The last step in the pipeline is to write the results to a new CSV file.
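A sketch of the write step, assuming a CSVWriter counterpart to CSVReader with the same header-row option; the output file name is a placeholder.

import java.io.FileWriter;

import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.csv.CSVWriter;

// Sketch: write the summarized records to a new CSV file, emitting the
// column names in the first row. CSVWriter and its options are assumptions.
DataWriter writer = new CSVWriter(new FileWriter("trade-summary.csv"))
    .setFieldNamesInFirstRow(true);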

Unlike the Spring Batch example, you don’t need to listen for job completion or write out the target format yourself — unless you really want to.

5. Run the pipeline job

Once your pipeline is configured by chaining together readers and writers, you can create and run your job in a single call using a convenience method on the Job class.
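Putting the steps together, a complete program might look like the following sketch.  Job.run(reader, writer) stands in for the convenience method mentioned above, the SummarizeTrades class name is arbitrary, and every class, method, and file path follows the assumptions from the earlier snippets.

import java.io.File;
import java.io.FileWriter;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.group.GroupByReader;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.TransformingReader;

public class SummarizeTrades {

    public static void main(String[] args) throws Throwable {
        // 1. Read trades from a CSV file, taking column names from the header row
        DataReader reader = new CSVReader(new File("trades.csv"))
            .setFieldNamesInFirstRow(true);

        // 2. Convert the text "price" field to a double
        reader = new TransformingReader(reader)
            .add(new BasicFieldTransformer("price").stringToDouble());

        // 3. Summarize prices per stock symbol (open, high, low, close)
        reader = new GroupByReader(reader, "stock")
            .first("open", "price")
            .max("high", "price")
            .min("low", "price")
            .last("close", "price");

        // 4. Write the results to a new CSV file
        DataWriter writer = new CSVWriter(new FileWriter("trade-summary.csv"))
            .setFieldNamesInFirstRow(true);

        // 5. Create and run the job in one call (assumed convenience method)
        Job.run(reader, writer);
    }
}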

The Job class has many hidden features, like:

  • multi-threading
  • callbacks and listeners
  • management functions: cancel, pause, resume
  • JMX monitoring
  • logging and progress monitoring

See the running jobs section of the user guide for more details.


The Spring Framework is a great tool that I use all the time, but it’s not always the right tool for the job.  In some cases, like with Spring Batch, you can end up with a lot of code just because that’s what the framework needs.  You can also end up with software that breaks when small changes are introduced, such as when a new field is added to the input file or the input fields are rearranged.

I hope this example shows you another perspective on programming batch ETL jobs in Java.  Your batch jobs don’t need to be code-heavy as long as the framework you’re using is designed with simplicity in mind.



We’re working on a web tool to make it easier for you to schedule and run your ETL jobs. Please download the early access build and give us your feedback and feature requests.


Happy Coding!



About Dele Taylor

We make Data Pipeline — a lightweight ETL framework for Java. Use it to filter, transform, and aggregate data on-the-fly in your web, mobile, and desktop apps. Learn more about it at northconcepts.com.
