ETL is a process for performing data extraction, transformation and loading. The process extracts data from a variety of sources and formats, transforms it into a standard structure, and loads it into a database, file, web service, or other system for analysis, visualization, machine learning, etc.
ETL tools come in a wide variety of shapes. Some run on your desktop or on-premise servers, while others run as SaaS in the cloud. Some are code-based, built on standard programming languages that many developers already know. Others are built on a custom DSL (domain specific language) in an attempt to be more intentional and require less code. Others still are completely graphical, only offering programming interfaces for complex transformations.
What follows is a list of ETL tools for developers already familiar with Java and the JVM (Java Virtual Machine) who want to clean, validate, filter, and prepare data for use.
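To ground the terminology before the list, here is a toy extract-transform-load pass in plain Java. The CSV input, field names, and normalization rules are invented for illustration and do not correspond to any tool below.

```java
import java.util.*;

// Toy ETL: extract rows from CSV text, transform them into a
// standard structure, and load them into an in-memory sink.
public class MiniEtl {
    public static List<Map<String, String>> run(String csv) {
        String[] lines = csv.split("\n");
        String[] header = lines[0].split(",");
        List<Map<String, String>> sink = new ArrayList<>();  // "load" target
        for (int i = 1; i < lines.length; i++) {             // extract one row
            String[] fields = lines[i].split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int j = 0; j < header.length; j++) {
                row.put(header[j].trim(), fields[j].trim()); // transform: normalize whitespace
            }
            // transform: standardize an invented "country" field
            row.computeIfPresent("country", (k, v) -> v.toUpperCase());
            sink.add(row);                                   // load
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run("name,country\nAda, uk\nLinus, fi"));
    }
}
```

Real tools add connectors, error handling, and scheduling around this same skeleton.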
1. Data Pipeline
Data Pipeline is our own tool. It's an ETL framework you plug into your software to load, process, and migrate data on the JVM.
It uses a single API, modeled after the Java I/O classes, to handle data in a variety of formats and structures. Its single-piece-flow approach to data allows it to handle huge amounts of data with minimal overhead while still being able to scale using multi-threading. This approach also allows it to process both batch and streaming data through the same pipelines.
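The single-piece-flow idea can be sketched with plain Java streams (a conceptual illustration, not Data Pipeline's actual API): each record passes through the whole pipeline before the next one enters, so memory use stays flat regardless of input size.

```java
import java.util.*;
import java.util.stream.*;

// Illustration of single-piece-flow: records move through the
// pipeline one at a time, with nothing buffered between stages.
public class PieceFlow {
    // The same pipeline works whether the Stream wraps a finite
    // batch or a long-running source, since no stage materializes
    // the whole data set.
    public static List<String> pipeline(Stream<String> records) {
        return records
            .filter(r -> !r.isBlank())      // validate: drop empty records
            .map(String::trim)              // clean
            .map(String::toLowerCase)       // standardize
            .collect(Collectors.toList());  // load into the sink
    }

    public static void main(String[] args) {
        System.out.println(pipeline(Stream.of(" Alpha ", "", "BETA")));
    }
}
```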
Data Pipeline comes in a range of versions including a free Express edition.
2. Jaspersoft ETL
The tool can combine relational and non-relational data sources. It includes a business modeler for a non-technical view of the information workflow, a job designer for displaying and editing ETL steps, and a debugger for real-time debugging.
Other features include multiple output targets such as XML, databases, and web services, plus native connectivity to ERP and CRM applications such as Salesforce.
3. Scriptella ETL
Scriptella is an open source ETL tool written in Java.
4. KETL
KETL is a production-ready ETL tool built on a multi-threaded XML architecture. It is a stand-alone package with no third-party dependencies, and it includes notification tools.
KETL's main features include Java-based data integration, security, data management, and scalability across multiple servers.
5. Apatar ETL
Apatar is an open source ETL tool written in Java.
Its feature set includes single-interface project integration, a visual job designer for non-developers, bi-directional integration, platform independence, and the ability to work with a wide range of applications and data sources such as Oracle, MS SQL, and JDBC. These features not only make it a rival to competing commercial solutions but also make it highly extensible.
6. Apache Crunch
Apache Crunch is an open source Java API that eases the process of writing, testing, and running MapReduce pipelines.
It runs on top of Hadoop MapReduce and speeds up tasks that would otherwise be very slow on plain MapReduce. Examples of these tasks include data joining and integration. Apache Crunch is specifically well-suited for data that would not fit into a relational model such as time series, serialized object formats, Avro records and HBase rows and columns.
Scala users can work with Apache Crunch through the Scrunch API, which includes a read-eval-print loop (REPL) for creating MapReduce pipelines.
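To show the kind of work Crunch's join primitives abstract away, here is a hand-rolled reduce-side join in plain Java. The tables and field names are made up, and Crunch's real API operates on distributed collections over Hadoop rather than in-memory maps.

```java
import java.util.*;

// A hand-rolled join of the kind Crunch hides behind a one-liner:
// the reducer sees values co-grouped by key and combines them.
public class JoinSketch {
    // Join users (id -> name) with orders (userId -> item) on the key.
    public static Map<String, List<String>> join(Map<String, String> users,
                                                 List<Map.Entry<String, String>> orders) {
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, String> order : orders) {
            String name = users.get(order.getKey());
            if (name == null) continue;  // inner join: drop unmatched keys
            out.computeIfAbsent(name, k -> new ArrayList<>()).add(order.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("1", "ada", "2", "grace");
        List<Map.Entry<String, String>> orders =
            List.of(Map.entry("1", "book"), Map.entry("1", "pen"), Map.entry("3", "mug"));
        System.out.println(join(users, orders)); // {ada=[book, pen]}
    }
}
```

On a cluster, the grouping itself is the expensive shuffle step that Crunch plans and optimizes for you.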
7. Cascading
Cascading is an open source Java library used for data processing.
The API gives you a wide range of capabilities for solving business problems. Examples include sorting, averaging, filtering and merging. The tool also offers a data integration API that enables you to separate data integration dependencies from business logic.
Cascading supports reading and writing from a wide range of external sources. While you can build your own schemes and taps, you are also provided with pre-built taps and schemes.
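As a rough illustration of the operations named above (merging, filtering, sorting, averaging), here is the same kind of flow written with plain Java streams rather than Cascading's actual pipes, taps, and schemes:

```java
import java.util.*;
import java.util.stream.*;

// The kinds of operations Cascading's API composes into a flow,
// sketched with java.util.stream instead of Cascading classes.
public class FlowSketch {
    // Merge two sources, filter out negative values, and average the rest.
    public static double mergeFilterAverage(List<Integer> a, List<Integer> b) {
        return Stream.concat(a.stream(), b.stream()) // merge two sources
            .filter(n -> n >= 0)                     // filter bad records
            .sorted()                                // sort
            .mapToInt(Integer::intValue)
            .average()                               // aggregate
            .orElse(0.0);
    }

    public static void main(String[] args) {
        System.out.println(mergeFilterAverage(List.of(4, -1), List.of(2, 6)));
    }
}
```

Cascading's value is running this same composition over external taps (HDFS, databases) instead of in-memory lists.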
8. Apache Oozie
Apache Oozie is a Java web application used for scheduling Apache Hadoop jobs.
It is a reliable and scalable tool that sequentially combines multiple jobs into a single logical unit of work. It comes with built-in support for various Hadoop job types such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp.
Apache Oozie also supports scheduling system-specific jobs, such as shell scripts and Java programs.
9. Datasift
Datasift is a powerful data validation and transformation framework.
It targets enterprise software development by providing developers with an extensible architecture. One of its key offerings is the ability to customize any feature to fit your requirements.
10. Talend Open Studio for Data Integration
Talend is an open source tool that offers a wide range of data integration solutions.
Its graphical user interface offers a drag-and-drop feature set that lets non-programmers execute complex integration tasks. Its large set of application connectors helps it combine data from databases, mainframes, and web services. Other features include string manipulation, automatic lookup handling, and management of changing dimensions.
11. Toolsverse ETL Framework
Toolsverse is a standalone, free ETL framework written in Java.
12. GETL ETL
GETL is a free ETL engine based on a series of libraries.
These libraries unpack, transform, and deploy data into programs written in Java, Groovy, and other JVM languages based on Java classes and objects.
GETL provides an automated data loading and transformation process. In addition, its classes have a simple hierarchy, which makes for easy connections with other solutions.
13. CloverETL
CloverETL is a commercial framework of power tools for data movement and transformation.
It supports data integration from various data sources and formats such as CSV, Excel, and JDBC. Its graphical interface makes it easier for non-developers to perform data transformation tasks and provides an intuitive view of the data connections between sources and applications.
14. IBM InfoSphere DataStage
IBM InfoSphere DataStage is a commercial, scalable platform that integrates data spread across multiple systems using a parallel framework.
It enables collection, transformation, and integration of large amounts of data of various types, including big data in the form of streams. DataStage allows you to access distributed file systems for big data and provides support for JSON and JDBC connectors.
DataStage can be deployed effectively in the cloud as well as on-premise, depending on project requirements.
15. Oracle Data Integrator
Oracle Data Integrator is a data integration tool.
It supports requirements such as high-volume, high-performance batch loads and SOA-enabled data services, as well as data movement and transformation of large data volumes. The tool's flow-based user interface makes for an intuitive integration process.
It is a free tool. An enterprise edition is also available.
16. Relational Junction ETL Manager
Relational Junction is an ETL manager which simplifies the process of moving and managing large amounts of data by removing the need for writing tedious code.
It provides a simplified SQL interface for data integration which does not require you to have prior development knowledge. Relational Junction ETL Manager comes with a web-based interface and can access various data sources such as relational databases, flat files, XML files, and SaaS applications. It can be integrated with Oracle, SQL Server, MySQL and Sybase.
17. Adeptia ETL Suite
Adeptia ETL Suite is a commercial software for aggregating, synchronizing and migrating data across systems and databases.
One of the key features of the Adeptia ETL Suite is its intuitive graphical user interface, which helps non-developers build their own data integration, mapping, and transformation rules.
Adeptia ETL Suite gives you the ability to transform between a number of different formats, such as flat file to XML, XML to CSV, database to XML, and Excel to database.
The tool comes in two different packages. One is Adeptia Connect platform which runs in the cloud, and the other is Adeptia Integration Suite which runs within the enterprise.
18. Pentaho ETL
Pentaho ETL is a free standalone application that promises to deliver accurate data from any source.
Also known as Kettle, Pentaho ETL comes with a graphical drag and drop interface which makes the creation and integration of data fairly intuitive. It comes with pre-built libraries which enable you to access and prepare data from relational sources, big data and enterprise applications.
Pentaho ETL supports Hadoop distributions, NoSQL data sources and analytic databases.
19. Elixir Data ETL
Elixir Data ETL is a commercial web-based application designed to meet data processing requirements by providing on-demand, self-service data manipulation for business users as well as enterprises.
One of its key features is its visual interface for designing, testing, and implementing data extraction, integration, and transformation. This reduces development time and proves valuable for enterprise reporting and performance measurement.
20. Syncsort DMX
Syncsort DMX is a commercial data integration ETL.
Its salient feature is an in-memory ETL engine that performs data transformations on the fly, removing the need to stage data in databases between processing steps.
Another feature of Syncsort DMX is its smart ETL optimizer, which selects the most suitable algorithms for a given job on its own.
Syncsort DMX comes with connectivity to a variety of sources, such as all major RDBMS systems, MPP appliances, flat files, XML files, HDFS, and mainframe sources.
21. Smooks
Smooks is an open source, extensible, Java-based ETL engine built on top of DOM and SAX that works with structured data.
It supports transforming data sources such as XML, CSV, EDI, and JSON into those same formats. With Smooks you can apply a transformation to just a portion of a data source or to the entire source, and you can write your own custom logic in event handlers or choose from the wide array of handlers shipped with the Smooks distribution.
With Smooks you can bind data sources such as XML, CSV, EDI, and JSON to a Java object model. It supports processing gigabytes of messages with the help of a SAX-based filter: a stream of messages can be split, transformed, and routed to different data sources and databases, and multiple data sources and databases can be used to feed these messages.
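The split-and-route idea described above can be sketched in plain Java (a conceptual illustration, not Smooks' filter API; the record format and channel names are invented): each record is examined one at a time and sent to a sink chosen by its content.

```java
import java.util.*;

// Plain-Java sketch of splitting a message stream and routing each
// record to a sink based on an invented "ORD"/other prefix rule.
public class SplitRoute {
    public static Map<String, List<String>> route(String message) {
        Map<String, List<String>> sinks = new HashMap<>();
        for (String record : message.split(";")) {  // split the stream into records
            String channel = record.startsWith("ORD") ? "orders" : "other";
            sinks.computeIfAbsent(channel, k -> new ArrayList<>()).add(record);
        }
        return sinks;
    }

    public static void main(String[] args) {
        System.out.println(route("ORD1;INV7;ORD2"));
    }
}
```

Because each record is handled independently, this style never needs the full message in memory, which is what lets Smooks scale to gigabyte inputs.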
22. Spring Batch
Spring Batch is a lightweight scalable batch processing open source tool.
Based on the POJO development approach of the Spring Framework, it is designed to let developers build robust batch applications for vital business operations. Spring Batch comes with reusable functions for processing data, such as tracing, transaction management, statistics, stop/start/restart, job skipping, and resource management. It supports a wide array of input and output sources such as files, JDBC, NoSQL stores, and JMS, to name a few.
Some of the advanced features of Spring Batch, such as its optimization and partitioning methods, can be leveraged for high-volume, high-performance batch jobs.
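Spring Batch's core model is chunk-oriented processing. The sketch below reproduces that idea in plain Java, without Spring Batch's actual ItemReader/ItemProcessor/ItemWriter interfaces, to show how items are processed one at a time but written out in fixed-size chunks:

```java
import java.util.*;
import java.util.function.*;

// Chunk-oriented processing in miniature: process items one by one,
// commit them to the writer in fixed-size chunks.
public class ChunkJob {
    public static List<List<String>> run(List<String> items,
                                         UnaryOperator<String> processor,
                                         int chunkSize) {
        List<List<String>> written = new ArrayList<>();  // stand-in for a writer
        List<String> chunk = new ArrayList<>();
        for (String item : items) {
            chunk.add(processor.apply(item));            // process one item
            if (chunk.size() == chunkSize) {             // commit a full chunk
                written.add(List.copyOf(chunk));
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) written.add(List.copyOf(chunk)); // final partial chunk
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c"), String::toUpperCase, 2));
    }
}
```

In Spring Batch each chunk commit is a transaction boundary, which is what makes restart and skip behavior possible.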
23. Easy Batch
Easy Batch is a lightweight Java framework built with the purpose of removing complexity in batch processing.
Easy Batch lets you focus on your business logic while leaving the tasks of reading, writing, filtering, parsing, and validating data to the framework. It allows you to define data validation constraints on your domain objects and takes care of the validation code itself. Development in Easy Batch is POJO-based, preserving Java's object-oriented style and letting you map data directly to your domain objects.
Processing time in Easy Batch is reduced by running multiple jobs in parallel, and job progress and execution can be monitored in real time using JMX. One of its salient features is a very small memory footprint; Easy Batch also has no dependencies.
Easy Batch can be run in two ways: embedded in an application server, or as a standalone application.
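The POJO-centric flow Easy Batch automates can be sketched by hand (this is not Easy Batch's actual API; the record layout and constraints are invented): map each delimited record to a domain object and skip records that fail validation.

```java
import java.util.*;

// Hand-written version of what Easy Batch does for you:
// parse records into POJOs and reject those violating constraints.
public class PojoBatch {
    static class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        @Override public String toString() { return name + ":" + age; }
    }

    // Returns only the valid, mapped records; invalid lines are skipped.
    public static List<Person> process(List<String> lines) {
        List<Person> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f.length != 2) continue;                    // structural validation
            int age;
            try { age = Integer.parseInt(f[1].trim()); }
            catch (NumberFormatException e) { continue; }   // type validation
            if (age < 0) continue;                          // invented domain constraint
            out.add(new Person(f[0].trim(), age));          // map to POJO
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process(List.of("Ada,36", "bad", "Bob,-1")));
    }
}
```

Easy Batch replaces the hand-written parsing and validation here with declarative constraints on the domain object.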
24. Apache Camel
Apache Camel is an open source integration framework in Java that can be used to exchange, transform and route data among applications with different protocols.
It is a lightweight, rule-based routing and mediation engine that provides Java-object-based Enterprise Integration Patterns (EIPs) through either a Java API or a domain-specific language. With Apache Camel, you can define your own routing rules, determine the sources from which to accept messages, and decide how to process and send those messages to other components of the application. Transports are handled through URIs for different messaging frameworks such as HTTP, JMS, and CXF.
Apache Camel can be distributed in a web container or as a standalone application.
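One of the Enterprise Integration Patterns Camel implements, the content-based router (its choice()/when()/otherwise() DSL), can be reduced to a small plain-Java sketch. The endpoint names below are invented for illustration.

```java
import java.util.*;
import java.util.function.*;

// A content-based router in miniature: ordered predicates decide
// which endpoint receives a message; a fallback plays the role of
// Camel's otherwise() branch.
public class Router {
    public static String route(String message,
                               LinkedHashMap<Predicate<String>, String> rules,
                               String fallback) {
        for (Map.Entry<Predicate<String>, String> rule : rules.entrySet()) {
            if (rule.getKey().test(message)) return rule.getValue(); // first match wins
        }
        return fallback;
    }

    public static void main(String[] args) {
        LinkedHashMap<Predicate<String>, String> rules = new LinkedHashMap<>();
        rules.put(m -> m.contains("<xml"), "jms:xmlQueue");   // invented endpoints
        rules.put(m -> m.startsWith("{"), "jms:jsonQueue");
        System.out.println(route("{\"id\":1}", rules, "jms:deadLetter"));
    }
}
```

Camel adds what this sketch omits: the transports behind each endpoint URI, error handling, and the mediation between protocols.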
25. Which Java ETL tool do you use?
Have we left off your favourite Java ETL tool? Please add a comment letting us know about it.