With data being produced from many sources in a variety of formats, it’s imperative for businesses to have a sane way of gaining useful insights. Data integration is the process of transforming data from one or more sources into a form that can be loaded into a target system or used for analysis and business intelligence.
1. Data Pipeline
Data Pipeline is our own tool. It’s an ETL framework you plug into your software to load, process, and migrate data on the JVM.
It uses a single API, modeled after the Java I/O classes, to handle data in a variety of formats and structures. Its single-piece-flow approach to data allows it to handle huge amounts of data with minimal overhead while still being able to scale using multi-threading. This approach also allows it to process both batch and streaming data through the same pipelines.
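The single-piece-flow idea — each record travels through every stage of the pipeline before the next record is read — can be sketched in plain Java. This is an illustration of the pattern only, not Data Pipeline's actual API; all names here are invented:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// One record at a time flows through the whole pipeline, so memory use
// stays constant regardless of input size, and the same code path works
// for a finite batch or an endless stream.
public class SinglePieceFlow {
    static <T, R> void run(Iterator<T> source, Function<T, R> transform,
                           Consumer<R> sink) {
        while (source.hasNext()) {
            sink.accept(transform.apply(source.next()));  // full trip per record
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        run(List.of("a", "b", "c").iterator(), String::toUpperCase, out::add);
        System.out.println(out);  // [A, B, C]
    }
}
```

Because the source is just an `Iterator`, the same `run` method accepts a bounded list or a never-ending stream without modification.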
Data Pipeline comes in a range of versions including a free Express edition.
2. Easy Batch
Easy Batch was built with the aim of getting rid of boilerplate code that is usually required for setting up reading, writing, filtering, parsing and validating data, logging and reporting. It allows the user to just work on the application logic and not worry about these tasks.
Working with Easy Batch is simple. Records are read from a data source, processed in the pipeline and written in batches to a data sink. Record and Batch APIs provided by the framework are used for processing records from any type of data source.
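The read → process → write-in-batches flow described above can be sketched with the standard library alone. The names below are hypothetical; Easy Batch's real API is built around its own Record and Batch types:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Records stream in one by one, are processed, and are flushed to the
// sink whenever a full batch has accumulated.
public class BatchingPipeline {
    static <T, R> void run(Iterator<T> source, Function<T, R> processor,
                           Consumer<List<R>> sink, int batchSize) {
        List<R> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(processor.apply(source.next()));
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch));  // write one full batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) sink.accept(batch);  // flush the remainder
    }
}
```

Running five records through a batch size of two yields three writes to the sink: two full batches and one partial one.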
3. Apache Camel
Apache Camel is an enterprise integration framework in Java. It can be easily embedded in a Java application with a very small number of dependencies.
It provides domain-specific languages for defining routing and mediation rules. Although Apache Camel supports a number of domain-specific languages, including Spring XML, Scala DSL and Blueprint XML, the Java-based Fluent API is recommended for defining routing and mediation rules. URIs are used to work with transport or messaging models such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF.
Apache Camel can also be easily integrated with other frameworks such as CDI, Spring, Blueprint and Guice. It also provides support for bean binding and unit testing.
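Stripped of Camel's own API, a routing rule is essentially a predicate that decides which endpoint receives a message. A minimal plain-Java sketch of that idea (not Camel code; all names are invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// A content-based router: the first predicate that matches a message
// decides which endpoint (modeled here as a Consumer) receives it.
public class ContentRouter<M> {
    private record Route<T>(Predicate<T> condition, Consumer<T> endpoint) {}

    private final List<Route<M>> routes = new ArrayList<>();

    public ContentRouter<M> when(Predicate<M> condition, Consumer<M> endpoint) {
        routes.add(new Route<>(condition, endpoint));
        return this;  // fluent style, echoing Camel's Java DSL
    }

    public void route(M message) {
        for (Route<M> r : routes) {
            if (r.condition().test(message)) {  // first match wins
                r.endpoint().accept(message);
                return;
            }
        }
    }
}
```

A route table is then built fluently, e.g. `router.when(m -> m.startsWith("order"), ordersQueue::add).when(m -> true, deadLetter::add)`, mirroring the shape of Camel's `from(...).filter(...).to(...)` chains.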
4. Apache Samza
Apache Samza is a fault-tolerant, real-time data processing framework.
It supports data from multiple sources, including Apache Kafka. It comes with a simple API that can be used with both batch and streaming data to create the application's business logic. Samza can be used as a library and can be easily embedded into Java and Scala applications.
Samza comes with host affinity and incremental checkpointing, which provide fast recovery from failures.
5. PocketETL
PocketETL is an extensible Java library that performs extract, transform and load of data between services, with pipelines defined in Java.
PocketETL is built mainly for the cloud. It comes with built-in support for AWS services such as S3, SQS and Redshift.
6. GETL
GETL is a set of libraries that automates the process of loading and transforming data.
It is based on Groovy and consists of classes and objects that can be used out of the box to unpack, transform and load data into Java or Groovy programs. It can also be used with any software that works with Java classes.
7. Scriptella
Scriptella is an open source ETL and script execution tool in Java.
The salient feature of Scriptella is its simplicity. With Scriptella, languages such as SQL can be used to perform transformations. It supports interoperability with multiple data sources, including LDAP, JDBC and XML.
Scriptella can also be integrated with Ant.
8. JSR 352
JSR 352 is the native Java specification for batch processing.
It is a simple specification that enables automation of a variety of tasks that would otherwise take a lot of time and be prone to errors. JSR 352 provides a template for batch processing, which lets application developers focus mostly on the business logic of their application.
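As a sketch of what that template looks like, a chunk-oriented step is declared in JSR 352's Job Specification Language (XML); the `ref` names below are hypothetical batch artifacts the developer would supply:

```xml
<job id="importJob" xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="1.0">
  <step id="loadOrders">
    <!-- read/process/write in chunks of 10 items per transaction -->
    <chunk item-count="10">
      <reader    ref="orderReader"/>
      <processor ref="orderProcessor"/>
      <writer    ref="orderWriter"/>
    </chunk>
  </step>
</job>
```

The runtime drives the reader, processor and writer in a loop; the developer only implements those three artifacts.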
9. Spring Batch
Spring Batch provides a framework for the development of batch applications that are usually involved in enterprise systems.
It is a lightweight framework suitable for both simple and complex jobs. It provides multiple features and services, such as reusable functions for processing large volumes of records, logging/tracing, transaction management, job processing statistics, job restart, skip and resource management. Optimization and partitioning techniques are employed for high-volume, high-performance batch jobs.
10. Univocity
Univocity is a data integration framework built to hide the complexities that usually characterize ETL, data mapping and data synchronization processes.
Univocity simplifies the data mapping process: the user just defines the mapping from source to destination, and the framework automatically manages the rest of the operations. It offers greater control over the entire data mapping process and is not reliant on built-in connectors and data transformation functions.
With Univocity, users can perform schema migration with referential integrity. Updates are detected and applied automatically using a number of data management strategies provided by Univocity.
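The declarative mapping idea — state the source-to-destination field mapping once and let the engine apply it to every record — can be sketched in plain Java. This is illustrative only, not Univocity's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A declarative field mapping: the user states which source field feeds
// which destination field, and the engine applies it to every record.
public class FieldMapper {
    private final Map<String, String> sourceToDest = new LinkedHashMap<>();

    public FieldMapper map(String sourceField, String destField) {
        sourceToDest.put(sourceField, destField);
        return this;  // fluent: chain one map(...) call per field
    }

    public Map<String, Object> apply(Map<String, Object> sourceRecord) {
        Map<String, Object> dest = new LinkedHashMap<>();
        // copy only the mapped fields, under their destination names
        sourceToDest.forEach((src, dst) -> dest.put(dst, sourceRecord.get(src)));
        return dest;
    }
}
```

Once the mapping is defined, every incoming record is transformed the same way, and unmapped source fields are simply dropped.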
11. Apache Flink
Apache Flink is a distributed processing engine and framework that can perform stateful computations over bounded and unbounded data streams.
One of the salient features of Flink is that it can be deployed on all common cluster environments, such as Hadoop YARN, Apache Mesos and Kubernetes. It can run computations at in-memory speed and is scalable.
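A stateful computation simply means the operator keeps state between records. A plain-Java sketch of a running count per key (not Flink's DataStream API; the names here are invented) shows how the same code serves bounded and unbounded input:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// A stateful stream operator: counts per key survive from one record to
// the next, and each incoming record yields the updated count for its key.
public class RunningCount {
    private final Map<String, Long> state = new HashMap<>();

    public long onRecord(String key) {
        return state.merge(key, 1L, Long::sum);  // update state, return new count
    }

    public void consume(Iterator<String> stream) {
        while (stream.hasNext()) onRecord(stream.next());
    }
}
```

Because `consume` takes an `Iterator`, a bounded list and a never-ending source are handled identically — which is the unification over bounded and unbounded streams that Flink claims.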
12. Apache Storm
Apache Storm is another real-time stream processing system. It is based on Java and runs on the JVM, and topologies can also be written in languages such as Python, Ruby and Perl.
The main selling points of the tool are its low latency, easy setup, user friendly interface, parallel processing, cluster deployability and a wide range of language support for topology formation.