One question I like to ask in interviews is: how would you speed up inserts when using JDBC?
This simple question usually shows me how knowledgeable the developer is with databases in general and JDBC specifically.
If you ever find yourself needing to insert data quickly to a SQL database (and not just being asked it in an interview), here are some options to consider.
1. Don’t use JDBC
Most databases have a native tool to handle bulk inserts. You don’t always have this option, but if your data is in a format the tool supports, you’re going to be hard pressed to find a faster option.
2. Use Prepared Statements
Prepared statements not only reduce insert time by compiling the SQL once, but they also help prevent SQL injection attacks. (I also like to ask interviewees about SQL injection attacks.)
Once a prepared statement is parsed by the database engine, the client only needs to send the data for each statement instead of the entire SQL each time.
3. Use Batch Inserts
Prepared statements are your friend when using JDBC, but batch statements are going to be your BFF when doing inserts and updates.
We have a demo in Data Pipeline that loads some hospital data into a MySQL database. Using batch statements of 100 records, insert time dropped from 3.5 minutes to under 1 second.
Two things you should know about the data: 1) The data isn’t perfect. It does have bad records and duplicate IDs causing some records to fail. 2) We need to do a little massaging to map the source and target formats.
Here’s the code from the demo.
DataWriter discardWriter = new StreamWriter(System.out);
DataReader reader = new CSVReader(new File("data/hospital.csv"))
reader = new TransformingReader(reader, discardWriter, "error_message")
.add(new RenameField("Provider Number", "hospital_id"))
.add(new RenameField("Hospital Name", "hospital_name"))
.add(new RenameField("Address 1", "address1"))
.add(new RenameField("Address 2", "address2"))
.add(new RenameField("Address 3", "address3"))
.add(new RenameField("City", "city"))
.add(new RenameField("State", "state"))
.add(new RenameField("ZIP Code", "zip_code"))
.add(new RenameField("County", "county"))
.add(new RenameField("Phone Number", "phone_number"))
.add(new RenameField("Hospital Type", "hospital_type"))
.add(new RenameField("Hospital Ownership", "hospital_ownership"))
.add(new RenameField("Emergency Services", "emergency_services"))
"decode(emergency_services, 'Yes', true, 'No', false)"))
DataWriter writer = new JdbcWriter(getConnection(), "hospital")
Job job = Job.run(reader, writer);
System.out.println("Time: " + job.getRunningTimeAsString());
If you’d like to learn more about the above code, please have a look at our Java ETL framework.
4. Use Multiple Connections
Depending on your database and configuration, you could see an improvement using more than one connection to do inserts.
We saw improvements with up to 5 connections in a test with one of the big database vendors. Going above 5 connections saw no noticeable benefit.
5. Temporarily Disable Integrity Constraints
This is one of those tips you have to be very careful with. I don’t recommend it unless you absolutely know what you’re doing, and most importantly, have a rollback plan.
Disabling integrity checks means your database engine can insert data faster because it has less work to do for each record. However, if your data isn’t perfectly clean, you can end up with corrupted tables and broken client apps. You also don’t ever want to be the guy who forgot to re-enable constraints on a production database.
What’s your trick for speeding up inserts with JDBC?
Drop me a comment if you have an insert trick to share.