Read a Patient File
Updated: Sep 16, 2023
This example shows how to utilize DataPipeline features to process several large files in a hospital scenario. A wide range of data manipulation tasks, including data filtering, column removal, renaming, and data lookup (akin to SQL joins) are implemented sequentially.
Input Files
patient-visits-raw-500000.csv
id,patient_given_name,patient_middle_initial,patient_surname,patient_gender,patient_birthday,patient_street_address,patient_city,patient_state,patient_zipcode,patient_country,patient_telephone_number,procedure_name,hospital_name,hospital_address1,hospital_city,hospital_state,hospital_zip_code,procedure_date 1,Thelma,R,Alva,female,3/11/1933,2846 Poe Road,Beaufort,SC,29902,US,843-271-0098,Laparoscopic procedures for creation of esophagogastric sphincteric competence,COLUMBUS COM HSPTL,1515 PARK AVE,COLUMBUS,WI,53925,Sun Mar 21 21:28:18 EDT 2010 2,Mary,L,Fernandez,female,4/10/1935,4348 Mercer Street,Antigo,WI,54409,US,715-623-8845,Other incision of esophagus,MERCY HOSPITAL ANDERSON,7500 STATE ROAD,CINCINNATI,OH,45255,Sun Nov 22 02:35:50 EST 2015 3,Rodney,K,Snyder,male,10/1/1981,31 Jody Road,Philadelphia,PA,19103,US,610-549-1746,"Other division of bone, unspecified site",MONTCLAIR HOSPITAL MEDICAL CENTER,5000 SAN BERNARDINO ST,MONTCLAIR,CA,91763,Sun Jul 02 18:12:34 EDT 2006
...
hospital.csv
"Provider Number","Hospital Name","Address 1","Address 2","Address 3","City","State","ZIP Code","County","Phone Number","Hospital Type","Hospital Ownership","Emergency Services"
"010001","SOUTHEAST ALABAMA MEDICAL CENTER","1108 ROSS CLARK CIRCLE","","","DOTHAN","AL","36301","HOUSTON","3347938701","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
"010005","MARSHALL MEDICAL CENTER SOUTH","2505 U S HIGHWAY 431 NORTH","","","BOAZ","AL","35957","MARSHALL","2565938310","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
"010006","ELIZA COFFEE MEMORIAL HOSPITAL","205 MARENGO STREET","","","FLORENCE","AL","35631","LAUDERDALE","2567688400","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
...
procedure.xlsx
PROCEDURE CODE LONG DESCRIPTION SHORT DESCRIPTION 0001 Therapeutic ultrasound of vessels of head and neck Ther ult head & neck ves 0002 Therapeutic ultrasound of heart Ther ultrasound of heart 0003 Therapeutic ultrasound of peripheral vascular vessels Ther ult peripheral ves
...
Java Code Listing
package com.northconcepts.datapipeline.examples.cookbook; import java.io.File; import com.northconcepts.datapipeline.core.DataReader; import com.northconcepts.datapipeline.core.DataWriter; import com.northconcepts.datapipeline.core.FieldList; import com.northconcepts.datapipeline.core.NullWriter; import com.northconcepts.datapipeline.csv.CSVReader; import com.northconcepts.datapipeline.csv.CSVWriter; import com.northconcepts.datapipeline.excel.ExcelDocument; import com.northconcepts.datapipeline.excel.ExcelReader; import com.northconcepts.datapipeline.filter.FilterExpression; import com.northconcepts.datapipeline.job.Job; import com.northconcepts.datapipeline.transform.BasicFieldTransformer; import com.northconcepts.datapipeline.transform.RemoveFields; import com.northconcepts.datapipeline.transform.RenameField; import com.northconcepts.datapipeline.transform.SelectFields; import com.northconcepts.datapipeline.transform.SetCalculatedField; import com.northconcepts.datapipeline.transform.TransformingReader; import com.northconcepts.datapipeline.transform.lookup.DataReaderLookup; import com.northconcepts.datapipeline.transform.lookup.Lookup; import com.northconcepts.datapipeline.transform.lookup.LookupTransformer; import com.northconcepts.datapipeline.validate.ValidatingReader; public class ReadPatientFile { private static final File inputFile = new File("example/data/input/patient-visits-raw-500000.csv"); private static final File outputFile = new File("example/data/output/patient-visits-"+System.currentTimeMillis()+".csv"); private static final File hospitalLookupFile = new File("example/data/input/hospital.csv"); private static final File procedureLookupFile = new File("example/data/input/procedure.xlsx"); public static void main(String[] args) { DataReader reader = new CSVReader(inputFile) .setFieldNamesInFirstRow(true); reader = new ValidatingReader(reader) .add(new FilterExpression("patient_gender == 'female' || patient_gender == 'male'")); reader = new TransformingReader(reader) .add(new RemoveFields("Id")); reader = new TransformingReader(reader) .add(new BasicFieldTransformer("patient_birthday").stringToDate("MM/dd/yyyy")) .add(new BasicFieldTransformer("procedure_date").stringToDate("EEE MMM dd HH:mm:ss z yyyy")); // reader = new FilteringReader(reader) // .add(new FilterExpression("patient_gender == 'female'")); // reader = new LimitReader(reader, 2); // Lookup hospitals Lookup hospitalLookup = new DataReaderLookup( getHospitals(), new FieldList("hospital_name", "address1", "city", "state", "zip_code"), new FieldList("hospital_id")); // reader = new AsyncReader(reader); reader = new TransformingReader(reader, new NullWriter(), "error_message") .add(new LookupTransformer(new FieldList("hospital_name", "hospital_address1", "hospital_city", "hospital_state", "hospital_zip_code"), hospitalLookup)); // Lookup procedures Lookup procedureLookup = new DataReaderLookup( getMedicalProcedures(), new FieldList("procedure_name"), new FieldList("procedure_id")); reader = new TransformingReader(reader, new NullWriter(), "error_message") .add(new LookupTransformer(new FieldList("procedure_name"), procedureLookup)); // Exclude redundant fields -- due to lookup mapping to IDs reader = new TransformingReader(reader, new NullWriter(), "error_message") .add(new RemoveFields("hospital_name", "hospital_address1", "hospital_city", "hospital_state", "hospital_zip_code")) .add(new RemoveFields("procedure_name")); // DataWriter writer = StreamWriter.newSystemOutWriter(); DataWriter writer = new CSVWriter(outputFile); Job job = Job.run(reader, writer); System.out.println(job.getRecordsTransferred() + " ==> " + job.getRunningTimeAsString()); System.out.println(reader.getRecordCount()); System.out.println(writer.getRecordCount()); } // can return any type of DataReader for lookup, including JdbdReader private static DataReader getHospitals() { DataWriter discardWriter = new NullWriter(); DataReader reader = new CSVReader(hospitalLookupFile) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader, discardWriter, "error_message") .add(new RenameField("Provider Number", "hospital_id")) .add(new BasicFieldTransformer("hospital_id").stringToLong()) .add(new RenameField("Hospital Name", "hospital_name")) .add(new RenameField("Address 1", "address1")) .add(new RenameField("Address 2", "address2")) .add(new RenameField("Address 3", "address3")) .add(new RenameField("City", "city")) .add(new RenameField("State", "state")) .add(new RenameField("ZIP Code", "zip_code")) .add(new RenameField("County", "county")) .add(new RenameField("Phone Number", "phone_number")) .add(new RenameField("Hospital Type", "hospital_type")) .add(new RenameField("Hospital Ownership", "hospital_ownership")) .add(new RenameField("Emergency Services", "emergency_services")) .add(new SetCalculatedField("emergency_services", "decode(emergency_services, 'Yes', true, 'No', false)")) ; return reader; } // can return any type of DataReader for lookup, including JdbdReader private static DataReader getMedicalProcedures() { DataWriter discardWriter = new NullWriter(); ExcelDocument document = new ExcelDocument() .open(procedureLookupFile); DataReader reader = new ExcelReader(document) .setFieldNamesInFirstRow(true); reader = new TransformingReader(reader, discardWriter, "error_message") .add(new RenameField("PROCEDURE CODE", "procedure_id")) .add(new RenameField("LONG DESCRIPTION", "procedure_name")) .add(new RenameField("SHORT DESCRIPTION", "procedure_name_short")) .add(new SelectFields("procedure_id", "procedure_name", "procedure_name_short")) ; return reader; } }
Code Walkthrough
- CSVReader is created corresponding to the input file
patient-visits-raw-500000.csv
. - ValidatingReader is used with FilterExpression instance as an argument to filter out records based on
patient_gender
column. - TransformingReader is created to apply transformations to records as they are being obtained from the reader. Using this instance,
Id
field is removed first. Then, data types ofpatient_birthday
andprocedure_date
are converted from string to Date. - DataReaderLookup object accepts three arguments:
- DataReader object to read the input file. In the example, it is DataReader instance returned from
getHospitals()
method. parameterFields
are the fields that need to be matched with another dataset. In other words, this field specifies the common fields in two datasets.resultFields
are the fields that are returned after the lookup operation. In the given example, onlyhospital_id
column is returned.
- DataReader object to read the input file. In the example, it is DataReader instance returned from
- The lookup matches five columns from the
patient-visits-raw-500000.csv
with those in thehospital.csv
. The reader contains the full data from the former along with thehospital_id
field from the latter file. - The previous two steps are repeated for
procedure.xlsx
. Records on two files are matched based onprocedure_name
column and, as a result,procedure_id
column is added to the original patient file. - Next, redundant fields are removed from the reader. In the end, only two fields
hospital_id
andprocedure_id
are kept in each record without any additional details about hospitals and procedures. - CSVWriter is created corresponding to the output file
patient-visits-1693932649170.csv
. The numbers on the file name represent the current time, so it will be different for you. - Job.run() is used to transfer the data from
reader
towriter
. See how to compile and run data pipeline jobs. - Total number of records and their processing time are printed on the console.
Output File
patient_given_name,patient_middle_initial,patient_surname,patient_gender,patient_birthday,patient_street_address,patient_city,patient_state,patient_zipcode,patient_country,patient_telephone_number,procedure_date,hospital_id,procedure_id Thelma,R,Alva,female,1933-03-11,2846 Poe Road,Beaufort,SC,29902,US,843-271-0098,2010-03-22,521338,4467 Mary,L,Fernandez,female,1935-04-10,4348 Mercer Street,Antigo,WI,54409,US,715-623-8845,2015-11-22,360001,4209 Rodney,K,Snyder,male,1981-10-01,31 Jody Road,Philadelphia,PA,19103,US,610-549-1746,2006-07-03,50758,7730
...
Console Output
486675 ==> 5 Seconds, 770 Milliseconds 486675 486675