Read a Patient File

Updated: Sep 16, 2023

This example shows how to utilize DataPipeline features to process several large files in a hospital scenario. A wide range of data manipulation tasks, including data filtering, column removal, renaming, and data lookup (akin to SQL joins) are implemented sequentially.

 

Input Files

patient-visits-raw-500000.csv

id,patient_given_name,patient_middle_initial,patient_surname,patient_gender,patient_birthday,patient_street_address,patient_city,patient_state,patient_zipcode,patient_country,patient_telephone_number,procedure_name,hospital_name,hospital_address1,hospital_city,hospital_state,hospital_zip_code,procedure_date
1,Thelma,R,Alva,female,3/11/1933,2846 Poe Road,Beaufort,SC,29902,US,843-271-0098,Laparoscopic procedures for creation of esophagogastric sphincteric competence,COLUMBUS COM HSPTL,1515 PARK AVE,COLUMBUS,WI,53925,Sun Mar 21 21:28:18 EDT 2010
2,Mary,L,Fernandez,female,4/10/1935,4348 Mercer Street,Antigo,WI,54409,US,715-623-8845,Other incision of esophagus,MERCY HOSPITAL ANDERSON,7500 STATE ROAD,CINCINNATI,OH,45255,Sun Nov 22 02:35:50 EST 2015
3,Rodney,K,Snyder,male,10/1/1981,31 Jody Road,Philadelphia,PA,19103,US,610-549-1746,"Other division of bone, unspecified site",MONTCLAIR HOSPITAL MEDICAL CENTER,5000 SAN BERNARDINO ST,MONTCLAIR,CA,91763,Sun Jul 02 18:12:34 EDT 2006
...

 

hospital.csv

"Provider Number","Hospital Name","Address 1","Address 2","Address 3","City","State","ZIP Code","County","Phone Number","Hospital Type","Hospital Ownership","Emergency Services"
"010001","SOUTHEAST ALABAMA MEDICAL CENTER","1108 ROSS CLARK CIRCLE","","","DOTHAN","AL","36301","HOUSTON","3347938701","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
"010005","MARSHALL MEDICAL CENTER SOUTH","2505 U S HIGHWAY 431 NORTH","","","BOAZ","AL","35957","MARSHALL","2565938310","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
"010006","ELIZA COFFEE MEMORIAL HOSPITAL","205 MARENGO STREET","","","FLORENCE","AL","35631","LAUDERDALE","2567688400","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
...


procedure.xlsx

PROCEDURE CODE	LONG DESCRIPTION	                                SHORT DESCRIPTION
0001	        Therapeutic ultrasound of vessels of head and neck	Ther ult head & neck ves
0002	        Therapeutic ultrasound of heart	                        Ther ultrasound of heart
0003	        Therapeutic ultrasound of peripheral vascular vessels	Ther ult peripheral ves
...

 

Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.FieldList;
import com.northconcepts.datapipeline.core.NullWriter;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.excel.ExcelDocument;
import com.northconcepts.datapipeline.excel.ExcelReader;
import com.northconcepts.datapipeline.filter.FilterExpression;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.BasicFieldTransformer;
import com.northconcepts.datapipeline.transform.RemoveFields;
import com.northconcepts.datapipeline.transform.RenameField;
import com.northconcepts.datapipeline.transform.SelectFields;
import com.northconcepts.datapipeline.transform.SetCalculatedField;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.transform.lookup.DataReaderLookup;
import com.northconcepts.datapipeline.transform.lookup.Lookup;
import com.northconcepts.datapipeline.transform.lookup.LookupTransformer;
import com.northconcepts.datapipeline.validate.ValidatingReader;

public class ReadPatientFile {

    private static final File inputFile = new File("example/data/input/patient-visits-raw-500000.csv");
    private static final File outputFile = new File("example/data/output/patient-visits-"+System.currentTimeMillis()+".csv");

    private static final File hospitalLookupFile = new File("example/data/input/hospital.csv");
    private static final File procedureLookupFile = new File("example/data/input/procedure.xlsx");

    public static void main(String[] args) {
        
        DataReader reader = new CSVReader(inputFile)
                .setFieldNamesInFirstRow(true);
        
        reader = new ValidatingReader(reader)
                .add(new FilterExpression("patient_gender == 'female' || patient_gender == 'male'"));

        reader = new TransformingReader(reader)
                .add(new RemoveFields("Id"));
        
        reader = new TransformingReader(reader)
                .add(new BasicFieldTransformer("patient_birthday").stringToDate("MM/dd/yyyy"))
                .add(new BasicFieldTransformer("procedure_date").stringToDate("EEE MMM dd HH:mm:ss z yyyy"));
        
//        reader = new FilteringReader(reader)
//                .add(new FilterExpression("patient_gender == 'female'"));
        
//        reader = new LimitReader(reader, 2);
        
        // Lookup hospitals
        Lookup hospitalLookup = new DataReaderLookup(
                getHospitals(),
                new FieldList("hospital_name", "address1", "city", "state", "zip_code"), 
                new FieldList("hospital_id"));

//        reader = new AsyncReader(reader);
        reader = new TransformingReader(reader,  new NullWriter(), "error_message")
                .add(new LookupTransformer(new FieldList("hospital_name", "hospital_address1", "hospital_city", "hospital_state", "hospital_zip_code"), hospitalLookup));

        // Lookup procedures
        Lookup procedureLookup = new DataReaderLookup(
                getMedicalProcedures(),
                new FieldList("procedure_name"), 
                new FieldList("procedure_id"));

        reader = new TransformingReader(reader,  new NullWriter(), "error_message")
                .add(new LookupTransformer(new FieldList("procedure_name"), procedureLookup));
        
        // Exclude redundant fields -- due to lookup mapping to IDs
        reader = new TransformingReader(reader,  new NullWriter(), "error_message")
                .add(new RemoveFields("hospital_name", "hospital_address1", "hospital_city", "hospital_state", "hospital_zip_code"))
                .add(new RemoveFields("procedure_name"));
        
//        DataWriter writer = StreamWriter.newSystemOutWriter();
        DataWriter writer = new CSVWriter(outputFile);

        Job job = Job.run(reader, writer);
        
        System.out.println(job.getRecordsTransferred() + " ==> " + job.getRunningTimeAsString());
        System.out.println(reader.getRecordCount());
        System.out.println(writer.getRecordCount());
    }
    
    // can return any type of DataReader for lookup, including JdbdReader
    private static DataReader getHospitals() {
        DataWriter discardWriter = new NullWriter();
        
        DataReader reader = new CSVReader(hospitalLookupFile)
                .setFieldNamesInFirstRow(true);
        
        reader = new TransformingReader(reader, discardWriter, "error_message")
                .add(new RenameField("Provider Number", "hospital_id"))
                .add(new BasicFieldTransformer("hospital_id").stringToLong())
                .add(new RenameField("Hospital Name", "hospital_name"))
                .add(new RenameField("Address 1", "address1"))
                .add(new RenameField("Address 2", "address2"))
                .add(new RenameField("Address 3", "address3"))
                .add(new RenameField("City", "city"))
                .add(new RenameField("State", "state"))
                .add(new RenameField("ZIP Code", "zip_code"))
                .add(new RenameField("County", "county"))
                .add(new RenameField("Phone Number", "phone_number"))
                .add(new RenameField("Hospital Type", "hospital_type"))
                .add(new RenameField("Hospital Ownership", "hospital_ownership"))
                .add(new RenameField("Emergency Services", "emergency_services"))
                .add(new SetCalculatedField("emergency_services", "decode(emergency_services, 'Yes', true, 'No', false)"))
                ;
        
        return reader;
    }
    
    // can return any type of DataReader for lookup, including JdbdReader
    private static DataReader getMedicalProcedures() {
        DataWriter discardWriter = new NullWriter();
        
        ExcelDocument document = new ExcelDocument()
                .open(procedureLookupFile);
        
        DataReader reader = new ExcelReader(document)
                .setFieldNamesInFirstRow(true);

        reader = new TransformingReader(reader, discardWriter, "error_message")
                .add(new RenameField("PROCEDURE CODE", "procedure_id"))
                .add(new RenameField("LONG DESCRIPTION", "procedure_name"))
                .add(new RenameField("SHORT DESCRIPTION", "procedure_name_short"))
                .add(new SelectFields("procedure_id", "procedure_name", "procedure_name_short"))
                ;
        
        return reader;
    }
    
}

 

Code Walkthrough

  1. CSVReader is created corresponding to the input file patient-visits-raw-500000.csv.
  2. ValidatingReader is used with FilterExpression instance as an argument to filter out records based on patient_gender column.
  3. TransformingReader is created to apply transformations to records as they are being obtained from the reader. Using this instance, Id field is removed first. Then, data types of patient_birthday and procedure_date are converted from string to Date.
  4. DataReaderLookup object accepts three arguments:
    • DataReader object to read the input file. In the example, it is DataReader instance returned from getHospitals() method.
    • parameterFields are the fields that need to be matched with another dataset. In other words, this field specifies the common fields in two datasets.
    • resultFields are the fields that are returned after the lookup operation. In the given example, only hospital_id column is returned.
  5. The lookup matches five columns from the patient-visits-raw-500000.csv with those in the hospital.csv. The reader contains the full data from the former along with the hospital_id field from the latter file. 
  6. The previous two steps are repeated for procedure.xlsx. Records on two files are matched based on procedure_name column and, as a result, procedure_id column is added to the original patient file.
  7. Next, redundant fields are removed from the reader. In the end, only two fields hospital_id and procedure_id are kept in each record without any additional details about hospitals and procedures.
  8. CSVWriter is created corresponding to the output file patient-visits-1693932649170.csv. The numbers on the file name represent the current time, so it will be different for you.
  9. Job.run() is used to transfer the data from reader to writer. See how to compile and run data pipeline jobs.
  10. Total number of records and their processing time are printed on the console.

 

Output File

patient_given_name,patient_middle_initial,patient_surname,patient_gender,patient_birthday,patient_street_address,patient_city,patient_state,patient_zipcode,patient_country,patient_telephone_number,procedure_date,hospital_id,procedure_id
Thelma,R,Alva,female,1933-03-11,2846 Poe Road,Beaufort,SC,29902,US,843-271-0098,2010-03-22,521338,4467
Mary,L,Fernandez,female,1935-04-10,4348 Mercer Street,Antigo,WI,54409,US,715-623-8845,2015-11-22,360001,4209
Rodney,K,Snyder,male,1981-10-01,31 Jody Road,Philadelphia,PA,19103,US,610-549-1746,2006-07-03,50758,7730
...

 

Console Output

486675 ==> 5 Seconds, 770 Milliseconds
486675
486675
Mobile Analytics