Lookup in a Flat File

Updated: Jul 5, 2023

This example performs data lookups in flat files, allowing users to retrieve information from multiple files and combine them based on one or more common fields.

This is useful in cases where data is spread across different files or systems, enabling users to integrate and consolidate the data for further analysis or processing.

Input CSV files

dbip-country-lite-2018-10.csv

0.0.0.0,0.255.255.255,ZZ
1.0.0.0,1.0.0.255,AU
1.0.1.0,1.0.3.255,CN
1.0.4.0,1.0.7.255,AU
1.0.8.0,1.0.15.255,CN
...

countries.csv

"Country (en)";"Country (de)";"Country (local)";"Country code";"Continent";"Capital";"Population";"Area";"Coastline";"Government form";"Currency";"Currency code";"Dialing prefix";"Birthrate";"Deathrate";"Life expectancy";"Url"
"Afghanistan";"Afghanistan";"Afganistan/Afqanestan";"AF";"Asia";;33332025;652230;0;"Presidential islamic republic";"Afghani";"AFN";93;38.3;13.7;51.3;"https://www.laenderdaten.info/Asien/Afghanistan/index.php"
"Egypt";"Ägypten";"Misr";"EG";"Africa";;94666993;1001450;2450;"Presidential republic";"Pfund";"EGP";20;30.3;4.7;72.7;"https://www.laenderdaten.info/Afrika/Aegypten/index.php"
"Åland Islands";"Ålandinseln";"Åland";"AX";"Europe";;29013;1580;0;"Autonomous region of Finland";"Euro";"EUR";358;0;0;0;"https://www.laenderdaten.info/Europa/Aland/index.php"
"Albania";"Albanien";"Shqipëria";"AL";"Europe";;3038594;28748;362;"parliamentary republic";"Lek";"ALL";355;13.1;6.7;78.3;"https://www.laenderdaten.info/Europa/Albanien/index.php"
"Algeria";"Algerien";"Al-Jaza’ir/Algérie";"DZ";"Africa";;40263711;2381741;998;"Presidential republic";"Dinar";"DZD";213;23;4.3;76.8;"https://www.laenderdaten.info/Afrika/Algerien/index.php"
...

Java Code Listing

package com.northconcepts.datapipeline.examples.lookups;

import java.io.File;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.FieldList;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.csv.CSVWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.transform.TransformingReader;
import com.northconcepts.datapipeline.transform.lookup.DataReaderLookup;
import com.northconcepts.datapipeline.transform.lookup.Lookup;
import com.northconcepts.datapipeline.transform.lookup.LookupTransformer;

public class LookupDataInFlatFile {

    private static final String INPUT_FOLDER = "data/input";
    private static final String OUTPUT_FOLDER = "data/output";

    public static void main(String[] args) {
        
        DataReader reader = new CSVReader(new File(INPUT_FOLDER, "dbip-country-lite-2018-10.csv"))
                .setFieldNamesInFirstRow(false)
                .setFieldNames("ip_start", "ip_end", "country");
        
        Lookup lookupCountries = new DataReaderLookup(
                new CSVReader(new File(INPUT_FOLDER, "countries.csv")).setFieldSeparator(';').setFieldNamesInFirstRow(true),
                new FieldList("Country code"));        
        
        reader = new TransformingReader(reader)
                .add(new LookupTransformer(new FieldList("country"), lookupCountries).setAllowNoResults(false))
                .setExceptionOnFailure(false);
        
        DataWriter writer = new CSVWriter(new File(OUTPUT_FOLDER, "ip-countries.csv"))
                .setFieldNamesInFirstRow(true);
        
        Job.run(reader, writer);
    }

}

Code walkthrough

  1. CSVReader is created corresponding to the input file dbip-country-lite-2018-10.csv. The first row in the file includes 0.0.0.0, 0.255.255.255 and ZZ
  2. CSVReader.setFieldNamesInFirstRow(false) and setFieldNames("ip_start", "ip_end", "country") methods are used to specify the field names in the first row of the input dataset.
  3. DataReaderLookup object accepts two arguments:
    • DataReader object to read the input file. In the example, it is CSVReader class that gets data from countries.csv
    • parameterFields are the fields that need to be matched with another dataset. In other words, this field specifies the common field in two datasets. In the example for this argument, "Country code" is wrapped in the FieldList object. 
  4. TransformingReader is a proxy that applies transformations to records passing through. 
  5. The lookup matches "dbip-country-lite-2018-10.csv".[country] with "countries.csv".[Country code]. The reader contains the full data in two files merged with the common field.   
  6. Data is transferred from the reader to the CSVWriter via Job.run() method. 

Output

ip_start,ip_end,country,Country (en),Country (de),Country (local),Country code,Continent,Capital,Population,Area,Coastline,Government form,Currency,Currency code,Dialing prefix,Birthrate,Deathrate,Life expectancy,Url
1.0.0.0,1.0.0.255,AU,Australia,Australien,Australia,AU,Australia,,22992654,7741220,25760,Parliamentary democracy (under constitutional monarchy),Dollar,AUD,61,12.1,7.2,82.2,https://www.laenderdaten.info/Australien/Australien/index.php
1.0.1.0,1.0.3.255,CN,China,China,Zhongquo,CN,Asia,,1373541278,9596960,14500,People's republic (communist 	one-party system),Yuan,CNY,86,12.4,7.7,75.5,https://www.laenderdaten.info/Asien/China/index.php
1.0.4.0,1.0.7.255,AU,Australia,Australien,Australia,AU,Australia,,22992654,7741220,25760,Parliamentary democracy (under constitutional monarchy),Dollar,AUD,61,12.1,7.2,82.2,https://www.laenderdaten.info/Australien/Australien/index.php
1.0.8.0,1.0.15.255,CN,China,China,Zhongquo,CN,Asia,,1373541278,9596960,14500,People's republic (communist 	one-party system),Yuan,CNY,86,12.4,7.7,75.5,https://www.laenderdaten.info/Asien/China/index.php
1.0.16.0,1.0.31.255,JP,Japan,Japan,Nihon/Nippon,JP,Asia,,126702133,377915,29751,Parliamentary constitutional monarchy,Yen,JPY,81,7.8,9.6,85,https://www.laenderdaten.info/Asien/Japan/index.php
....
Mobile Analytics