How To Read An XML File With A Byte Order Mark (BOM)

Updated: Jul 30, 2024
XML

This example shows how to read an XML file that begins with a Byte Order Mark (BOM).

The Byte Order Mark (BOM) is a short sequence of bytes at the very start of a text file that signals the file's Unicode encoding and, for multi-byte encodings, its byte order. A BOM is optional: many files omit it, and programs that do not expect one may treat its bytes as ordinary content, which can cause parsing errors.

For example, in UTF-8 the BOM is the three bytes EF BB BF. In UTF-16, it is FF FE for little-endian and FE FF for big-endian.
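The byte patterns above can be checked directly against the first few bytes of a file. The following is a minimal sketch using only the JDK; the class name `BomSniffer` and the method `detectBom` are hypothetical names chosen for illustration and are not part of DataPipeline or Apache Commons IO. It also recognizes the UTF-32 marks (00 00 FE FF and FF FE 00 00), which the listing further down handles as well.

```java
public class BomSniffer {

    // Returns the encoding implied by a leading BOM, or null if no known BOM is present.
    // (Hypothetical helper for illustration only.)
    public static String detectBom(byte[] head) {
        // Check the 4-byte UTF-32 marks first: the UTF-32LE mark (FF FE 00 00)
        // starts with the same two bytes as the UTF-16LE mark (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the leading bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, '<', 0x00};
        byte[] noBom = {'<', 'a', '/', '>'};

        System.out.println(detectBom(utf8));    // UTF-8
        System.out.println(detectBom(utf16le)); // UTF-16LE
        System.out.println(detectBom(noBom));   // null
    }
}
```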

When an XML file that starts with a BOM is read without removing the BOM first, the read can fail with the following exception message:

Message: Content is not allowed in prolog
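One common way this error arises is that the BOM is decoded into the character U+FEFF and handed to the parser as if it were document content appearing before the XML declaration. The sketch below illustrates that situation with the JDK's built-in DOM parser, which is an assumption made for illustration; it does not use DataPipeline's XmlReader, and the class name `BomPrologDemo` and method `failsToParse` are hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class BomPrologDemo {

    // Returns true if the JDK parser rejects the given character data.
    // (Hypothetical helper for illustration only.)
    public static boolean failsToParse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            // Parsing from a Reader bypasses the parser's own byte-level BOM handling.
            builder.parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXParseException e) {
            System.out.println("Parse failed: " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><book><title lang=\"eng\">Learning XML</title></book>";

        // Without a BOM character the document is expected to parse.
        System.out.println("plain rejected:    " + failsToParse(xml));

        // With U+FEFF in front, the parser sees a character before the prolog.
        System.out.println("with BOM rejected: " + failsToParse("\uFEFF" + xml));
    }
}
```

The exact wording of the message depends on the parser and locale, so the sketch reports only whether parsing failed.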

To avoid this error, strip the BOM before the data reaches the XML reader, as in the following approach:

Java Code Listing

package com.northconcepts.datapipeline.examples.cookbook;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.core.DataWriter;
import com.northconcepts.datapipeline.core.StreamWriter;
import com.northconcepts.datapipeline.job.Job;
import com.northconcepts.datapipeline.xml.XmlReader;

public class ReadAnXmlFileWithByteOrderMark {
    
    public static void main(String[] args) throws Throwable {
        // Wrap the file stream so that a leading BOM is detected and removed.
        BOMInputStream bomInputStream = BOMInputStream.builder()
                .setInputStream(new FileInputStream(new File("example/data/input/xml_with_byte_order_marks.xml")))
                // The BOMs to look for at the start of the stream.
                .setByteOrderMarks(
                        ByteOrderMark.UTF_8,
                        ByteOrderMark.UTF_16BE,
                        ByteOrderMark.UTF_16LE,
                        ByteOrderMark.UTF_32BE,
                        ByteOrderMark.UTF_32LE
                        )
                // false = exclude a detected BOM from the bytes that are read.
                .setInclude(false)
                .get();

        // Note: this InputStreamReader decodes with the platform default charset.
        DataReader reader = new XmlReader(new InputStreamReader(bomInputStream))
                .addField("title", "//book/title/text()")
                .addField("language", "//book/title/@lang")
                .addField("price", "//book/price/text()")
                .addRecordBreak("//book");

        // Write each record to the console.
        DataWriter writer = StreamWriter.newSystemOutWriter();
        Job.run(reader, writer);
    }

}

Code Walkthrough

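  1. Wrap the file's FileInputStream in a BOMInputStream using BOMInputStream.builder().
  2. List the Byte Order Marks (BOMs) to detect: UTF-8, plus UTF-16 and UTF-32 in both byte orders.
  3. Set include to false so that a detected BOM is excluded from the data that is read.
  4. Create an XmlReader over the stream and define its fields and record break with XPath expressions.
  5. Job.run() transfers the records from the XmlReader to the console through a StreamWriter. See how to compile and run data pipeline jobs.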
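Where Apache Commons IO is not available, the idea behind steps 1 to 3 can be approximated with the JDK alone. The following is a minimal sketch under stated assumptions: it handles only the UTF-8 BOM, it is a different technique from the BOMInputStream used in the listing, and the class name `Utf8BomSkipper` and method `skipUtf8Bom` are hypothetical. The `main` method uses `readAllBytes()`, which requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {

    // Wraps the stream so that a leading UTF-8 BOM (EF BB BF), if present, is consumed.
    // Any other leading bytes are pushed back and remain readable.
    // (Hypothetical helper for illustration only.)
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer than requested.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: return the bytes to the stream so that nothing is lost.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Encoding U+FEFF as UTF-8 yields the bytes EF BB BF.
        byte[] data = "\uFEFF<book/>".getBytes(StandardCharsets.UTF_8);

        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);

        System.out.println(text); // <book/>
    }
}
```

The returned stream could then be passed to an InputStreamReader with an explicit UTF-8 charset in place of the BOMInputStream.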

Console Output

-----------------------------------------------
0 - Record (MODIFIED) {
    0:[title]:STRING=[Harry Potter]:String
    1:[language]:STRING=[eng]:String
    2:[price]:STRING=[29.99]:String
}

-----------------------------------------------
1 - Record (MODIFIED) {
    0:[title]:STRING=[Learning XML]:String
    1:[language]:STRING=[eng]:String
    2:[price]:STRING=[39.95]:String
}

-----------------------------------------------
2 records