Spark: Read XML files using XmlInputFormat

Sometimes we are given a huge XML file that contains many smaller XML records, and we need to extract each of them for further processing. We cannot parse such files with TextInputFormat, because it treats every line as a record, whereas in our example XML a record is everything between <Rec> and </Rec>, which can span several lines.
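To make the difference concrete, here is a small plain-Java sketch (no Spark or Hadoop involved) that contrasts line-based splitting with tag-based splitting. The sample XML, the class name RecordSplitDemo and the method extractRecords are made up for illustration. This is not the XmlInputFormat implementation, only the same idea of delimiting records by a start tag and an end tag.

```java
import java.util.ArrayList;
import java.util.List;

class RecordSplitDemo {

    // Collects every substring that begins with startTag and ends with endTag.
    static List<String> extractRecords(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no more records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: ignore the tail
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two <Rec> records, each spread over three lines.
        String xml = "<Recs>\n<Rec>\n  <id>1</id>\n</Rec>\n<Rec>\n  <id>2</id>\n</Rec>\n</Recs>";

        // A line-based reader sees 8 "records", none of them a complete <Rec>.
        System.out.println("lines: " + xml.split("\n").length);

        // A tag-based reader sees the 2 records we actually care about.
        System.out.println("records: " + extractRecords(xml, "<Rec>", "</Rec>").size());
    }
}
```

Splitting by line breaks a record into fragments, while splitting by tags returns each <Rec>…</Rec> block whole, which is what we want each RDD element to be.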

This is where XmlInputFormat comes in handy. You need to include XmlInputFormat.java in your project; it can be found here.
We will use newAPIHadoopRDD() to read the XML file.
We need to pass a Configuration object as one of the parameters to newAPIHadoopRDD(). In the Configuration object we set the start and end tag parameters, which delimit each record. In our XML example the start tag is <Rec> and the end tag is </Rec>. We also need to set the input location of the files.
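A minimal sketch of these steps in Java could look like the following. It rests on several assumptions: the XmlInputFormat you added is the Mahout-style class that extends TextInputFormat and reads its tags from the keys xmlinput.start and xmlinput.end (check the constants in your copy of XmlInputFormat.java, as other versions may use different keys), it sits in the same package as this class, the input path arrives as the first program argument, and Spark and Hadoop are on the classpath. The class name ReadXmlRecords is made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadXmlRecords {
    public static void main(String[] args) {
        // The master URL is expected to come from spark-submit.
        SparkConf sparkConf = new SparkConf().setAppName("ReadXmlRecords");

        try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
            Configuration conf = new Configuration();
            // Assumed key names (Mahout-style XmlInputFormat); verify against your copy.
            conf.set("xmlinput.start", "<Rec>");
            conf.set("xmlinput.end", "</Rec>");
            // Input location for the new-API FileInputFormat.
            conf.set("mapreduce.input.fileinputformat.inputdir", args[0]);

            // Key is the byte offset of the record, value is the <Rec>...</Rec> text.
            JavaPairRDD<LongWritable, Text> records =
                sc.newAPIHadoopRDD(conf, XmlInputFormat.class, LongWritable.class, Text.class);

            // Convert to String right away, since Hadoop reuses Text objects.
            JavaRDD<String> xmlRecords = records.map(pair -> pair._2().toString());

            xmlRecords.take(5).forEach(System.out::println);
        }
    }
}
```

Each element of xmlRecords is then one complete <Rec>…</Rec> string that can be handed to any XML parser for further processing.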
