SPARK: Java code to Read files with Custom Record Delimiter

By default, Spark reads text files with the newline ('\n') character as the record delimiter. But there can be cases where the record delimiter is some other character, for example CTRL+A ('\001') or a pipe ("|") character. So how can we read such files?
We can set the textinputformat.record.delimiter parameter in a Hadoop Configuration object and then read the file with the newAPIHadoopFile() API, passing that Configuration object as an argument.
Below is the Java code for the same.
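
A minimal sketch of this approach could look like the following. It assumes the Spark Java API and Hadoop's new-API TextInputFormat (org.apache.hadoop.mapreduce.lib.input) are on the classpath. The class name CustomDelimiterRead, the local[*] master and taking the input path from args[0] are illustrative choices, not requirements.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical class name, used only for this illustration.
public class CustomDelimiterRead {
    public static void main(String[] args) {
        // Assumption: local mode for a quick test; drop setMaster() when using spark-submit.
        SparkConf sparkConf = new SparkConf()
                .setAppName("CustomDelimiterRead")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
            // Copy Spark's Hadoop configuration and override the record delimiter.
            Configuration conf = new Configuration(sc.hadoopConfiguration());
            conf.set("textinputformat.record.delimiter", "|"); // use "\001" for CTRL+A

            // Keys are byte offsets (LongWritable), values are the records (Text).
            // Hadoop reuses Text instances, so convert to String before collecting.
            JavaRDD<String> records = sc
                    .newAPIHadoopFile(args[0], TextInputFormat.class,
                            LongWritable.class, Text.class, conf)
                    .values()
                    .map(Text::toString);

            // Brackets make embedded newlines inside a record easy to spot.
            for (String record : records.collect()) {
                System.out.println("[" + record + "]");
            }
        }
    }
}
```

Converting each Text value to a String right away matters because the input format reuses the same Text object between records, so holding on to the raw values can yield repeated or corrupted data.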

Sample Input:

Sample Output: As we can see, records are split on the pipe ("|") character, and newline characters are retained as part of the records.
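
To make that behaviour concrete without running a cluster, here is a small plain-Java sketch that mimics the splitting on an in-memory string. It is only an approximation of what the input format does, not Spark or Hadoop code. The class RecordSplitDemo, the helper splitRecords and the sample content are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; approximates record splitting on a custom delimiter.
final class RecordSplitDemo {

    // Splits content on a literal delimiter; every other character, including '\n', stays in the record.
    static List<String> splitRecords(String content, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = content.indexOf(delimiter, start)) >= 0) {
            records.add(content.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Keep whatever follows the last delimiter, if anything.
        if (start < content.length()) {
            records.add(content.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up content: three pipe-delimited records, two of which span two lines.
        String content = "1,John\nSmith|2,Jane\nDoe|3,Bob";
        for (String record : splitRecords(content, "|")) {
            // Show newlines as \n so each record prints on a single line.
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

With the made-up content above, the split yields three records, and the first two still contain their embedded newline.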
