SPARK: Java code to Read files with Custom Record Delimiter

By default, Spark reads text files using the newline ('\n') character as the record delimiter. But there are cases where the record delimiter is some other character, for example CTRL+A ('\001') or a pipe ('|') character. So how can we read such files?
We can set the textinputformat.record.delimiter parameter on a Hadoop Configuration object and then read the file using the newAPIHadoopFile() API, passing in that Configuration.
Below is the Java code for this.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.conf.Configuration;


public class CustomDelimiterReader {

    public static void main(String[] args) throws Exception {

        SparkConf conf = new SparkConf().setAppName("Example");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        Configuration hadoopConf = new Configuration();

        // pipe character | is the record separator
        hadoopConf.set("textinputformat.record.delimiter", "|");

        // newAPIHadoopFile returns (LongWritable, Text) pairs;
        // keep only the values and convert each Text record to a String
        JavaRDD<String> rdd = jsc
                .newAPIHadoopFile("/home/myhome/1.txt", TextInputFormat.class,
                        LongWritable.class, Text.class, hadoopConf)
                .values()
                .map(new Function<Text, String>() {

                    @Override
                    public String call(Text record) throws Exception {
                        return record.toString();
                    }
                });

        // print each record on the driver/executors
        rdd.foreach(new VoidFunction<String>() {

            @Override
            public void call(String record) throws Exception {
                System.out.println("Record==>" + record);
            }
        });

        jsc.stop();
    }

}

Sample Input:

word1
word2|word3
word4

Sample Output: As we can see, the records are now delimited by the pipe ('|') character, and the newline characters are retained as part of each record.

Record==>word1
word2
Record==>word3
word4
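
The same approach works for non-printable delimiters such as the CTRL+A ('\001') character mentioned earlier. Below is a minimal sketch of the same technique written in the Java 8 lambda style, assuming a Spark build whose Java API accepts lambdas; the class name and input path are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CtrlADelimiterExample {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CtrlADelimiterExample");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        Configuration hadoopConf = new Configuration();
        // CTRL+A ('\001') is the record delimiter this time
        hadoopConf.set("textinputformat.record.delimiter", "\001");

        // newAPIHadoopFile returns (LongWritable, Text) pairs; keep only the
        // values and convert each Text record to a String
        JavaRDD<String> records = jsc
                .newAPIHadoopFile("/home/myhome/1.txt", TextInputFormat.class,
                        LongWritable.class, Text.class, hadoopConf)
                .values()
                .map(Text::toString);

        // as before, newlines inside a record are preserved; only '\001' splits records
        records.foreach(record -> System.out.println("Record==>" + record));

        jsc.stop();
    }
}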
