By default, Spark reads text files using the newline character ('\n') as the record delimiter. But there are cases where the record delimiter is some other character, for example CTRL+A ('\001') or a pipe ('|'). So how can we read such files?
We can set the textinputformat.record.delimiter parameter in a Hadoop Configuration object and then read the file with the newAPIHadoopFile() API, passing in that Configuration object.
Below is the Java code for this.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

public class Junk {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("Example");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        Configuration hadoopConf = new Configuration();
        // pipe character | is the record separator
        hadoopConf.set("textinputformat.record.delimiter", "|");

        JavaRDD<String> rdd = jsc
                .newAPIHadoopFile("/home/myhome/1.txt", TextInputFormat.class,
                        LongWritable.class, Text.class, hadoopConf)
                .values()
                .map(new Function<Text, String>() {
                    @Override
                    public String call(Text arg0) throws Exception {
                        return arg0.toString();
                    }
                });

        rdd.foreach(new VoidFunction<String>() {
            @Override
            public void call(String record) throws Exception {
                System.out.println("Record==>" + record);
            }
        });
    }
}
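The same pattern works for the CTRL+A delimiter mentioned earlier. Below is a minimal sketch of that case, written with Java 8 lambdas instead of anonymous classes; the class name and the input path /home/myhome/2.txt are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CtrlADelimiterExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CtrlADelimiterExample");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        Configuration hadoopConf = new Configuration();
        // CTRL+A (unicode \u0001) is the record separator
        hadoopConf.set("textinputformat.record.delimiter", "\u0001");

        // same newAPIHadoopFile() approach as above, using lambda syntax
        JavaRDD<String> rdd = jsc
                .newAPIHadoopFile("/home/myhome/2.txt", TextInputFormat.class,
                        LongWritable.class, Text.class, hadoopConf)
                .values()
                .map(Text::toString);

        rdd.foreach(record -> System.out.println("Record==>" + record));

        jsc.stop();
    }
}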
Sample Input:
word1 word2|word3 word4
Sample Output: As we can see, the records are split on the pipe ('|') character, and any newline characters are retained as part of the records.
Record==>word1 word2
Record==>word3 word4
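Since newlines are kept inside each record, you may want to strip that whitespace off afterwards. A small sketch, assuming the rdd variable built in the code above:

// trim retained newline/whitespace characters from each record
JavaRDD<String> trimmed = rdd.map(new Function<String, String>() {
    @Override
    public String call(String record) throws Exception {
        return record.trim();
    }
});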