Spark: How to Get the File Name for a Record of an RDD

There are occasions when we also need to know the name of the file we are processing. We can use wholeTextFiles(), which returns a pair RDD of file name and file contents. However, wholeTextFiles() puts the whole contents of a file into a single record. If we are reading a 2 GB file, all 2 GB of data ends up in the RDD as one record, which can hurt performance and defeats the purpose of distributed processing. To avoid this, we can read the data split by split, as in the approach below.
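One way to read split by split is sketched below in Scala. It is a minimal sketch, not necessarily the exact code this post originally used. It assumes plain text input read through Hadoop's new-API TextInputFormat, and it assumes the RDD returned by newAPIHadoopFile is a NewHadoopRDD so that the developer-API method mapPartitionsWithInputSplit is available. The object name FileNamePerRecord and the default path "data/input" are placeholders chosen for illustration.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{NewHadoopRDD, RDD}

object FileNamePerRecord {
  def main(args: Array[String]): Unit = {
    // Placeholder input location; pass a real file or directory as the first argument.
    val inputPath = if (args.nonEmpty) args(0) else "data/input"

    val conf = new SparkConf()
      .setAppName("FileNamePerRecord")
      .setIfMissing("spark.master", "local[*]") // run locally unless a master is already set
    val sc = new SparkContext(conf)

    // Read through the new Hadoop API so that every partition is backed by an InputSplit.
    val lines = sc.newAPIHadoopFile(
      inputPath,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Assumption: the RDD returned above is a NewHadoopRDD, which exposes the split.
    val withFileName: RDD[(String, String)] =
      lines.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
        .mapPartitionsWithInputSplit { (split: InputSplit, records: Iterator[(LongWritable, Text)]) =>
          // For file-based input formats the split is a FileSplit that knows its path.
          val fileName = split.asInstanceOf[FileSplit].getPath.toString
          // Hadoop reuses Text objects, so copy each line into a String right away.
          records.map { case (_, line) => (fileName, line.toString) }
        }

    // Print a few (file name, line) pairs.
    withFileName.take(10).foreach { case (file, line) => println(s"$file\t$line") }

    sc.stop()
  }
}
```

Each record here is a single line paired with the path of the file it came from, so a large file is still spread across many records and partitions instead of being loaded as one value.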

Example input file:

