I have come across requirements wherein I am supposed to generate output in a nested JSON format. Below is a sample code which helps to do the same. The input to this code is a CSV file which contains 3 columns: company name, employee name, department. Example: google,jessica,sales google,sita,technology We… Read more »
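The grouping step behind such a job can be sketched in plain Python (no Spark) as follows; the row tuples and the output field names (`company`, `employees`) are assumptions for illustration, matching the sample CSV above.

```python
import json
from collections import defaultdict

# Hypothetical rows in the post's <company,employee,department> layout.
rows = [
    ("google", "jessica", "sales"),
    ("google", "sita", "technology"),
]

def to_nested_json(rows):
    # Group employees under each company to produce the nested structure.
    grouped = defaultdict(list)
    for company, employee, department in rows:
        grouped[company].append({"name": employee, "department": department})
    return json.dumps([{"company": c, "employees": e} for c, e in grouped.items()])

print(to_nested_json(rows))
```

In Spark the same grouping would typically be done with `groupByKey`/`groupBy` before serializing each group to JSON.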
There would be occasions wherein we would also need to know the name of the file that we are processing. We can use wholeTextFiles(), which returns a pair RDD of (filename, fileContents). wholeTextFiles() puts the whole contents of the file into a single record, which means if we are… Read more »
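A plain-Python sketch of what `wholeTextFiles()` yields — one (path, contents) pair per file rather than one record per line; the directory and file names here are throwaway test fixtures, not from the post.

```python
import os
import tempfile

def whole_text_files(directory):
    # Analogue of SparkContext.wholeTextFiles(): one (path, contents) pair per file.
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as f:
            pairs.append((path, f.read()))
    return pairs

# Usage with two throwaway files.
d = tempfile.mkdtemp()
for name, body in [("a.txt", "line1\nline2"), ("b.txt", "hello")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(body)

for path, contents in whole_text_files(d):
    print(os.path.basename(path), repr(contents))
```

Note how the multi-line file comes back as a single record, which is exactly the behavior the post goes on to discuss.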
There would be instances wherein we are given a huge XML which contains smaller XMLs, and we need to extract the same for further processing. We may not be able to parse such XMLs using TextInputFormat, since it considers every line as a record, but in the XML below, our… Read more »
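The record-extraction idea can be illustrated without Spark: scan the big XML for everything between a start tag and its matching end tag, the way a custom XML input format splits records. The `<employee>` tag and the sample document are hypothetical stand-ins for whatever the post's XML uses.

```python
import re

# Hypothetical large XML containing smaller repeated records.
big_xml = """<employees>
<employee><name>jessica</name></employee>
<employee><name>sita</name></employee>
</employees>"""

def extract_records(xml, start_tag, end_tag):
    # Non-greedy match so each record stops at its own closing tag;
    # DOTALL lets a record span multiple lines, unlike TextInputFormat.
    pattern = re.escape(start_tag) + r".*?" + re.escape(end_tag)
    return re.findall(pattern, xml, flags=re.DOTALL)

records = extract_records(big_xml, "<employee>", "</employee>")
print(records)
```

Each extracted string can then be fed to a real XML parser for further processing.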
In Spark 2, Datasets do not have APIs like leftOuterJoin() or rightOuterJoin() similar to those of RDDs. So if we have to join two Datasets, then we need to write specialized code which would help us in achieving the outer joins. To the join API we need to pass the join type… Read more »
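The semantics of the "left_outer" join type can be modeled in plain Python; the keys and values below are invented for illustration, and in Spark this would be `ds1.join(ds2, cond, "left_outer")`.

```python
# Hypothetical keyed rows: (key, value) pairs on each side of the join.
left = [("google", "sales"), ("apple", "design")]
right = [("google", "jessica")]

def left_outer_join(left, right):
    # Index the right side by key so each left row can look up its matches.
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    joined = []
    for key, lval in left:
        # Unmatched left rows survive with None, as in a left outer join.
        for rval in index.get(key, [None]):
            joined.append((key, lval, rval))
    return joined

print(left_outer_join(left, right))
```

A right outer join is the mirror image: index the left side and preserve unmatched right rows.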
Below is an example of partitioning the data based on custom logic. For writing a custom partitioner we should extend the Partitioner class and implement the getPartition() method. For this example I have an input file which contains data in the format <Continent,Country>. I would like to re-partition the… Read more »
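The core of such a partitioner is its getPartition() logic, which can be sketched on its own: route each record to a fixed partition per continent. The continent-to-partition mapping below is a made-up example, not the one from the post.

```python
# Hypothetical mapping from continent to partition index.
CONTINENT_PARTITIONS = {"Asia": 0, "Europe": 1, "Africa": 2}

def get_partition(key, num_partitions=4):
    # Known continents go to their fixed partition; anything else
    # falls back to a hash-based partition, like HashPartitioner.
    return CONTINENT_PARTITIONS.get(key, hash(key) % num_partitions)

rows = [("Asia", "India"), ("Europe", "France"), ("Asia", "Japan")]
for continent, country in rows:
    print(continent, country, "->", get_partition(continent))
```

In Spark this function body would live inside a subclass of `org.apache.spark.Partitioner`, alongside a `numPartitions` override.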
By default Spark reads text files with the newline ('\n') character as the record delimiter. But there could be instances wherein the record delimiter is some other character, e.g. CTRL+A ('\001') or a pipe ('|') character. So how can we read such files? We can set the textinputformat.record.delimiter parameter in the Configuration object… Read more »
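What that configuration changes can be shown directly: splitting raw input on a custom delimiter (CTRL+A here) instead of newlines. The sample payload is invented for illustration.

```python
# Hypothetical raw input where CTRL+A ('\001') separates records,
# the job textinputformat.record.delimiter does for Hadoop/Spark text input.
raw = "rec1\001rec2\001rec3"

def split_records(data, delimiter="\001"):
    # Each element corresponds to one record Spark would hand to the RDD.
    return data.split(delimiter)

print(split_records(raw))
```

With the Hadoop setting in place, `sc.newAPIHadoopFile` (or plain `textFile` on newer versions) yields exactly these record boundaries.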
In this post I am going to describe, with example code, how we can add a new column to an existing DataFrame using the withColumn() function of DataFrame. There are 2 scenarios: the content of the new column is derived from the values of an existing column; the new… Read more »
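Both scenarios can be modeled in plain Python with rows as dicts; the column names (`dept_upper`, `country`) and the literal value are hypothetical, and in Spark these would be `withColumn("dept_upper", upper(col("dept")))` and `withColumn("country", lit("US"))`.

```python
# Hypothetical rows standing in for a DataFrame.
rows = [{"name": "jessica", "dept": "sales"}]

def with_column(rows, name, fn):
    # Returns new rows; the input is left untouched, like withColumn().
    return [{**row, name: fn(row)} for row in rows]

# Scenario 1: column derived from an existing column.
derived = with_column(rows, "dept_upper", lambda r: r["dept"].upper())
# Scenario 2: independent new column with a constant (literal) value.
constant = with_column(derived, "country", lambda r: "US")
print(constant)
```

The key point carried over from the DataFrame API is immutability: each call produces a new collection rather than mutating in place.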
A DataFrame is a collection of data organized into named columns. DataFrames are similar to tables in a traditional database. A DataFrame can be constructed from sources such as Hive tables, structured data files, external databases, or existing RDDs. Under the hood, a DataFrame contains an RDD composed of Row objects with… Read more »
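The "RDD of Row objects plus named columns" idea can be sketched with a named tuple; the column names reuse the sample data from the first post above and are otherwise illustrative.

```python
from collections import namedtuple

# A Row type with named columns, as a DataFrame carries under the hood.
Row = namedtuple("Row", ["company", "employee", "department"])
rows = [
    Row("google", "jessica", "sales"),
    Row("google", "sita", "technology"),
]

# Column access by name, like df.select("employee") on a DataFrame.
employees = [r.employee for r in rows]
print(employees)
```

The schema (column names and positions) lives on the Row type, not in each record, which is what lets the engine plan column-level operations.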
combineByKey() is very similar to the combiner of the MapReduce framework. In the MR framework, the combiner function is called in the map phase to do a local reduction, and this value is then eventually sent over to the reducer; this results in large savings in network bandwidth. In Spark, groupByKey() doesn't… Read more »
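A plain-Python model of combineByKey() and its three functions — createCombiner, mergeValue, and mergeCombiners — showing the map-side (per-partition) reduction before any merge across partitions. The round-robin partitioning and the sample pairs are assumptions for illustration.

```python
def combine_by_key(pairs, create, merge_value, merge_combiners, partitions=2):
    # Local (map-side) combine per partition, like a MapReduce combiner.
    locals_ = [dict() for _ in range(partitions)]
    for i, (k, v) in enumerate(pairs):
        part = locals_[i % partitions]  # toy round-robin partitioning
        part[k] = merge_value(part[k], v) if k in part else create(v)
    # Merge the per-partition combiners, like the reduce side.
    result = {}
    for part in locals_:
        for k, c in part.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

pairs = [("a", 1), ("b", 2), ("a", 3)]
# Sum per key: the combiner is just the running sum.
print(combine_by_key(pairs, lambda v: v, lambda c, v: c + v, lambda c1, c2: c1 + c2))
```

Because values are reduced inside each partition first, only one combined value per key per partition crosses the network — the bandwidth saving the post refers to.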