Spark 2.0 - Outer Join Java Example

In Spark 2, Datasets do not have dedicated APIs like leftOuterJoin() or rightOuterJoin() the way RDDs do. To outer join two Datasets, we instead pass a join type argument to the join() API, which accepts the following values:
'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi', 'leftanti'.

The complete example below left outer joins a customer Dataset with an order Dataset on the CustomerId column.

import java.util.ArrayList;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OuterJoinExample {

	public static void main(String[] args) {
		// Create a local SparkSession
		SparkSession session = SparkSession.builder()
				.appName("test 2.0")
				.master("local[*]")
				.getOrCreate();

		// Read both CSV files, treating the first line as the header
		Dataset<Row> customers = session.read().option("header", true).csv("/home/myname/customer.csv");
		Dataset<Row> orders = session.read().option("header", true).csv("/home/myname/order.csv");

		// The join() overload that takes usingColumns expects a Scala Seq,
		// so convert the Java list of join column names
		ArrayList<String> joinColList = new ArrayList<String>();
		joinColList.add("CustomerId");
		Dataset<Row> joinedData = customers.join(orders,
				scala.collection.JavaConversions.asScalaBuffer(joinColList),
				"leftouter");

		joinedData.show();

		session.stop();
	}
}
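
One note on the conversion: scala.collection.JavaConversions has been deprecated in newer Scala versions. The join() API also accepts a Column expression for the join condition, which avoids the Scala conversion entirely. Below is a minimal sketch (joinedByExpr is just an illustrative name); unlike the usingColumns form above, this variant keeps the CustomerId column from both Datasets in the result.

// Same left outer join expressed with a Column condition instead of a
// Scala Seq of column names; both CustomerId columns appear in the result
Dataset<Row> joinedByExpr = customers.join(orders,
		customers.col("CustomerId").equalTo(orders.col("CustomerId")),
		"leftouter");
joinedByExpr.show();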

customer.csv

CustomerId,Name,City
1,Harish,Bangalore
2,Naresh,Mumbai
3,Suresh,New Delhi
4,Mahesh,Calcutta

order.csv

CustomerId,OrderId,Item
1,111,Laptop
1,222,Printer
3,333,Monitor

Output

+----------+------+---------+-------+-------+
|CustomerId|  Name|     City|OrderId|   Item|
+----------+------+---------+-------+-------+
|         1|Harish|Bangalore|    222|Printer|
|         1|Harish|Bangalore|    111| Laptop|
|         2|Naresh|   Mumbai|   null|   null|
|         3|Suresh|New Delhi|    333|Monitor|
|         4|Mahesh| Calcutta|   null|   null|
+----------+------+---------+-------+-------+
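
Switching to any of the other join types is just a matter of changing the string argument. For instance, with the same sample files, a right outer join would keep every order instead of every customer, so customers 2 and 4 (who have no orders) would no longer appear. A minimal sketch:

// Keep every order instead of every customer; with the sample data
// above, the rows for customers 2 and 4 drop out of the result
Dataset<Row> rightJoined = customers.join(orders,
		scala.collection.JavaConversions.asScalaBuffer(joinColList),
		"rightouter");
rightJoined.show();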
