In Spark 2, Datasets do not have APIs like leftOuterJoin() or rightOuterJoin() as RDDs do. So to outer-join two Datasets, we have to go through the generic join() API and pass it a join type argument, which can take any of the following values:

"inner", "outer", "full", "fullouter", "leftouter", "left", "rightouter", "right", "leftsemi", "leftanti"
import java.util.ArrayList;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OuterJoinExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("test 2.0")
                .master("local[*]")
                .getOrCreate();

        // Read both CSV files, treating the first row as the header
        Dataset<Row> customers = session.read().option("header", true).csv("/home/myname/customer.csv");
        Dataset<Row> orders = session.read().option("header", true).csv("/home/myname/order.csv");

        // Column(s) to join on; join() expects a Scala Seq, so convert the Java list
        ArrayList<String> joinColList = new ArrayList<String>();
        joinColList.add("CustomerId");

        // "leftouter" keeps every customer, filling nulls where no order matches
        Dataset<Row> joinedData = customers.join(orders,
                scala.collection.JavaConversions.asScalaBuffer(joinColList),
                "leftouter");
        joinedData.show();
    }
}
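As a side note, the Scala Seq conversion can be avoided by joining on a Column expression instead. Here is a minimal sketch of that alternative (joinedAlt is just an illustrative name); unlike the Seq-based join above, this form keeps the CustomerId column from both Datasets in the result:

// Sketch: the same left outer join expressed with a Column condition
// rather than a Scala Seq of column names.
Dataset<Row> joinedAlt = customers.join(orders,
        customers.col("CustomerId").equalTo(orders.col("CustomerId")),
        "leftouter");
joinedAlt.show();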
customer.csv
CustomerId,Name,City
1,Harish,Bangalore
2,Naresh,Mumbai
3,Suresh,New Delhi
4,Mahesh,Calcutta
order.csv
CustomerId,OrderId,Item
1,111,Laptop
1,222,Printer
3,333,Monitor
Output
+----------+------+---------+-------+-------+
|CustomerId|  Name|     City|OrderId|   Item|
+----------+------+---------+-------+-------+
|         1|Harish|Bangalore|    222|Printer|
|         1|Harish|Bangalore|    111| Laptop|
|         2|Naresh|   Mumbai|   null|   null|
|         3|Suresh|New Delhi|    333|Monitor|
|         4|Mahesh| Calcutta|   null|   null|
+----------+------+---------+-------+-------+
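Since the join type is just a string argument, the same code gives a right or full outer join by swapping "leftouter" for another value from the list above. A minimal sketch, reusing the customers, orders, and joinColList from the example (rightJoined and fullJoined are illustrative names):

// Right outer join: keeps every order row, nulls for unmatched customers
Dataset<Row> rightJoined = customers.join(orders,
        scala.collection.JavaConversions.asScalaBuffer(joinColList),
        "rightouter");
rightJoined.show();

// Full outer join: keeps unmatched rows from both sides
Dataset<Row> fullJoined = customers.join(orders,
        scala.collection.JavaConversions.asScalaBuffer(joinColList),
        "fullouter");
fullJoined.show();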