Below is a simple Python script to list files under a specific directory in an S3 bucket. You should have boto3 installed.
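A minimal sketch of that listing logic. The function name, bucket, and prefix are illustrative; the client is passed in so the core logic stays testable. It uses boto3's `list_objects_v2` paginator so listings larger than the 1000-key page limit are still returned in full.

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix` in `bucket`.

    Uses the list_objects_v2 paginator so results beyond the
    1000-key page limit are also returned.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # page may have no Contents
            keys.append(obj["Key"])
    return keys

# With a real client (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("s3")
#   for key in list_keys(client, "my-bucket", "data/"):  # placeholders
#       print(key)
```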
Below is a simple snippet to get the master DNS of an EMR cluster.
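A sketch of that lookup using boto3's EMR `describe_cluster` call; the helper name and cluster id are placeholders, and the client is injected so the extraction logic is testable on its own.

```python
def master_dns(emr_client, cluster_id):
    """Return the MasterPublicDnsName reported by describe_cluster."""
    resp = emr_client.describe_cluster(ClusterId=cluster_id)
    return resp["Cluster"]["MasterPublicDnsName"]

# With a real client (requires boto3 and AWS credentials):
#   import boto3
#   emr = boto3.client("emr")
#   print(master_dns(emr, "j-XXXXXXXXXXXX"))  # placeholder cluster id
```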
Below is a simple Python-based Kafka producer which reads data from Twitter and puts the data into a Kafka topic. You will have to register with Twitter to get tweets streamed into this app. After registration you will have your own access_token, access_token_secret, consumer_key, and consumer_secret. Install the tweepy and twitter libraries using the command pip install… Read more »
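The core of such a producer can be sketched as a small forwarder that takes raw tweet JSON from a streaming callback and publishes the tweet text to a topic. The class and topic names are illustrative, and the producer is any object exposing `send(topic, value)` (for example kafka-python's `KafkaProducer`); the tweepy stream wiring and credentials are assumed to be set up separately.

```python
import json

class TweetForwarder:
    """Receives raw tweet JSON (as a streaming callback would deliver it)
    and publishes the tweet text to a Kafka topic via `producer`."""

    def __init__(self, producer, topic):
        self.producer = producer  # expected to expose send(topic, value)
        self.topic = topic

    def on_data(self, raw_json):
        tweet = json.loads(raw_json)
        text = tweet.get("text", "")
        if text:  # skip delete/notification events that carry no text
            self.producer.send(self.topic, text.encode("utf-8"))
        return True  # tell the stream to keep running
```

With kafka-python installed, `producer = KafkaProducer(bootstrap_servers="localhost:9092")` would be plugged in as the producer.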
Below is a simple Spark Streaming application which reads data from a Kafka topic and prints the content. I have installed Kafka on my laptop. Auto-commit of offsets is disabled since I am committing the offsets to Kafka using the commitAsync API. Every batch of data will contain 1 second worth… Read more »
I have come across requirements where I am supposed to generate output in nested JSON format. Below is a sample code which helps to do the same. The input to this code is a CSV file which contains 3 columns: company name, department, employee name. Example: google,jessica,sales google,sita,technology We… Read more »
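A minimal sketch of that grouping, assuming rows in the order company,employee,department as in the example lines above: group by company, then by department, collecting employee names into lists before dumping to JSON.

```python
import csv
import json
from collections import defaultdict

def csv_to_nested_json(lines):
    """Group rows of (company, employee, department) into
    {company: {department: [employees]}}."""
    nested = defaultdict(lambda: defaultdict(list))
    for company, employee, department in csv.reader(lines):
        nested[company][department].append(employee)
    # convert defaultdicts to plain dicts for clean JSON output
    return {c: dict(d) for c, d in nested.items()}

rows = ["google,jessica,sales", "google,sita,technology"]
print(json.dumps(csv_to_nested_json(rows), indent=2))
```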
In an earlier post of mine I gave an example of how to use VTD-XML. We could run into requirements where we would like to use VTD-XML in a multi-threaded application. This post provides a working example of integrating VTD-XML into a multi-threaded application. The… Read more »
There would be occasions where we also need to know the name of the file that we are processing. We can use wholeTextFiles(), which returns a PairRDD of filename and file contents. wholeTextFiles() puts the whole contents of a file into a single record, which means if we are… Read more »
There would be instances where we are given a huge XML which contains smaller XMLs, and we need to extract them for further processing. We may not be able to parse such XMLs using TextInputFormat, since it considers every line as a record, but in the XML below, our… Read more »
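The record-extraction idea can be sketched in plain Python: treat a start tag and an end tag as record delimiters, the way a tag-based input format would, so a record can span multiple lines. The tag names below are illustrative.

```python
def extract_records(text, start_tag, end_tag):
    """Pull out every substring spanning start_tag..end_tag.

    Treats the tags as plain delimiters, so each embedded XML
    fragment becomes one record even if it spans several lines.
    """
    records, pos = [], 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records

doc = "<catalog>\n<book><title>A</title></book>\n<book><title>B</title></book>\n</catalog>"
for rec in extract_records(doc, "<book>", "</book>"):
    print(rec)
```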
In Spark 2, Datasets do not have APIs like leftOuterJoin() or rightOuterJoin() similar to those of RDDs. So if we have to join two Datasets, we need to write specialized code to achieve the outer joins. To the join API we need to pass the join type… Read more »
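To illustrate the semantics that the join-type string selects (in Spark the call itself would look like `ds1.join(ds2, cond, "left_outer")`), here is a plain-Python sketch of a left outer join over two keyed maps; the data is made up:

```python
def left_outer_join(left, right):
    """Left outer join of two {key: value} dicts, mirroring what
    the "left_outer" join type gives you: every left key survives,
    and missing right-side values become None."""
    return {k: (v, right.get(k)) for k, v in left.items()}

depts = {"sales": "jessica", "technology": "sita"}
heads = {"sales": "alice"}
print(left_outer_join(depts, heads))
# {'sales': ('jessica', 'alice'), 'technology': ('sita', None)}
```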
Below is an example of partitioning data based on custom logic. For writing a custom partitioner we should extend the Partitioner class and implement the getPartition() method. For this example I have an input file which contains data in the format <Continent,Country>. I would like to re-partition the… Read more »
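In Scala the custom class would override getPartition() and numPartitions; in PySpark the equivalent is passing a partition function to `rdd.partitionBy(numPartitions, partitionFunc)`. A sketch of the continent-keyed logic (the continent list is an assumption for illustration):

```python
# Hypothetical fixed mapping: one partition per continent.
CONTINENTS = ["Asia", "Africa", "Europe", "North America",
              "South America", "Australia", "Antarctica"]

def continent_partitioner(key):
    """getPartition()-style logic: map a continent name to a
    partition id; unknown continents fall back to partition 0."""
    try:
        return CONTINENTS.index(key)
    except ValueError:
        return 0

# In PySpark this plugs in as:
#   rdd.partitionBy(len(CONTINENTS), continent_partitioner)
print(continent_partitioner("Europe"))  # 2
```

Because the mapping is deterministic, all records for a continent land in the same partition, which is the point of the custom logic.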