Meetup.com is a social site for organizing offline meetups. Every minute, tens to hundreds of people RSVP, that is, accept or decline an invitation to attend a meetup by clicking the yes/no button on a meetup page. Meetup offers a live stream of this RSVP data through its API; the data is available as JSON over a WebSocket.
I developed a simple data pipeline built from the following big data technologies.
- Apache Kafka
- Apache Cassandra
- Apache Spark
In a series of posts I will explain how this process works. For now, let's look at the image below.
The image above shows the flow of data across the various tools. To make the sequence easy to follow, each block is numbered from 1 to 6.
The data source consists of two parts.
- meetup.com, which makes the live RSVP data available over WebSockets.
- A WebSocket client written in Python that collects the data from meetup.com.
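The collector step can be sketched as follows. This is a minimal sketch, not a reference client: the `handle_rsvp` helper and the exact fields it extracts are assumptions to verify against Meetup's API docs, and the connection itself would need a third-party WebSocket library such as `websocket-client`.

```python
import json

def handle_rsvp(raw_message):
    """Parse one RSVP JSON message and pull out a few fields of interest.

    The field names (event.event_name, group.group_name, response) follow
    the RSVP stream's JSON layout; treat them as an assumption to check
    against the Meetup API documentation.
    """
    rsvp = json.loads(raw_message)
    return {
        "event": rsvp["event"]["event_name"],
        "group": rsvp["group"]["group_name"],
        "response": rsvp["response"],  # "yes" or "no"
    }

if __name__ == "__main__":
    # Connecting needs a WebSocket library; `websocket-client` is one
    # option (assumed here, not part of the standard library):
    #   from websocket import create_connection
    #   ws = create_connection("ws://stream.meetup.com/2/rsvps")
    #   while True:
    #       print(handle_rsvp(ws.recv()))
    pass
```

Each parsed message can then be handed to the Kafka producer described in the next step.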
The messaging block consists of two parts.
- Kafka producer, which publishes the incoming RSVP data to a Kafka topic as messages.
- Kafka consumer, which reads the messages from that topic.
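A minimal sketch of the producer/consumer pair, assuming the third-party `kafka-python` package and a broker on `localhost:9092`; the topic name `meetup-rsvps` is a hypothetical choice, not from the original post.

```python
import json

TOPIC = "meetup-rsvps"  # hypothetical topic name

def serialize(rsvp_dict):
    """Encode an RSVP dict as UTF-8 JSON bytes, Kafka's wire format here."""
    return json.dumps(rsvp_dict).encode("utf-8")

def deserialize(raw_bytes):
    """Decode message bytes received from Kafka back into a dict."""
    return json.loads(raw_bytes.decode("utf-8"))

if __name__ == "__main__":
    # Producing and consuming needs a running broker and a client library;
    # `kafka-python` is one option (assumed, not part of the stdlib):
    #   from kafka import KafkaProducer, KafkaConsumer
    #   producer = KafkaProducer(bootstrap_servers="localhost:9092",
    #                            value_serializer=serialize)
    #   producer.send(TOPIC, {"response": "yes"})
    #   consumer = KafkaConsumer(TOPIC, bootstrap_servers="localhost:9092",
    #                            value_deserializer=deserialize)
    #   for message in consumer:
    #       print(message.value)
    pass
```

Using JSON for the message value keeps the producer and consumer decoupled: either side can be rewritten in another language as long as both agree on the encoding.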
We store the messages received by the consumer in Cassandra, a column-oriented database.
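The storage step might look like the sketch below. The keyspace, table, and column names are hypothetical (the post does not give a schema), and actually executing the CQL would need the third-party DataStax `cassandra-driver` package.

```python
# Hypothetical keyspace, table, and column names; illustrative only.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS meetup.rsvps (
    rsvp_id bigint PRIMARY KEY,
    event_name text,
    group_name text,
    response text
)
"""

INSERT_CQL = (
    "INSERT INTO meetup.rsvps (rsvp_id, event_name, group_name, response) "
    "VALUES (%s, %s, %s, %s)"
)

def rsvp_to_row(rsvp):
    """Flatten a parsed RSVP dict into the column order used by INSERT_CQL."""
    return (rsvp["rsvp_id"], rsvp["event"], rsvp["group"], rsvp["response"])

if __name__ == "__main__":
    # Writing rows needs the DataStax `cassandra-driver` package (assumed):
    #   from cassandra.cluster import Cluster
    #   session = Cluster(["127.0.0.1"]).connect()
    #   session.execute(CREATE_TABLE_CQL)
    #   session.execute(INSERT_CQL, rsvp_to_row(some_rsvp))
    pass
```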
In the final step, we use Apache Spark to:
- Read the data from Cassandra.
- Perform data analysis.
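The final step can be sketched as follows. Reading the table into Spark assumes the `spark-cassandra-connector` package on the classpath, and the keyspace/table names carry over from the hypothetical schema above; counting yes/no responses is just one example of an analysis, not necessarily the one in the post.

```python
from collections import Counter

def response_counts(rows):
    """Count RSVPs per response value ("yes"/"no") from an iterable of
    dict-like rows; the same aggregation a Spark groupBy would perform."""
    return Counter(row["response"] for row in rows)

if __name__ == "__main__":
    # Loading the Cassandra table into Spark needs the
    # spark-cassandra-connector package (assumed; launch options vary):
    #   from pyspark.sql import SparkSession
    #   spark = (SparkSession.builder
    #            .appName("meetup-rsvp-analysis")
    #            .config("spark.cassandra.connection.host", "127.0.0.1")
    #            .getOrCreate())
    #   df = (spark.read.format("org.apache.spark.sql.cassandra")
    #         .options(keyspace="meetup", table="rsvps")
    #         .load())
    #   df.groupBy("response").count().show()
    pass
```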
- Meetup API - http://www.meetup.com/meetup_api/docs/2/rsvps/
- RSVP data from websockets - ws://stream.meetup.com/2/rsvps