Hard | 1000 Points | Topcoder Skill Builder Competition | Databricks | Apache Spark

Key Information

The challenge is finished.

Challenge Overview



This is the 1000-point Hard level problem of the Topcoder Skill Builder Competition for Databricks and Apache Spark. For more context about the challenge, register for the Host Competition before submitting a solution to this problem.

 

Technology Stack

  • Python

  • Scala

  • SQL

  • Apache Spark

  • Databricks

You can use Python, Scala, SQL, or any combination of these in your notebook. Using Apache Spark is mandatory.

Problem Statement

Notebook Setup

  • Sign up for the Databricks community edition here

  • Note that during sign up, you will be prompted to select between the Business and Community editions. Be sure to select the Community edition, which is the free version of Databricks.

  • You will be working with the dataset available in Github Archive. You can upload the dataset to your Databricks workspace and import it when working on the tasks below, or you can download it at runtime. You should be familiar with Github / version control to understand the terminology used in the tasks below.

  • Once you have signed up, proceed with the steps below

 

Data Ingestion Task

 

Data Cleaning Task

  • Analyze the data and identify the attribute(s) on events of type “PullRequestEvent” that tell you the language of the repository to which the pull request was submitted (HINT - they are nested under the “payload” attribute). You need to observe the values of two attributes:

    • One that lets you know that the pull request was opened

    • And the other that lets you know the language of the repository in which the pull request was opened.

  • Not all “PullRequestEvent” objects will have this information. Ignore the records that do not.

  • Thus, you will first filter out all records that are NOT of type “PullRequestEvent”, and then filter out all records that lack information about the language of the repository to which the pull request was submitted.

 

Data Processing Task 1

  • Once you have cleaned the dataset, write the commands to group the events by the language associated with the repository

  • That is, from the pull request events you filtered out of the dataset, determine the programming language of each repository to which a pull request was submitted, and group the repositories by language

  • Using the Databricks visualization feature, plot the languages as a pie chart. This shows the share of each language and thus the most popular languages.

  • Your notebook must contain all the commands in the cells that you used to arrive at this.

 

Data Processing Task 2

  • Start with the data received from the Data Ingestion task earlier.

  • For events related to issues, the payload contains information about the labels associated with the issue. Ignore events that do not have the issue label information.

  • Collect the name attribute of the labels.

  • Determine the top 10 most popular label names and plot them as a simple bar chart, with the label name on the x-axis and the count of each label on the y-axis. Arrange them in ascending order of count.

  • Your notebook must contain the commands in the cells that you used to arrive at this.

 

Data Processing Task 3

  • Start with the data received from the Data Ingestion task earlier.

  • Github allows users to watch the activity on a repository. On the 31st of October, determine the top 5 repositories that were watched (event type is WatchEvent).

  • Display them in a table with a single column whose header is “Repository Name”. Display the repository names in this column. Arrange them in descending order (the most-watched repository appears first in the table).

  • Your notebook must contain the commands in the cells that you used to arrive at this.

 

Finally, publish your notebook. Databricks will provide a public URL where your notebook can be accessed.

 

Important Notes

  • Don’t just write the commands necessary to complete the task: run all the cells in the notebook, display the output, verify that it meets expectations, and then publish.

  • This contest is part of the Databricks Skill Builder Contest

  • Successfully completing the task will earn you 1000 points on the Databricks Skill Builder Leaderboard.

  • All tasks will be part of a single notebook. DO NOT provide multiple notebooks.

 

Problems

  1. Easy: 250 Points

  2. Medium: 500 Points

  3. Hard: 1000 Points - This contest



Final Submission Guidelines

Submit a text file that contains the link to your Databricks Notebook

 

ELIGIBLE EVENTS:

2021 Topcoder(R) Open

REVIEW STYLE:

Final Review:

Community Review Board

Approval:

User Sign-Off


ID: 30149041