November 10, 2020

Embracing the new Analytics Platform: Databricks

Following the huge success of Apache Spark (the de facto standard processing engine for big data), its creators went on to found Databricks. Databricks, founded in 2013, is a software-as-a-service company that offers a Unified Data Analytics Platform (UDAP) for accelerating innovation across Data Science, Data Engineering and Business Analytics. Today, Databricks is one of the fastest-growing data services on AWS and Azure, with 5,000+ customers and 450+ partners across the globe.

The current Databricks release, Runtime 7.3 LTS, runs on Apache Spark 3.0.1 and supports a broad pool of analytical capabilities that can enhance the outcome of your data pipeline. In this post we will introduce some of Databricks' primary features and show you how to get started.

Here are some helpful links, placed right at the top rather than buried deep in the post:

The Databricks documentation is a good place to explore further: https://docs.databricks.com/getting-started/index.html

You can also read about the Databricks architecture here: https://docs.databricks.com/getting-started/overview.html

Background knowledge:

Databricks leverages Apache Spark for its computational capabilities and supports several programming languages for writing code, such as Python, R, Scala and SQL. It is therefore important for developers to have a sound understanding of these in order to make full use of the available Databricks capabilities.
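As a quick illustration of how these languages coexist in a single notebook, here is a minimal sketch assuming a notebook whose default language is Python, where Databricks pre-creates the spark session; data registered from Python can then be queried with SQL in the same session.

    # Python cell: create a small DataFrame and expose it to SQL as a temporary view
    df = spark.range(5)                      # spark is the SparkSession Databricks provides
    df.createOrReplaceTempView("numbers")

    # The same data can be queried with SQL, either in a %sql cell or from Python:
    spark.sql("SELECT id FROM numbers WHERE id > 2").show()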

About Apache Spark: Spark is a lightning-fast, open-source, distributed processing system designed for big data workloads. Its main features are in-memory caching and optimized query execution, which increase the processing speed of an application.
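To make the in-memory caching point concrete, here is a minimal PySpark sketch (runnable in a Databricks Python notebook; the DataFrame and column names are made up for illustration): after the first action populates the cache, repeated queries read from memory instead of recomputing the data.

    from pyspark.sql import functions as F

    # A small synthetic DataFrame of one million rows
    events = spark.range(1000000).withColumn("bucket", F.col("id") % 10)

    # Mark it for in-memory caching; the cache is filled on the first action
    events.cache()
    events.count()                           # first action: computes and caches the rows

    # Later queries against the cached data avoid recomputation
    events.groupBy("bucket").count().show()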

You can read more about Apache Spark here:

https://docs.databricks.com/getting-started/spark/index.html

https://spark.apache.org/docs/latest/

https://spark.apache.org/

A peep into the Unified Data Analytics Platform:

The Platform can be broadly divided into the following major constituents:
[Screenshot: major constituents of the Unified Data Analytics Platform]

  1. Data Science Workspace: From data ingestion to data analysis, the Workspace gives your Data Science team a shared environment for collaborative work. Team members can use different functionalities depending on their roles as data practitioners, and each Workspace is connected to the organization's cloud data store to facilitate data munging and analysis (a small data-access sketch follows this list). The Workspace has 3 major components, as follows:

[Screenshot: the 3 major components of the Data Science Workspace]

  2. Unified Data Service: The engine powering the work data practitioners perform in the Data Science Workspace. Its 3 major components are as follows:

[Screenshot: the 3 major components of the Unified Data Service]

  3. Enterprise Cloud Service: Allows organizations to set up, secure, manage and scale their platform. The major components include:

[Screenshot: major components of the Enterprise Cloud Service]
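To show what being connected to an organization's cloud data store looks like in practice, here is a small sketch of reading a CSV file from a notebook. The path below is a placeholder, not a real location; point it at a file in your own storage (for example a DBFS mount).

    # Placeholder path (assumption): replace with a file in your own cloud data store,
    # e.g. a DBFS mount such as /mnt/<your-storage>/sales.csv
    csv_path = "/mnt/my-storage/sales.csv"

    sales = (spark.read
                  .option("header", "true")       # first line holds the column names
                  .option("inferSchema", "true")  # let Spark infer column types
                  .csv(csv_path))

    display(sales)                                # Databricks' built-in table display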

Wish to explore more? Below is a user guide to help you set up Databricks and dive into the vast analytical suite it offers.

You should have a working Databricks account. If not, sign up for the free Community Edition now at https://databricks.com/try-databricks

Getting Started:

These steps are illustrated in the screenshots that follow; this is the summary:

1. Copy the courseware URL.

2. Import the courseware into your Databricks account per the instructions below.

3. Create a cluster, choosing Databricks Runtime 4.0 (also illustrated below).

Congratulations! You have successfully created your account. We will now guide you through logging into your account.

  1. Once you have successfully registered, this is how your profile looks:

[Screenshot: profile view after registration]

  2. Creating Notebooks: Notebooks provide a collaborative workspace for data practitioners (a quick sanity-check cell is sketched after the screenshots below).

[Screenshot: creating a notebook (1 of 2)]

[Screenshot: creating a notebook (2 of 2)]
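Once the new notebook is attached to a running cluster, a minimal first cell (Python, purely as a sanity check) confirms that everything is wired up:

    # Quick sanity check for a freshly created Python notebook
    print(spark.version)                     # Spark version backing the attached cluster

    # Build and display a tiny DataFrame
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
    display(df)                              # Databricks' rich table display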

  3. Importing Notebooks: Alternatively, notebooks can be imported for further manipulation or simply to reuse existing code (a programmatic alternative is sketched after the screenshots).

[Screenshot: importing a notebook (1 of 2)]

[Screenshot: importing a notebook (2 of 2)]
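Notebooks can also be imported programmatically rather than through the UI. The sketch below calls the Workspace API's /api/2.0/workspace/import endpoint from Python; the workspace URL, token, file name and target path are placeholders to replace with your own.

    import base64
    import requests

    # Placeholders (assumptions): use your own workspace URL, token and paths
    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    # Read the notebook source and base64-encode it, as the API expects
    with open("my_notebook.py", "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")

    resp = requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "path": "/Users/you@example.com/my_notebook",
            "format": "SOURCE",              # import as plain notebook source code
            "language": "PYTHON",
            "content": content,
            "overwrite": True,
        },
    )
    resp.raise_for_status()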

  4. Finding your Notebook: This is where you can see the notebooks you have created.

[Screenshot: locating your notebooks in the Workspace]

The following link gives a detailed overview of Databricks notebooks: Documentation- Notebooks

  5. Creating a Cluster (a programmatic alternative is sketched after the screenshot).

[Screenshot: creating a cluster]
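For completeness, clusters can also be created programmatically. This is a minimal sketch against the Clusters API's /api/2.0/clusters/create endpoint; the runtime version and node type shown are examples and depend on your cloud provider and subscription.

    import requests

    # Placeholders (assumptions): use your own workspace URL and token
    HOST = "https://<your-workspace>.cloud.databricks.com"
    TOKEN = "<personal-access-token>"

    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "cluster_name": "getting-started",
            "spark_version": "7.3.x-scala2.12",  # e.g. Databricks Runtime 7.3 LTS
            "node_type_id": "i3.xlarge",         # example AWS node type; differs on Azure
            "num_workers": 1,
            "autotermination_minutes": 30,       # terminate the cluster when idle
        },
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])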

Interesting, right? So why wait?

Unified Data Analytics is a new category of solutions that unifies data processing with AI technologies. The central theme behind adopting a Unified Data Analytics approach is to make AI much more achievable while extracting hidden and meaningful insights from the data available.

Explore how Databricks can help individuals and organizations adopt a Unified Data Analytics approach for better performance and staying ahead of the competition.

Sign up for the free Community Edition of Databricks and explore.

Databricks Sign Up

Curious enough? Read more on Databricks here:

Databricks Concepts
Video Content for Databricks
