December 15, 2020

Introduction to Machine Learning using Azure Databricks

A breakthrough in machine learning would be worth ten Microsofts. -Bill Gates

This article introduces machine learning on Azure Databricks. Databricks is an analytics platform built on Apache Spark. It provides a workspace for working with Spark DataFrames.

Machine Learning introduction

Machine learning (ML) is popular these days thanks to the availability of high-performance computing and the new algorithms that keep emerging in the AI space. ML is applied to automate routine tasks and to provide insights for decision making. Enterprises use it to analyze their data and derive more value from it, which has created roles such as Data Scientists, Data Analysts, and Engineers. ML is being applied in prediction, image processing, speech processing, fraud detection, and data validity analysis.

Figure: Machine Learning - AI - Data Science

Azure Databricks is a cloud-based analytics service built on Apache Spark. It gives developers a workspace with visualization and data analytics features, along with extract, transform, and load (ETL) capabilities. Data scientists can create machine learning (ML) models in Databricks, and developers can use Python, SQL, Scala, and R to run them. These models can access data sources that are in-house (on-premises) or in the cloud.

Figure: Azure Databricks Workspace

Databricks here runs on the Azure cloud platform. It offers multiple environments for creating analytical applications, including the Azure Databricks Workspace and SQL Analytics. SQL Analytics is used for executing SQL queries on data lakes. The Workspace is used for creating big data pipelines, with Azure Data Factory ingesting the data.

We will explore each application one by one below:

Azure Databricks - Big Data and Analytics

Databricks provides Jupyter-style notebooks that can be version controlled using GitHub and Azure DevOps. It integrates Apache Spark with other open-source packages. Developers can create Spark clusters for big data processing, and Azure Databricks supports autoscaling and auto-termination for those clusters.
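To make the cluster features concrete, here is a minimal sketch of a cluster definition with autoscaling and auto-termination, written as a Python dictionary. The field names (cluster_name, spark_version, node_type_id, autoscale, autotermination_minutes) follow the Databricks Clusters REST API as commonly documented, so check them against the current reference before relying on them. The helper name build_cluster_spec, the sanity checks, and the placeholder values are assumptions made for illustration only.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster definition with autoscaling and auto-termination.

    Hypothetical helper: it only assembles and sanity-checks a dictionary,
    it does not call any Databricks service.
    """
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {
        "cluster_name": name,
        # Placeholders: use a runtime version and VM size offered in your workspace.
        "spark_version": "<runtime-version>",
        "node_type_id": "<azure-vm-size>",
        # Autoscaling: the worker count may vary between these two bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto-termination: shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("etl-demo", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

A definition like this would typically be submitted through the workspace UI, the CLI, or the REST API. The sketch stops at building the payload so that it runs anywhere.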

Figure: Azure Databricks Platform

Azure Databricks - AI

Databricks provides AI support through TensorFlow, PyTorch, and scikit-learn, and Azure Databricks provides workspaces for AI solutions. Machine learning operations in Databricks help with the creation of data science models and their deployment into testing and production environments. Azure ML is integrated with Databricks to provide versioning of the models using Git. Datasets can be tracked, profiled, and versioned. Data models can be created based on regulatory compliance requirements. The model execution history keeps the data snapshots taken after training, testing, and validation.
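The idea of versioned data snapshots for training, validation, and testing can be sketched in plain Python. This is not the Azure ML or Databricks API: the helper names split_dataset and fingerprint, the 70/15/15 split, and the short SHA-256 fingerprint used as a version label are all assumptions chosen for illustration.

```python
import hashlib
import json
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows reproducibly and split them into train/validation/test."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test share")
    shuffled = list(rows)
    # A fixed seed makes the split repeatable, so a snapshot can be recreated.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def fingerprint(rows):
    """Return a short content hash that can serve as a snapshot version label."""
    payload = json.dumps(rows).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


rows = list(range(100))
snapshots = split_dataset(rows)
for name, part in snapshots.items():
    print(name, len(part), fingerprint(part))
```

Recording the fingerprint of each snapshot next to a model run is one simple way to tell later which data the model was trained and evaluated on.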

Figure: Azure Databricks ML Model

Azure Databricks - Data Warehousing

The Databricks platform can process data with Azure Data Factory, Azure Data Lake Storage, and Azure Synapse Analytics. Azure Databricks is used in data warehousing to provide dashboard and reporting capabilities. The platform provides a transactional storage layer for reliable and scalable data management. Power BI can be integrated with Databricks for analytic capabilities. Workspaces are secured with features such as compliant and private analytics workspaces. Big data workloads are run using continuous integration and continuous deployment (CI/CD) tools and DevOps tools.

Figure: Azure Databricks Platform - Data Engineering

Azure Databricks - Machine Learning

The Databricks platform lets users select the ML techniques and parameters to execute. ML models can be created, managed, and monitored using this cloud-based software. Azure ML has a registry for ML pipelines, models, and executed datasets. The Apache Spark engine provides autoscaling and high-performance big data analysis. Azure Databricks has preconfigured ML workspaces for TensorFlow, scikit-learn, and PyTorch.
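Selecting parameters for an ML technique usually means trying several candidate settings and keeping the one that scores best on validation data. The sketch below shows that idea as a small grid search in plain Python. It is a toy stand-in, not a Databricks or Azure ML feature: the one-variable linear model, the parameter names l2 and fit_intercept, and the sample data are all assumptions made for illustration.

```python
from itertools import product


def fit_line(xs, ys, l2, fit_intercept):
    """Fit y ~ slope * x + intercept with an L2 penalty on the slope."""
    mean_x = sum(xs) / len(xs) if fit_intercept else 0.0
    mean_y = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + l2)
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, val, grid):
    """Try every parameter combination and keep the lowest validation error."""
    names = sorted(grid)
    best_score, best_params = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mse(fit_line(*train, **params), *val)
        if best_score is None or score < best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Toy data generated from y = 2x + 1, split into training and validation parts.
train = ([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
val = ([5, 6], [11, 13])
grid = {"l2": [0.0, 1.0, 10.0], "fit_intercept": [True, False]}

best_score, best_params = grid_search(train, val, grid)
print(best_params, best_score)
```

On a real platform the same loop is usually run by a tuning library and spread across cluster workers, with each trial logged so that runs can be compared.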

In the next part of the series, we will look at other application areas of Databricks.

