Welcome back, folks!
In the previous blog post, we discussed ‘Unified Data Services’ and Apache Spark at length. Another important topic we want to discuss is Python-based data science, which has exploded over the past few years as pandas has emerged as the linchpin of the ecosystem. When data scientists get their hands on a data set, they use pandas to explore it. It is the ultimate tool for data wrangling and analysis. In fact, pandas’ read_csv is often the very first command students run in their data science journey.
The problem? Pandas does not scale well to big data. It was designed for small data sets that a single machine could handle. On the other hand, Apache Spark has emerged as the de facto standard for big data workloads. Today many data scientists use pandas for coursework, pet projects, and small data tasks, but when they work with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas.
Now with Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework. As you can see below, you can scale your pandas code on Spark with Koalas just by replacing one package with the other.
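As a minimal sketch of the idea: the Koalas import line is commented out since it requires a Spark environment, and the column names and values are purely illustrative.

```python
import pandas as pd
# To run the same code on Spark, the only change needed is the import:
# import databricks.koalas as pd

# A small illustrative data set
df = pd.DataFrame({"year": [2018, 2019, 2020], "sales": [100, 150, 200]})

# The same pandas-style operations work unchanged on a Koalas DataFrame
print(df["sales"].mean())                              # 150.0
print(df.sort_values("sales", ascending=False).head(1))
```

Everything below the import runs identically whether `pd` is pandas or Koalas, which is the whole point of the package.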
Koalas implements the pandas DataFrame API on top of Apache Spark, letting you access data in Spark through the familiar pandas interface. Data scientists interacting with big data can leverage Koalas for enhanced productivity.
Pandas is the de facto standard (single node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. This package helps you in the following ways-
Being familiar with pandas makes you immediately productive with Spark.
You can create a single codebase that will work both with pandas (tests, smaller datasets) and with Spark (distributed datasets).
Explore more about Koalas using the link Welcome Guide to Koalas
On Databricks Runtime 7.0 or below, Koalas can be installed as a Databricks PyPI library. You can access the code to migrate from pandas to Koalas using the link below-
Databricks believes that enabling pandas on Spark will significantly increase productivity for data scientists and data-driven organizations for several reasons:
Koalas removes the need to decide whether to use pandas or PySpark for a given data set
For work that was initially written in pandas for a single machine, Koalas allows data scientists to scale up their code on Spark by simply switching out pandas for Koalas
Koalas unlocks big data for more data scientists in an organization since they no longer need to learn PySpark to leverage Spark
To learn more about Koalas, please read Koalas Documentation.
Let us now explore the Databricks platform architecture.
Databricks is one cloud platform for massive scale data engineering and collaborative data science. The three major constituents of the Databricks platform are-
The Data Science Workspace
Unified Data Services
Enterprise Cloud Services
We will now explain the ‘Enterprise Cloud Service’ in detail.
Enterprise Cloud Service provides native security, simple organization-wide administration, and automation at scale for the Unified Data Analytics Platform across multiple clouds (AWS and Azure). It is a simple, scalable, and secure data platform delivered as a service, built to support all data personas and use cases globally and at scale. It ships with the strong security controls required by regulated enterprises, is API-driven so it can be fully automated and integrated into enterprise-specific workflows, and is built for production and business-critical operations. Security analysts, IT administrators, cloud architects, and DevOps engineers can leverage the following features to accomplish their respective tasks while operating in a safe environment.
Platform Security: Users get the right access to the right data, with comprehensive audit trails, by using their existing cloud security policies and identity management systems to create compliant, private, and isolated workspaces.
360° Administration: The users can spin up and down collaborative workspaces for any project while being equipped with the right tools to manage access, control spending, audit usage, and analyze activity across every workspace, all while seamlessly enforcing user and data governance.
Elastic Scalability: Users have access to fully configured data environments and APIs to quickly take initiatives from development to production. Once in production, data teams can use on-demand autoscaling to optimize performance and reduce the downtime of data pipelines and ML models in production by efficiently matching resources to demand.
Multi-cloud Management: A single platform integrates into each cloud, enabling data teams to do data analytics and machine learning without having to learn cloud-specific tools and processes.
Read more about Enterprise Cloud Service from Introduction to Enterprise Cloud Service
Furthermore, Enterprise Cloud Service is subdivided into three categories to support its efficient functioning, which are as follows-
Platform Security ensures end-to-end security that protects your data while providing private, isolated, compliant workspaces for your data engineers, business analysts and data scientists. The diagram below gives a glimpse of Platform Security infrastructure and benefits.
Interesting, right? Explore more about Platform Security and start your fourteen-day free trial now.
Administration helps you audit and analyze activity, control budget, set policies to administer users and resources, and manage infrastructure for hassle-free enterprise-wide administration.
To manage your Databricks service, you need a few different kinds of administrators:
The account owner manages your Databricks account, including billing, workspaces, subscription level, host AWS accounts, audit logging, and high-level usage monitoring. This is typically the user who signed up for your Databricks subscription.
The Databricks admins manage workspace users and groups including single sign-on, provisioning and access control along with workspace storage. Your account can have any number of admins (based on your preference) and admins can delegate some management tasks to non-admin users (for instance: cluster management).
The diagram below gives a glimpse of Simple Administration infrastructure with benefits.
You can read more about Simple Administration from the Administration Guide. By now, you must have a sound understanding of Simple Administration. As rightly quoted by Immanuel Kant,
“Experience without theory is blind, but theory without experience is a mere intellectual play”
So, what are you waiting for? Start your Simple Administration Free Trial now!
Elastic Scalability is designed to optimize for speed, reliability, and scalability for all workloads. It helps to take data applications from development to production at a much faster rate using pre-configured data environments and APIs for automation. It also streamlines operations with autoscaling infrastructure and by closely monitoring the changes.
Explore more about Elastic Scalability and how it can accelerate your production process using the link below-
You can read more about Enterprise Cloud Services from-
Now, we will touch upon another important aspect: MLOps in Data Science projects.
As you know, most organizations have a predefined process to promote code (e.g. Java or Python) from development to QA/Test and production. Others are using Continuous Integration and/or Continuous Delivery (CI/CD) processes and often use tools such as Azure DevOps or Jenkins to help with the process. Databricks provides resources to help integrate Databricks Unified Analytics Platform with these tools.
Most organizations do not have the same kind of disciplined process for Machine Learning. This is because of the following reasons-
The Data Science team does not follow the same Software Development Lifecycle (SDLC) process as regular developers. Key differences in the Machine Learning Lifecycle (MLLC) are related to goals, quality, tools, and outcomes.
Machine Learning is still a young discipline and it is often not well integrated organizationally.
The Data Science and deployment teams do not treat the resulting models as separate artifacts that need to be managed properly.
Data Scientists are using a multitude of tools and environments which are not integrated well and don’t easily plug into the CI/CD Tools (mentioned above).
To address these issues, Databricks, along with the open source community, is spearheading the development of MLflow.
But wait, what is MLflow?
MLflow is an open-source platform to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry. Some of its features are as follows-
Works with any ML library, language, and existing code.
Can be easily deployed on any cloud.
Designed to scale from single user to large organizations.
Scales to big data with Apache Spark.
The image below gives a glimpse of MLflow components:
MLflow Tracking: The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files while running your machine learning code and later visualizing the results. It also lets you log and query experiments using the Python, REST, R, and Java APIs.
MLflow Tracking is organized around the concept of runs, which are executions of some piece of data science code. You can read about each run record using the link below-
MLflow Projects: An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions. In addition, the Projects component includes an API and command-line tools for running projects, making it possible to chain projects together into workflows.
At their core, MLflow Projects are just a convention for organizing and describing your code so that other data scientists or automated tools can run it. Each project consists of a directory of files, or a Git repository, containing your code. MLflow can run projects based on a convention for placing files in this directory (for instance: a conda.yaml file is treated as a Conda environment), but you can describe your project in more detail by adding an MLproject file, which is a YAML-formatted text file. You can read more about Projects and their properties using the link below-
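For illustration, a minimal MLproject file might look like the following (the project name, parameter, and training script are hypothetical):

```yaml
name: my_project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
```

Such a project could then be run from its directory with `mlflow run . -P alpha=0.4`.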
MLflow Model: An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools, for example, real-time serving through a REST API or batch inference on Apache Spark. The format defines a convention that lets you save a model in different “flavors” that can be understood by different downstream tools.
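As an illustration of the flavors convention, the MLmodel file saved alongside a model might declare two flavors, so the same artifact can be loaded either generically via python_function or natively via scikit-learn. The exact fields vary by library and version; this is a hedged sketch:

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn
    python_version: 3.8.10
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.24.2
```

A serving tool only needs to understand one of the declared flavors to use the model.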
MLflow Model Registry: The MLflow Model Registry component is a centralized model store, set of APIs, and UI for collaboratively managing the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiment and run produced the model), model versioning, stage transitions (for example: from staging to production), and annotations.
The Model Registry describes and facilitates the full lifecycle of an MLflow Model. You can read more about Model Registry concepts from the link below-
The Azure Databricks Unified Data and Analytics platform includes managed MLflow and makes it easy to leverage advanced MLflow capabilities such as the MLflow Model Registry. In addition, Azure Databricks is tightly integrated with other Azure services, such as Azure DevOps and Azure ML. Azure DevOps is a cloud-based CI/CD environment integrated with many Azure services, and Azure ML is a machine learning platform to which the resulting models can be deployed. For further help on the underlying code and end-to-end governance model, you can refer to the link below-
That was interesting, wasn’t it? We would also like to introduce you to another analytics tool commonly known as ‘SQL Analytics’, recently launched by Databricks.
It is an analytical tool that claims up to nine times better price/performance for BI and analytics when compared to traditional cloud data warehouses. SQL Analytics allows customers to operate a multi-cloud Lakehouse Architecture that provides data warehousing performance at data lake economics. Some of its leading features are-
SQL Analytics can be integrated with widely-used powerful BI tools, such as Microsoft Power BI and Tableau.
It can be used to complement existing BI tools with a SQL-native interface, allowing data analysts and data scientists to directly query data in the data lake from within Databricks.
It helps share query insights through rich visualizations and drag-and-drop dashboards, with prompt automated alerting on changes in the data.
SQL Analytics brings reliability, quality, scale, security, and performance to your data lake to support traditional analytics workloads using the most recent data.
Apart from supporting your existing BI tools, SQL Analytics offers a full-featured SQL-native query editor that allows data analysts to write queries in a familiar syntax and easily explore Delta Lake table schemas. Frequently used SQL code can be saved as snippets for quick reuse, and query results can be cached to reduce run time. The image below gives a glimpse of the SQL-native interface.
You can read more about SQL Analytics from the link below-
We know you are intrigued. So why wait?
Explore how Databricks can help individuals and organizations adopt a Unified Data Analytics approach for better performance and keep ahead of the competition.
Sign up for the community version of Databricks and dive into a plethora of computing capabilities.
Alternatively, you can read more about Databricks from these links: