July 15, 2019 The Data Science Life Cycle
In my initial post for the data science blog series on Topcoder, I gave a general introduction to data science, data analytics, and Python. The first article contains links to all of the resources referenced in each subsequent article. You are encouraged to follow the links and look at those resources to guide you on your path to data science. In this article, I want to introduce the data science life cycle. Data science does not happen by magic; it is a logical process designed to gain insights from data. I will first look closely at the definitions of data science and data analytics, then I will introduce each step of the life cycle. Finally, I will lead into the next article, which is an introduction to Python using the statistics, NumPy, SciPy, and Pandas libraries.
I have created a series of Jupyter notebooks for this article, and I am actively debugging those programs.
A Jupyter notebook is an interactive document served by a local web server and opened in a browser. It combines runnable Python code, the output of that code, and explanatory text, which makes it well suited to communicating the results of data analysis.
First, I want to review the definitions of data science and data analytics.
- Data science brings many disciplines together for the purpose of gaining insights from data. The data can be structured or unstructured, but it must be converted into a format that statistical and mathematical analysis can be run on. Data science is a relatively new field with its foundations in data mining.
- Data mining is applying the concepts of database theory to find patterns in huge data stores utilizing statistical methodology. Data mining has been used by developers for decades and only recently has it gained importance. Data is being generated almost every second from a variety of devices. Developers and scientists must be able to manage and analyze this data to gain insights on a variety of issues from diagnosis of diseases to picking out high-performing stocks and bonds. Data science is not to be confused with data analytics.
- Data analytics is a specific application of data science. Data analytics is the process of cleaning, transforming, and analyzing data. In other words, analytics refers to the methods of data science, while data science is the field itself. Knowing the correct terminology is critical in understanding how data is cleaned, transformed, and analyzed.
Next, I want to go into the steps of the data science life cycle. The data science life cycle deals with the steps required to gain insight from data. In the Towards Data Science Medium blog, Dr. Lau goes through the following five steps, known as the OSEMN framework (Obtain, Scrub, Explore, Model, iNterpret):
- The first step is obtaining the data. There are many sources of data available for analysis and insight. Python can be used to read data into programs, cleanse the data, and organize the data in a format conducive to statistical analysis. Data.world is one website with many data sets. These data sets can range from a few hundred observations to large sets that require high-performance computing to analyze in a reasonable amount of time. Another source of data is FiveThirtyEight, Nate Silver’s data science blog and news site. He is known for predicting Obama’s win in 2008 and has since forecast the outcomes of many political races on both sides of the aisle. The site features data sets which can be downloaded. Data.world and FiveThirtyEight are two of the many websites used to obtain data.
- Once we obtain data for analysis, the second step is scrubbing it. Some curated data sets are easy to manipulate and need little or no further work before analysis, but many contain missing, inconsistent, or badly formatted values that must be fixed first. Contrast the curated sets with a Twitter stream collected for sentiment analysis, where the data is unstructured. This unstructured data has to be converted to a useful format. In Python, for example, we can use Pandas to organize data in tabular format.
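To make the scrubbing step concrete, here is a minimal sketch using only Python's built-in csv module rather than Pandas. The raw text, the column names, and the `scrub` helper are all invented for illustration; a real data set would need its own cleaning rules.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, a missing score,
# and a non-numeric entry. The column names are made up for this sketch.
RAW_CSV = """name, score
alice, 25
bob,
carol, 80
dave, n/a
erin, 90
"""


def scrub(raw_text):
    """Return a list of {'name': str, 'score': float} rows, dropping bad records."""
    reader = csv.DictReader(io.StringIO(raw_text), skipinitialspace=True)
    clean_rows = []
    for row in reader:
        name = (row.get("name") or "").strip()
        score_text = (row.get("score") or "").strip()
        try:
            score = float(score_text)
        except ValueError:
            # Skip records whose score is missing or not a number.
            continue
        clean_rows.append({"name": name.title(), "score": score})
    return clean_rows


rows = scrub(RAW_CSV)
for row in rows:
    # Print the cleaned records as a simple two-column table.
    print(f"{row['name']:<8}{row['score']:>6.1f}")
```

Dropping bad records is only one possible policy. Depending on the analysis, it may be more appropriate to fill in missing values or to flag them for review.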
Next, we explore the data to see if any initial insights can be gained from our organized and scrubbed data sets. Exploration can be done with the matplotlib library in Python and includes scatterplots, histograms, normal curves, Pareto charts, etc. In tandem with charts and graphs, summary statistics can be calculated on the data set to get general information. This includes finding the mean, median, and mode of the data. We can also find the variance and standard deviation of the data sets to see how widely the values are spread around the mean. Here is a snippet of code from one of the notebooks (I included the comments to explain each step):
import statistics

# The statistics module works on lists (and other sequences of numbers)
my_list = [25, 35, 75, 80, 80, 89, 90, 20]
# First, we will find the mean
my_list_mean = statistics.mean(my_list)
# The print function writes its arguments to the screen
print("The mean is", my_list_mean)
# Next, we find the median.
# The median is the middle value of the sorted data.
# If the data set has an odd number of elements,
# the median is the single middle value.
# If it has an even number of elements, the median is
# the mean of the two middle values (their sum divided by two).
# statistics.median() sorts the data for us.
my_list_median = statistics.median(my_list)
print("The median is", my_list_median)
# The mode is the element that appears most often
# in the list. In the example list, 80 appears twice and
# every other value appears once, so the mode is 80.
# Note: before Python 3.8, statistics.mode() raises a
# StatisticsError when several values tie for most common.
my_list_mode = statistics.mode(my_list)
print("The mode is", my_list_mode)
# The variance measures the spread of a distribution
# of data. There are two types of variance: the sample
# variance, which estimates spread from a sample (dividing by n - 1),
# and the population variance, which describes the spread of an
# entire population (dividing by n). The statistics module provides
# functions for both the population and sample variance.
my_list_svariance = statistics.variance(my_list)
print("The sample variance is", my_list_svariance)
my_list_popvariance = statistics.pvariance(my_list)
print("The population variance is", my_list_popvariance)
# The standard deviation is defined mathematically
# as the square root of the variance. It quantifies
# the spread or dispersion of a distribution in the
# same units as the data. The statistics module provides
# functions for both the population and sample standard deviation.
my_list_sstdev = statistics.stdev(my_list)
print("The sample standard deviation is", my_list_sstdev)
my_list_popstnd = statistics.pstdev(my_list)
print("The population standard deviation is", my_list_popstnd)
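As a rough companion to the summary statistics, the sketch below groups the same example list into bins and prints a text histogram, a plain-text stand-in for the kind of chart matplotlib would draw. The `bin_counts` helper and the bin width of 20 are assumptions chosen for this illustration.

```python
from collections import Counter
import statistics

my_list = [25, 35, 75, 80, 80, 89, 90, 20]


def bin_counts(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    # Sort by bin start so the histogram prints from low to high.
    return dict(sorted(counts.items()))


BIN_WIDTH = 20  # arbitrary choice for this small data set
counts = bin_counts(my_list, BIN_WIDTH)
for start, count in counts.items():
    end = start + BIN_WIDTH - 1
    print(f"{start:>3}-{end:<3} {'#' * count}")

# multimode() (Python 3.8+) returns every most-common value as a list,
# so it never raises an error when several values tie.
print("Modes:", statistics.multimode(my_list))
```

Even this crude picture shows something the mean alone hides: the values cluster in a low group and a high group, with little in between.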
- Then we model the data we have obtained, cleansed, transformed, and explored. When data scientists think of the term “model,” they usually think of machine learning techniques. Machine learning uses algorithms, programming, and statistics to model real-world tasks. An example of this modeling is the Roomba, a robot vacuum cleaner that uses intelligent algorithms to clean your living space. The vacuum cleaner learns the layout of your rooms and cleans them accordingly. In Python, we use the scikit-learn library to perform machine learning, which will be discussed later in the series.
- Finally, we interpret the data we have analyzed, transformed, and modeled. A common approach is hypothesis testing, where we test a null hypothesis against an alternative hypothesis. From the regression analysis perspective, we study the regression equation. The simple linear regression equation has the slope-intercept form, y = mx + b. Based on the sign of the slope m, we determine whether there is a positive or a negative relationship in our data sets. Even though Python is used to analyze the data and give output, it is up to the data scientist to make correct inferences about the data.
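The regression idea above can be sketched in a few lines of plain Python. The `fit_line` helper below is an illustrative ordinary least-squares fit, not code from the notebooks, and the hours and scores are made-up numbers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept
    return m, b


# Made-up data: hours studied versus exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

m, b = fit_line(hours, scores)
direction = "positive" if m > 0 else "negative" if m < 0 else "no linear"
print(f"y = {m:.2f}x + {b:.2f} ({direction} relationship)")
```

The sign of the slope gives the direction of the relationship, but not whether it is statistically meaningful. That judgment still calls for a hypothesis test and the data scientist's own interpretation.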
Finally, I want to lead into the next article, which is an introduction to Python via NumPy, SciPy, Pandas, and the statistics module. As I mentioned earlier, I am working on a series of Jupyter notebooks with example programs calculating descriptive statistics. The emphasis at this point is on looking at how a Python program is structured and explaining the building blocks of the programs/notebooks. I am continually refining these programs, so check the repository periodically until the next installment to view any revisions I have made to the code. The Python standard library offers the statistics module, which provides basic descriptive statistics such as the mean, median, mode, variance, and standard deviation. More advanced techniques are provided via NumPy, SciPy, and Pandas.
This article in the series dealt with the data science life cycle using the OSEMN framework and further clarified the terms data science and data analytics. In the next article, we explore the Python programming language.