Over the past two blog posts, we have been developing a vocabulary for data science and data analytics. Those posts form the foundation for this and subsequent discussions. In this post, we introduce the Python programming language via descriptive statistics. This blog post contains screenshots from Jupyter notebooks designed for this purpose. Think of a Jupyter notebook as an active electronic lab report, similar to the ones we wrote for our science courses back in the day.
First, I want to describe the statistics module in more detail. The statistics module, part of Python's standard library, lets programmers compute descriptive statistics on a list of numerical values. Base Python's everyday sequence types are lists and tuples rather than the numeric arrays that NumPy provides, and the statistics module operates on these sequences directly. It is not a comprehensive library, but it is useful for basic analysis. I created a Jupyter notebook highlighting the module along with matplotlib, a visualization library for Python. Matplotlib takes numerical data and creates plots such as histograms, scatterplots, and bar charts. I used the hist() function to output a histogram of the list declared in the program. Here is the notebook as a screenshot:
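For readers who cannot view the screenshot, here is a minimal sketch of the kind of analysis described above. It is not a copy of StatsDemo.ipynb: the scores list and the plot_histogram helper are hypothetical names chosen for illustration, and the data values are made up.

```python
import statistics

# Hypothetical sample data; the list used in the actual notebook is not shown here.
scores = [72, 85, 90, 66, 78, 85, 93, 70, 85, 88]

# Descriptive statistics from the standard-library statistics module.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
stdev = statistics.stdev(scores)        # sample standard deviation
variance = statistics.variance(scores)  # sample variance

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", stdev)
print("variance:", variance)


def plot_histogram(values, bins=5):
    """Draw a histogram of the values with matplotlib's hist() function.

    matplotlib is imported inside the function so the statistics above
    still run on a machine where matplotlib is not installed.
    """
    import matplotlib.pyplot as plt

    plt.hist(values, bins=bins, edgecolor="black")
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.title("Histogram of sample values")
    plt.show()
```

In a Jupyter notebook, calling plot_histogram(scores) in a cell would draw the histogram below that cell. The choice of five bins is arbitrary and worth experimenting with.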
The screenshots of StatsDemo.ipynb show you the power of the module. The %matplotlib inline magic command tells Jupyter to display matplotlib plots directly in the notebook. The comments in the notebook explain the functions and how they work. The histogram is a standard way to describe the distribution of numerical data and to judge visually whether it looks approximately normal. As I mentioned in the last post, this and other notebooks are works in progress. Readers can expand upon the notebooks and add new features to make them living lab reports. The second Jupyter notebook demonstrates the Pandas library. Pandas is a more advanced library for data analysis in Python. It is built on NumPy and is often used alongside SciPy to run analysis on data sets. Here is a screenshot of the Pandas notebook:
Looking at the Pandas notebook, we see several things. The first is the output, which is displayed as a data frame. The DataFrame() constructor in Pandas takes a list of values and arranges them in a table. Seeing data laid out in a table gives the data scientist a visual description of a data set and helps in formulating research questions about the data. The second is the describe() method, which outputs several descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) but not the variance. The variance is calculated using the var() method in Pandas. The Pandas output is what we expect when we use packages such as SAS and Minitab. A student with strong Python skills can take full advantage of the language’s capabilities in a statistics course.
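The steps described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the notebook itself: the column name "score" and the data values are hypothetical, and it assumes Pandas is installed.

```python
import pandas as pd

# Hypothetical sample data; the notebook's own values are not reproduced here.
values = [72, 85, 90, 66, 78, 85, 93, 70, 85, 88]

# Build a one-column data frame; "score" is an assumed column name.
df = pd.DataFrame({"score": values})
print(df)

# describe() reports count, mean, std, min, quartiles, and max.
summary = df.describe()
print(summary)

# The variance is not part of describe(), so compute it separately.
variance = df["score"].var()  # sample variance by default
print("variance:", variance)
```

Because describe() reports the sample standard deviation, squaring its std value should reproduce the result of var(), which makes a quick sanity check on the output.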
In this post, we have introduced the Python programming language via descriptive statistics. As the series progresses, we will explain Python syntax using our Jupyter notebooks. In the next post, I will continue with descriptive statistical analysis using NumPy and SciPy. Following that post will be an introduction to linear regression via scikit-learn.