Introduction to Python: The Basics via Descriptive Statistics and Libraries, Part II

Over the past few weeks, we have written posts introducing the Topcoder community to the basic concepts of data science and data analytics. Python has been the language of choice to illustrate these concepts through Jupyter notebooks. Today, I want to continue the Python basics series and look at two libraries used in scientific computation: NumPy and SciPy.

Even though base Python does not have a true array type, NumPy extends the language with array manipulation. SciPy adds further functionality for statistical analysis.

The last post illustrated the Pandas and statistics libraries for analysis, plus matplotlib to output a basic histogram of our distribution. The latest notebook uploaded to GitHub demonstrates NumPy and SciPy. Both libraries must be declared as follows:
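As a sketch, the two declarations look like this (the `np` alias is the conventional choice and an assumption on my part; the notebook may write it differently):

```python
import numpy as np       # array manipulation and basic statistics
from scipy import stats  # SciPy's statistical functions (mode, sem, ...)
```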

The second declaration tells Python to access the statistical functions within the SciPy library. Both NumPy and SciPy have functions to calculate statistics; however, SciPy must be used to calculate the mode and the standard error of the mean. It is not unusual to see data science programs in Python that mix libraries, since none of Pandas, NumPy, SciPy, or the statistics module offers complete functionality for analysis on its own. The sample program in this post uses NumPy and SciPy to analyze a “dummy” list.

Here is the sample program to examine while reading the post. The output is included to show the values for each statistical calculation. The emphasis is on introducing Python basics by demonstrating the libraries for data science. Remember, the print() function outputs statements as defined by the programmer:

Figure 1: NumPyDemo.ipynb
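This first cell of the notebook can be sketched roughly as follows. The specific values in data are illustrative placeholders (any list of values from 40 to 60 with a mode of 50 fits the description), and the axis labels and title are assumptions:

```python
# In a Jupyter notebook, run %matplotlib inline so plots render in the cell;
# the Agg backend below is only for running this sketch as a plain script.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Illustrative "dummy" list: values from 40 to 60 with a mode of 50
data = [40, 42, 44, 45, 47, 48, 50, 50, 50, 52, 53, 55, 57, 58, 60]

plt.hist(data)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Distribution of the dummy list")
plt.show()
```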

As mentioned earlier, we declare NumPy and SciPy to use their functions in our analysis. The list data contains values ranging from 40 to 60, with a mode of 50, as a subsequent screenshot will show. We call the hist() function along with xlabel(), ylabel(), and title() to display the histogram. Recall that %matplotlib inline renders the graph within the Jupyter notebook, and show() displays it. Continuing our examination of the program, we have this snippet:

Figure 2: NumPyDemo.ipynb (Cont.)
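A sketch of this cell (variable names are my own, and data is the illustrative list from above; a notebook would reuse the list already defined):

```python
import numpy as np
from scipy import stats

# Same illustrative "dummy" list as before (mode of 50)
data = [40, 42, 44, 45, 47, 48, 50, 50, 50, 52, 53, 55, 57, 58, 60]

mean_value = np.mean(data)      # arithmetic mean
median_value = np.median(data)  # middle value of the sorted list
mode_result = stats.mode(data)  # most frequent value and its count

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_result.mode)
```

Note that stats.mode() returns a result object; depending on the SciPy version, its mode attribute is either a one-element array or a scalar.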

We call the mean() and median() functions from NumPy and stats.mode() from SciPy to calculate the mean, median, and mode, respectively. These are the basic values that describe our distribution. In our next screenshot, we calculate the variance (var()), range (ptp()), and standard deviation (std()), all from NumPy. The range function in NumPy is named ptp(), short for “peak to peak”, because it returns the maximum minus the minimum:

Figure 3: NumPyDemo.ipynb (Cont.)
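A sketch of this cell, again with the illustrative list. One detail worth knowing: NumPy's var() and std() default to the population statistics (ddof=0); whether the notebook changes that default is not stated, so this sketch keeps it:

```python
import numpy as np

# Same illustrative "dummy" list as before
data = [40, 42, 44, 45, 47, 48, 50, 50, 50, 52, 53, 55, 57, 58, 60]

variance = np.var(data)     # population variance (ddof=0 by default)
value_range = np.ptp(data)  # "peak to peak": max(data) - min(data)
std_dev = np.std(data)      # population standard deviation

print("Variance:", variance)
print("Range:", value_range)
print("Standard deviation:", std_dev)
```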

Finally, we calculate the standard error of the mean using the stats.sem() function from SciPy and output the values of each descriptive statistic, plus our histogram:

Figure 4: NumPyDemo.ipynb (Cont.)
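The final calculation can be sketched as below, using the same illustrative list. Unlike np.std(), SciPy's stats.sem() defaults to the sample standard deviation (ddof=1) when computing the standard error:

```python
import numpy as np
from scipy import stats

# Same illustrative "dummy" list as before
data = [40, 42, 44, 45, 47, 48, 50, 50, 50, 52, 53, 55, 57, 58, 60]

sem_value = stats.sem(data)  # standard error of the mean (ddof=1 by default)
print("Standard error of the mean:", sem_value)
```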

Figure 5: NumPyDemo.ipynb (Cont.)

The histogram shows a distribution that is not normal, but the point of this exercise is to see how the values look graphically. The main takeaway is that NumPy and SciPy together can perform basic statistical analysis. Once again, you are encouraged to work with the notebook and improve upon the code and output.

Our final introductory post will look at scikit-learn and its regression analysis capabilities. Then we will go through the basics of Python syntax, again using the data science libraries in example code.