Role Of Statistics in Data Science

Definitions of Data Science and Statistics from the Data Science Association's “personal code of conduct”:

“Data Scientist” means a professional who uses scientific methods to liberate and create meaning from raw data.

To a statistician, this sounds a lot like what applied statisticians do: use methodology to make inferences from data.

“Statistics” means the practice or science of collecting and analyzing numerical data in large quantities.

The Relationship Between the Two Fields, “Statistics & Data Science”:

Does statistics play a crucial role in the field of Data Science? In most cases, the answer is yes. Statistics is foundational to Data Science, and there is a strong relationship between the two fields. Statistics is one of the most important disciplines providing tools and methods to find structure in data and to gain deeper insights from it.

Machine Learning is a rapidly growing field at the intersection of computer science and statistics, concerned with finding patterns in data. It is responsible for various technological advancements, from product recommendations to speech recognition to autonomous driving; machine learning has a presence in almost every field.

Is Data Science all about statistics?

No. Data Science is not only statistics; it is a field comprising Statistics, Probability, Mathematics (mainly Linear Algebra and Calculus), and Programming.

How do Data Scientists take advantage of statistics knowledge?

They build models using popular statistical methods such as Regression, Classification, Time Series Analysis, and Hypothesis Testing. Data Scientists run suitable experiments and interpret the results with the help of these statistical methods.

Statistics is also used to summarize data quickly.

How can one apply this knowledge of Statistics and Data Science?

Numerous challenges of various levels and domains are posted on Topcoder at https://www.topcoder.com/challenges?filter[tracks][data_science]=true&tab=details . Anyone can choose an appropriate challenge according to their individual interests and skills and work on it. Apart from gaining experience with real data science challenges, participants are rewarded with cash prizes.

Example:

Let's discuss the role of statistics in Machine Learning (itself an integral part of Data Science) with an example of a classification problem.

Before diving into how we can solve a classification problem using a popular statistical model, let's first briefly cover what classification problems are.

Classification problems are an essential part of Machine Learning; by some estimates, around 70% of problems in Data Science are classification problems. These are problems with a qualitative response, such as whether an email is “Spam” or “Not Spam”, or whether a cancer is “Malignant” or “Benign”. Classification methods typically first predict the probability of each category of the qualitative variable, then use those probabilities as the basis for the classification.
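The probability-then-classify step can be sketched in a few lines of Python (the email probabilities here are made-up values for illustration):

```python
# Hypothetical predicted probabilities that each of three emails is "Spam"
probs = [0.92, 0.08, 0.55]
threshold = 0.5  # a common default cutoff for binary classification

# Turn probabilities into class labels by comparing against the threshold
labels = ["Spam" if p > threshold else "Not Spam" for p in probs]
print(labels)  # ['Spam', 'Not Spam', 'Spam']
```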

There are a number of classification techniques one can use to predict a qualitative response. The most widely used classification techniques are:

  • Logistic Regression
  • Linear Discriminant Analysis
  • K-nearest neighbors

Let's talk more about Logistic Regression in this blog post.

Logistic Regression is one of the most popular classification methods for predicting a qualitative response, for example, predicting whether or not a patient has cancer, or whether or not a particular customer will churn.

Logistic Regression does not give a straight-line fit for predicting the response; instead, it uses the logistic function for the prediction.
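For reference, with a single predictor X and coefficients β₀ and β₁, the logistic function takes the standard textbook form (the post itself does not spell the formula out, so this is the usual presentation):

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
     = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
```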

The logistic function always produces an S-shaped curve: it maps any real-valued input to an output between 0 and 1. This function is also known as the sigmoid function.

Sigmoid Function

In the sigmoid function, as the input goes to positive infinity the predicted value approaches 1, and as the input goes to negative infinity the predicted value approaches 0.
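A minimal numerical check of this limiting behaviour, using only Python's standard library:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 — the midpoint of the curve
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```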

The coefficients in the logistic function need to be estimated from the available training data. Maximum Likelihood Estimation (MLE) is the preferred method for estimating the coefficients since it has good statistical properties. The basic idea behind using maximum likelihood to fit a logistic regression model is to seek coefficient estimates such that the predicted probabilities are as close as possible to the actual observations.
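Concretely, maximizing the likelihood is equivalent to minimizing the negative log-likelihood (also called log loss). The toy labels and probabilities below are made-up values just to show that probabilities close to the observed 0/1 labels score better:

```python
import math

def neg_log_likelihood(y_true, p_pred):
    # MLE chooses coefficients that make this quantity as small as possible,
    # i.e. predicted probabilities as close as possible to the 0/1 labels.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))

y = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]   # probabilities close to the labels
bad = [0.5, 0.5, 0.5, 0.5]    # uninformative coin-flip probabilities

print(neg_log_likelihood(y, good) < neg_log_likelihood(y, bad))  # True
```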

Let's build a logistic regression model for diabetes prediction.

We will use a dataset from the Kaggle website. The link for the dataset is: https://www.kaggle.com/uciml/pima-indians-diabetes-database

First, load the Pima Indians diabetes dataset using read_csv():

import pandas as pd
column_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=column_names)
pima.head()
# split dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]  # features
y = pima.label  # target variable

Splitting the dataset into training and test sets is a good strategy for analyzing model performance.

# split X and y into training and testing sets
# (sklearn.cross_validation was removed; train_test_split now lives in model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Model Development and Prediction

For model development and prediction we will use the logistic regression class from the scikit-learn library.

We will use fit() to fit the model and predict() to make predictions on the test data.

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (default parameters, but with max_iter raised
# so the default lbfgs solver converges on this dataset)
logreg = LogisticRegression(max_iter=1000)

# fit the model with the training data
logreg.fit(X_train, y_train)

# predict on the test set
y_pred = logreg.predict(X_test)

Performance Evaluation Using a Confusion Matrix

# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

The confusion matrix is an array object. Its dimension is 2×2 since this is a binary classification problem. In the output shown below, 119 and 36 are correct predictions, while 26 and 11 are incorrect predictions.

array([[119,  11],
       [ 26,  36]])
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Accuracy: 0.8072916666666666
Precision: 0.7659574468085106
Recall: 0.5806451612903226
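As a sanity check, the three printed metrics can be recomputed by hand from the confusion matrix entries (with labels ordered [0, 1], rows are actual classes and columns are predicted classes):

```python
# Entries taken from the confusion matrix printed above
tn, fp = 119, 11   # actual 0: correctly / incorrectly classified
fn, tp = 26, 36    # actual 1: missed / correctly classified

total = tn + fp + fn + tp
accuracy = (tn + tp) / total    # correct predictions over all predictions
precision = tp / (tp + fp)      # of predicted positives, how many are real
recall = tp / (tp + fn)         # of actual positives, how many were found

print(accuracy)   # 0.8072916666666666
print(precision)  # 0.7659574468085106
print(recall)     # 0.5806451612903226
```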

The accuracy of the prediction model is about 80%, which is a good accuracy rate.

Precision: precision is about being precise, i.e., how accurate the model's positive predictions are. In this case, when the logistic regression model predicted that a patient would suffer from diabetes, the patient actually had diabetes about 76% of the time.

Recall: the logistic regression model correctly identified about 58% of the patients in the test set who actually have diabetes.

Conclusion:

In this post we learned that statistics can go a long way toward helping data scientists produce solid and dependable business insights. We also learned about the very popular statistical model, logistic regression, and how it can be used to make predictions in classification problems.

Hopefully, you can now utilize the logistic regression technique to analyze your own datasets. Thanks for reading this blog.