In my last article, I presented Python programming using IPython. There, I used an example of logistic regression modeling for mothers of children with low birth weights. In this article, using the same example, I introduce Random Forest with the IPython Notebook.
Random Forest is a machine learning algorithm used for classification, regression, and feature selection. It’s an ensemble technique, meaning it combines the output of decision trees in order to get a stronger result.
In simple terms, Random Forest works by averaging the output of many decision trees. It can also rank features: each tree's output is compared to the known output from the training data, so better-performing trees can be identified, and the features those trees rely on are deemed more important. A Random Forest that generalizes well has trees that are individually accurate and, collectively, diverse.
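To make the averaging idea concrete, here is a minimal sketch, using a small synthetic dataset rather than the birth weight data, showing that scikit-learn's forest probabilities are simply the mean of its individual trees' probabilities (scikit-learn averages probabilistic predictions rather than taking a strict majority vote):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# a small synthetic classification problem, purely for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# average each tree's predicted probabilities by hand
avg_probs = np.mean([tree.predict_proba(X) for tree in rf_demo.estimators_], axis=0)
print(np.allclose(avg_probs, rf_demo.predict_proba(X)))  # True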
The Dataset
In this example, we are going to train a Random Forest classification algorithm to predict the class in the test data. The dataset I chose for this example is the Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT). [Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition.] These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. I have split the data so each class is represented in both a training set and a testing set: train1 is one half of the data (245 rows) and test1 is the other half (245 rows).
| Variable Description | Codes/Values | Name |
|----------------------|--------------|------|
| Identification Code | ID Number | ID |
| Birth Number | 1-4 | BIRTH |
| Smoking Status During Pregnancy | 0 = No, 1 = Yes | SMOKE |
| Race | 1 = White, 2 = Black, 3 = Other | RACE |
| Age of Mother | Years | AGE |
| Weight of Mother at Last Menstrual Period | Pounds | LWT |
| Birth Weight | Grams | BWT |
| Low Birth Weight | 1 = BWT <=2500g, 0 = BWT >2500g | LOW |
Problem Statement
In this example, we want to predict Low Birth Weight (LOW) using the remaining dataset variables. LOW, the dependent variable, is coded 1 when BWT <= 2500g and 0 when BWT > 2500g.
Import Modules
Note: you must have scikit-learn, pandas, NumPy, and SciPy installed for this example. You can install them all easily using pip (pip install scipy, etc.). You could also download the Anaconda distribution, which bundles them.
# First let's import required modules
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics  # used later to evaluate the predictions
Import Datasets
Now let’s import the datasets using pandas (imported above as pd).
# Make sure you're in the right directory if using iPython
train = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/train1.csv")
test = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/test1.csv")
train.head()
Data Visualization
Before we delve into modeling, let’s explore the data a little. We will use histograms to do this, and plot them within the Notebook.
# show plots in the notebook
%matplotlib inline
# histogram of birth number
train.BIRTH.hist()
plt.title('Histogram of Birth Number')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
# histogram of age of mother
train.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
Let’s take a look at the distribution of smokers among mothers who had children with low birth weights versus those who did not.
# Barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(train.SMOKE, train.LOW.astype(bool)).plot(kind='bar')
plt.title('Smoker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
Configure the Data
The training data have to be put into NumPy arrays before the Random Forest algorithm will accept them. Also, the dependent variable must be a 1-d array, as opposed to a column vector. Selecting the columns and calling to_numpy() produces the arrays, and np.ravel() flattens the column vector into a 1-d array.
# The data have to be in a numpy array in order for the random forest algorithm to accept it
# Also, the output column must be separated from the features
cols = ['BIRTH', 'SMOKE', 'RACE', 'AGE', 'LWT', 'BWT']
colsRes = ['LOW']
trainArr = train[cols].to_numpy()               # training array
trainRes = np.ravel(train[colsRes].to_numpy())  # training results
Let’s check our arrays.
trainArr
trainRes
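A quick shape check confirms the layout: the features form a 2-d array and the response a 1-d array. If the 245-row halves described above were read correctly, this should print (245, 6) and (245,):

print(trainArr.shape)  # expect (245, 6)
print(trainRes.shape)  # expect (245,)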
Fit the Data
Now, we fit the data using Random Forest.
## Training
rf = RandomForestClassifier(n_estimators=100)  # initialize the classifier with 100 trees
rf.fit(trainArr, trainRes)                     # fit the training data
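Since we imported cross_val_score earlier, we can also get a rough out-of-sample estimate of accuracy before touching the test set, and inspect the feature rankings of the fitted forest. This is just a sketch; the exact numbers will vary with the data:

# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(rf, trainArr, trainRes, cv=5)
print(scores.mean())

# relative importance the forest assigns to each feature
for name, importance in zip(cols, rf.feature_importances_):
    print(name, importance)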
Prepare the Testing Data
We prepare the testing data the same way we did the training data.
## Testing
# put the test data in the same format as the training data
testArr = test[cols].to_numpy()
results = rf.predict(testArr)
Predictions
Next, we add the predictions we obtained from the test data back to the data frame, so we can compare them side by side.
# Add predictions back to the data frame test['predictions'] = results
test
Predicting Probabilities
We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.
predicted = rf.predict(testArr)
print(predicted)
# generate class probabilities
probs = rf.predict_proba(testArr)
print(probs)
Predicting the Probability of a Low Birth Weight Child
Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She is 35 years old, of Other race (3), has had 2 births, is a smoker, and weighed 132 pounds at her last menstrual period. The feature order is [BIRTH, SMOKE, RACE, AGE, LWT, BWT].
# predict_proba expects a 2-d array: one row per observation,
# with columns in the order [BIRTH, SMOKE, RACE, AGE, LWT, BWT]
# (BWT is unknown for a future birth, so its value here is a placeholder)
rf.predict_proba(np.array([[2, 1, 3, 35, 132, 1]]))
Accuracy Check
Finally, we check the accuracy on the test set and generate evaluation metrics.
testRes = np.ravel(test[colsRes].to_numpy())  # testing results
# check the accuracy on the test set
rf.score(testArr, testRes)
# generate evaluation metrics
print(metrics.accuracy_score(testRes, predicted))
print(metrics.roc_auc_score(testRes, probs[:, 1]))
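Beyond accuracy and AUC, a confusion matrix and classification report, both available in the same metrics module, give a more detailed view of where the classifier succeeds or fails; a brief sketch:

# rows are actual classes, columns are predicted classes
print(metrics.confusion_matrix(testRes, predicted))
print(metrics.classification_report(testRes, predicted))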
Though this will not always happen, our predictions appear to be perfect. This is less surprising than it looks: BWT, the variable from which LOW is derived, is included among the predictors.
Conclusion
The Random Forest algorithm predicted the class perfectly with this dataset. That is unlikely to happen with larger datasets, i.e., those with more records and more variables.
Sometimes in machine learning, models are overfitted. That is, we build models so specific to the training data that they take on its random noise, which causes problems when we try to generalize. As good practice, if the initial dataset is large enough, we split the data into training and test sets, as was done here; a sketch of doing so programmatically follows.
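For instance, scikit-learn's train_test_split makes such a split easy. This is a minimal sketch, assuming the full dataset has been read into a single data frame called data (a hypothetical name):

from sklearn.model_selection import train_test_split

# hold out half of the rows for testing, mirroring the train1/test1 split
train_df, test_df = train_test_split(data, test_size=0.5, random_state=42)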
Authored by:
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices
