In my last article, I presented Python programming using IPython. There, I used an example of logistic regression modeling for mothers of children with low birth weights. In this article, using the same example, I introduce Random Forest with the IPython Notebook.
Random Forest is a machine learning algorithm used for classification, regression, and feature selection. It’s an ensemble technique, meaning it combines the output of decision trees in order to get a stronger result.
In simple terms, Random Forest works by averaging the output of many decision trees. It can also rank features: each tree's output is compared to the known output from the training data, so better-performing trees can be identified, and the features those trees rely on are deemed more important. A Random Forest that generalizes well has trees that are individually accurate and, collectively, diverse.
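To make the averaging idea concrete, here is a minimal sketch, using a small synthetic dataset rather than the birth weight data, showing that scikit-learn's forest probabilities are simply the mean of its individual trees' probabilities (scikit-learn averages probabilistic predictions rather than taking a strict majority vote):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# a small synthetic classification problem, purely for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# average each tree's predicted probabilities by hand
avg_probs = np.mean([tree.predict_proba(X) for tree in rf_demo.estimators_], axis=0)
print(np.allclose(avg_probs, rf_demo.predict_proba(X)))  # True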
The Dataset
In this example, we are going to train a Random Forest classification algorithm to predict the class in the test data. The dataset I chose for this example is the Longitudinal Low Birth Weight Study (CLSLOWBWT.DAT). [Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition.] These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. I have split the data so each class is represented in both a training set and a testing set: train1 is one half of the data (245 rows) and test1 is the other half (245 rows).
| Variable Description | Codes/Values | Name |
|----------------------|--------------|------|
| Identification Code | ID Number | ID |
| Birth Number | 1-4 | BIRTH |
| Smoking Status During Pregnancy | 0 = No, 1 = Yes | SMOKE |
| Race | 1 = White, 2 = Black, 3 = Other | RACE |
| Age of Mother | Years | AGE |
| Weight of Mother at Last Menstrual Period | Pounds | LWT |
| Birth Weight | Grams | BWT |
| Low Birth Weight | 1 = BWT <=2500g, 0 = BWT >2500g | LOW |
Problem Statement
In this example, we want to predict Low Birth Weight (LOW) using the remaining dataset variables. LOW, the dependent variable, is coded 1 when BWT <= 2500g and 0 when BWT > 2500g.
Import Modules
Note: you must have scikit-learn, pandas, NumPy, and SciPy installed for this example. You can install them all easily using pip (pip install scipy, etc.). You could also download the Anaconda distribution, which bundles them.
# First let's import required modules
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics  # used later to evaluate the predictions
Import Datasets
Now let’s import the datasets using pandas (imported above as pd).
# Make sure you're in the right directory if using iPython
train = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/train1.csv")
test = pd.read_csv("C:/Users/Strickland/Documents/Python Scripts/test1.csv")
train.head()
Data Visualization
Before we delve into modeling, let’s explore the data a little. We will use histograms to do this, and plot them within the Notebook.
# show plots in the notebook
%matplotlib inline
# histogram of birth number
train.BIRTH.hist()
plt.title('Histogram of Birth Number')
plt.xlabel('Birth Number')
plt.ylabel('Frequency')
# histogram of age of mother
train.AGE.hist()
plt.title('Histogram of Age of Mother')
plt.xlabel('Age')
plt.ylabel('Frequency')
Let’s take a look at the distribution of smokers among mothers who had children with low birth weights versus those who did not.
# Barplot of low birth weights grouped by smoker status (True or False)
pd.crosstab(train.SMOKE, train.LOW.astype(bool)).plot(kind='bar')
plt.title('Smoker Distribution by Low Birth Weight')
plt.xlabel('Smoker')
plt.ylabel('Frequency')
Configure the Data
The training data have to be put into NumPy arrays before the Random Forest algorithm will accept them. Also, the dependent variable must be a 1-d array, as opposed to a column vector. Selecting the columns and calling to_numpy() produces the arrays, and np.ravel() flattens the column vector into a 1-d array.
# The data have to be in a numpy array in order for the random forest algorithm to accept it
# Also, the output column must be separated from the features
cols = ['BIRTH', 'SMOKE', 'RACE', 'AGE', 'LWT', 'BWT']
colsRes = ['LOW']
trainArr = train[cols].to_numpy()               # training array
trainRes = np.ravel(train[colsRes].to_numpy())  # training results
Let’s check our arrays.
trainArr
trainRes
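A quick shape check confirms the layout: the features form a 2-d array and the response a 1-d array. If the 245-row halves described above were read correctly, this should print (245, 6) and (245,):

print(trainArr.shape)  # expect (245, 6)
print(trainRes.shape)  # expect (245,)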
Fit the Data
Now, we fit the data using Random Forest.
## Training
rf = RandomForestClassifier(n_estimators=100)  # initialize the classifier with 100 trees
rf.fit(trainArr, trainRes)                     # fit the training data
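Since we imported cross_val_score earlier, we can also get a rough out-of-sample estimate of accuracy before touching the test set, and inspect the feature rankings of the fitted forest. This is just a sketch; the exact numbers will vary with the data:

# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(rf, trainArr, trainRes, cv=5)
print(scores.mean())

# relative importance the forest assigns to each feature
for name, importance in zip(cols, rf.feature_importances_):
    print(name, importance)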
Prepare the Testing Data
We prepare the testing data the same way we did the training data.
## Testing
# put the test data in the same format as the training data
testArr = test[cols].to_numpy()
results = rf.predict(testArr)
Predictions
Next, we add the predictions we obtained from the test data back to the data frame, so we can compare them side by side.
# Add predictions back to the data frame test['predictions'] = results
test
Predicting Probabilities
We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.
predicted = rf.predict(testArr)
print(predicted)
# generate class probabilities
probs = rf.predict_proba(testArr)
print(probs)
Predicting the Probability of a Low Birth Weight Child
Just for fun, let’s predict the probability of a low birth weight child for a random woman not present in the dataset. She is 35 years old, of Other race (3), has had 2 births, is a smoker, and weighed 132 pounds at her last menstrual period. The feature order is [BIRTH, SMOKE, RACE, AGE, LWT, BWT].
# predict_proba expects a 2-d array: one row per observation,
# with columns in the order [BIRTH, SMOKE, RACE, AGE, LWT, BWT]
# (BWT is unknown for a future birth, so its value here is a placeholder)
rf.predict_proba(np.array([[2, 1, 3, 35, 132, 1]]))
Accuracy Check
Finally, we check the accuracy on the test set and generate evaluation metrics.
testRes = np.ravel(test[colsRes].to_numpy())  # testing results
# check the accuracy on the test set
rf.score(testArr, testRes)
# generate evaluation metrics
print(metrics.accuracy_score(testRes, predicted))
print(metrics.roc_auc_score(testRes, probs[:, 1]))
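Beyond accuracy and AUC, a confusion matrix and classification report, both available in the same metrics module, give a more detailed view of where the classifier succeeds or fails; a brief sketch:

# rows are actual classes, columns are predicted classes
print(metrics.confusion_matrix(testRes, predicted))
print(metrics.classification_report(testRes, predicted))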
Though this will not always happen, our predictions appear to be perfect. This is less surprising than it looks: BWT, the variable from which LOW is derived, is included among the predictors.
Conclusion
The Random Forest algorithm predicted the class perfectly with this dataset. That is unlikely to happen with larger datasets, i.e., those with more records and more variables.
Sometimes in machine learning, models are overfitted. That is, we build models so specific to the training data that they take on its random noise, which causes problems when we try to generalize. As good practice, if the initial dataset is large enough, we split the data into training and test sets, as was done here; a sketch of doing so programmatically follows.
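For instance, scikit-learn's train_test_split makes such a split easy. This is a minimal sketch, assuming the full dataset has been read into a single data frame called data (a hypothetical name):

from sklearn.model_selection import train_test_split

# hold out half of the rows for testing, mirroring the train1/test1 split
train_df, test_df = train_test_split(data, test_size=0.5, random_state=42)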
Authored by:
Jeffrey Strickland, Ph.D.
Jeffrey Strickland, Ph.D., is the author of Predictive Analytics Using R and a Senior Analytics Scientist with Clarity Solution Group. He has performed predictive modeling, simulation, and analysis for the Department of Defense, NASA, the Missile Defense Agency, and the financial and insurance industries for over 20 years. Jeff is a Certified Modeling and Simulation Professional (CMSP) and an Associate Systems Engineering Professional (ASEP). He has published nearly 200 blogs on LinkedIn, is a frequently invited guest speaker, and is the author of 20 books including:
- Operations Research using Open-Source Tools
- Discrete Event Simulation using ExtendSim
- Crime Analysis and Mapping
- Missile Flight Simulation
- Mathematical Modeling of Warfare and Combat Phenomenon
- Predictive Modeling and Analytics
- Using Math to Defeat the Enemy
- Verification and Validation for Modeling and Simulation
- Simulation Conceptual Modeling
- System Engineering Process and Practices
