Data Science tutorial by Seth Gregory
Heart disease is one of the deadliest diseases in the world. While it encompasses a wide variety of illnesses, it is the leading cause of death for both men and women, and for most demographic groups, in the United States. Roughly 1 in 4 American deaths every year is from cardiovascular disease, and one person dies from it about every 36 seconds. So why try to predict heart disease? The more quickly and reliably we can catch heart disease, the sooner it can be treated and monitored to reduce the risk of further injury or death. While heart disease seems to strike indiscriminately, there are often common factors among heart disease patients that may be used to predict its onset. Many of these are lifestyle choices, but in this project we will consider primarily numeric medical measurements and how they might indicate the presence of heart disease. Source
Luckily, there are many open-access online sources of data on heart disease patients at our disposal. Since I would like to focus our analysis on factors other than direct lifestyle choices, I chose to use a dataset from the UCI Machine Learning Repository, which obtained the data from four hospitals: Cleveland, Hungary, Switzerland, and Long Beach, VA.
The dataset was downloaded from Kaggle, giving the file heart.csv. Pandas was used to place the data into a dataframe for our analysis.
# Necessary imports for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import sklearn
from sklearn.linear_model import LinearRegression
# Set style of matplotlib
plt.style.use("seaborn-darkgrid")
# Input the dataset
data = pd.read_csv('heart.csv')
data.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
The dataset contains the attributes listed below, with descriptions provided by the source. We'll use a dictionary to keep track of the associated variable names:
variable_names = {
'age': "Age (years)",
'sex': "Sex (0=female, 1=male)",
'cp': "Chest Pain Type",
'trestbps': "Resting blood pressure (mm Hg)",
'chol': "Serum cholestoral (mg/dl)",
'fbs': "Fasting blood sugar > 120 mg/dl",
'restecg': "Resting ECG results",
'thalach': "Maximum heart rate",
'exang': "Exercise induced angina",
'oldpeak': "ST depression induced by exercise related to rest",
'slope': "Peak ST segment slope",
'ca': 'Number of major vessels colored by fluoroscopy',
'thal': 'Thalassemia',
'target': "Has heart disease",
}
We would like to create some sort of model to predict the presence of heart disease in a patient based on the other values in our table. In the context of this data, that means training a model on a subset of the data so that it can accurately categorize any hypothetical new data point as having a 'target' value of either '1' or '0' based on the values of its other attributes.
First, however, we'd like to get a better handle on the data we have and see which attributes tend to correlate with one another. Ideally, we want to find attributes that correlate strongly with 'target' (i.e., that will be good indicators of the presence of heart disease). For our analysis, we have:
Independent Variables: the medical attributes other than 'target', i.e. 'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', and 'thal'
Dependent Variable: 'target', the presence of heart disease
However, this classification may prove a little misleading, as it does not capture the full relationship between the variables. More accurately, there may be relationships among the 'independent' variables that suggest they are not entirely independent of one another. In fact, we may suspect this intuitively: for example, 'slope' and 'oldpeak' both relate to the ST segment of the ECG, and we might expect a patient's maximum heart rate to vary with age. We will explore these possibilities below.
We'll begin our exploratory analysis by plotting some histograms to get a basic idea of how patients with and without heart disease are distributed across each of the variables. It is important to distinguish between 'categorical' (discrete) and 'continuous' variables for the purposes of this analysis, which we'll touch on again later in the project.
# Split the variables into 'categorical' and 'continuous'
# Categorical variables
categ_data = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
# Continuous variables
contin_data = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
# Make the histograms for categorical data
plt.figure(figsize=(15,15))
for i, column in enumerate(categ_data, 1):
    plt.subplot(3, 3, i)
    data[data["target"] == 0][column].hist(bins=5, color='blue', label='Does not have Heart Disease', alpha=0.6)
    data[data["target"] == 1][column].hist(bins=5, color='red', label='Has Heart Disease', alpha=0.6)
    plt.legend()
    plt.xlabel(variable_names[column])
# Make the continuous histograms, with a larger number of bins
plt.figure(figsize=(15, 15))
for i, column in enumerate(contin_data, 1):
    plt.subplot(3, 2, i)
    data[data["target"] == 0][column].hist(bins=50, color='blue', label='Does not have Heart Disease', alpha=0.6)
    data[data["target"] == 1][column].hist(bins=50, color='red', label='Has Heart Disease', alpha=0.6)
    plt.legend()
    plt.xlabel(variable_names[column])
In these histograms, we are looking for meaningful 'separations' in each graph between patients with and without heart disease, which would suggest that the variable in question has some correlation with the presence of heart disease. There are several observations:
First, we should note the demographic distribution of our data. We have roughly an equal number of subjects with and without heart disease, so our data is even in that regard. However, our data skews toward male patients over female patients, with a substantially greater percentage of male patients having heart disease than female patients. This would seem to suggest that male patients have a significantly greater rate of heart disease than female patients, but this could be merely a feature of our dataset.
As for patterns we notice: a significant majority of patients with heart disease are separated from patients without heart disease on several attributes. This would suggest that, based on our sample, the variables 'exang', 'oldpeak', 'thalach', 'thal', 'ca', and 'slope' might be good indicators of heart disease in a patient. Before moving forward with our analysis, let's check whether any of these variables correlate with one another by using a heatmap.
corr = data.corr()
plt.subplots(figsize=(13,10))
hm = sn.heatmap(corr, annot=True)
hm.set_title("Correlation Matrix between Columns")
hm.set_xticklabels(hm.get_xticklabels(), rotation=45, horizontalalignment='right')
hm.set_yticklabels(hm.get_yticklabels(), rotation=45, horizontalalignment='right')
plt.show()
The correlation matrix corroborates some observations we already made: the variables that most correlate with 'target' are 'cp', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', and 'thal'. The correlation values aren't particularly large in absolute terms, but they are large relative to the rest of the dataset. So, we would like to examine the correlation between these variables and 'target' further with R² tests.
First, however, we must take note of the correlation between the 'independent' variables: 'cp' and 'thal' do not correlate particularly strongly with variables other than 'target', but 'ca', 'thalach', 'exang', and 'slope' all have relatively high correlation with one another, equal to or even greater than their correlation with 'target', so we should consider examining these together with 'target'.
For further analysis, we would like to examine the strength of the relationship between each independent variable and the dependent variable. Using a significance level of 0.05, we will determine from each test whether we can claim a statistically significant relationship between the pair of variables.
First, we will examine the relationship between each pair of ('cp', 'thalach', 'exang', 'oldpeak') to verify that we should consider these together in our test with 'target'.
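Before walking through the individual fits, a compact way to survey all of the pairs is to loop over them and collect the R² and p-value from each regression. This is just a sketch: it uses the same intercept-free sm.OLS fit as the individual summaries below (so the R² values are the 'uncentered' ones), and the loop itself is not part of the original analysis.
import itertools
import statsmodels.api as sm
# Collect the uncentered R-squared and coefficient p-value for every ordered pair
pair_vars = ['cp', 'thalach', 'exang', 'oldpeak']
rows = []
for x_col, y_col in itertools.permutations(pair_vars, 2):
    fit = sm.OLS(data[y_col], data[x_col]).fit()   # no intercept, matching the fits below
    rows.append({'independent': x_col, 'dependent': y_col,
                 'r_squared': fit.rsquared, 'p_value': fit.pvalues[x_col]})
pd.DataFrame(rows).sort_values('r_squared', ascending=False)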
import statsmodels.api as sm
# Independent
X = data['cp']
# Dependent
y = data['exang']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | exang | R-squared (uncentered): | 0.024 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.021 |
Method: | Least Squares | F-statistic: | 7.461 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 0.00668 |
Time: | 20:18:32 | Log-Likelihood: | -256.77 |
No. Observations: | 303 | AIC: | 515.5 |
Df Residuals: | 302 | BIC: | 519.3 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| cp | 0.0628 | 0.023 | 2.731 | 0.007 | 0.018 | 0.108 |
Omnibus: | 4406.815 | Durbin-Watson: | 1.311 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 51.572 |
Skew: | 0.716 | Prob(JB): | 6.33e-12 |
Kurtosis: | 1.573 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['cp']
# Dependent
y = data['oldpeak']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | oldpeak | R-squared (uncentered): | 0.141 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.138 |
Method: | Least Squares | F-statistic: | 49.70 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 1.22e-11 |
Time: | 20:18:32 | Log-Likelihood: | -541.03 |
No. Observations: | 303 | AIC: | 1084. |
Df Residuals: | 302 | BIC: | 1088. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| cp | 0.4142 | 0.059 | 7.050 | 0.000 | 0.299 | 0.530 |
Omnibus: | 55.440 | Durbin-Watson: | 1.189 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 83.289 |
Skew: | 1.123 | Prob(JB): | 8.21e-19 |
Kurtosis: | 4.247 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['exang']
# Dependent
y = data['oldpeak']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | oldpeak | R-squared (uncentered): | 0.311 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.309 |
Method: | Least Squares | F-statistic: | 136.3 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 3.02e-26 |
Time: | 20:18:32 | Log-Likelihood: | -507.66 |
No. Observations: | 303 | AIC: | 1017. |
Df Residuals: | 302 | BIC: | 1021. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| exang | 1.5192 | 0.130 | 11.676 | 0.000 | 1.263 | 1.775 |
Omnibus: | 50.583 | Durbin-Watson: | 1.561 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 95.334 |
Skew: | 0.897 | Prob(JB): | 1.99e-21 |
Kurtosis: | 5.081 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['oldpeak']
# Dependent
y = data['thalach']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | thalach | R-squared (uncentered): | 0.386 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.384 |
Method: | Least Squares | F-statistic: | 189.9 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 7.54e-34 |
Time: | 20:18:32 | Log-Likelihood: | -1877.0 |
No. Observations: | 303 | AIC: | 3756. |
Df Residuals: | 302 | BIC: | 3760. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| oldpeak | 60.4062 | 4.384 | 13.779 | 0.000 | 51.780 | 69.033 |
Omnibus: | 39.033 | Durbin-Watson: | 0.722 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 50.086 |
Skew: | -0.951 | Prob(JB): | 1.33e-11 |
Kurtosis: | 3.589 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['thalach']
# Dependent
y = data['cp']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | cp | R-squared (uncentered): | 0.503 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.501 |
Method: | Least Squares | F-statistic: | 305.4 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 9.78e-48 |
Time: | 20:18:32 | Log-Likelihood: | -428.85 |
No. Observations: | 303 | AIC: | 859.7 |
Df Residuals: | 302 | BIC: | 863.4 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| thalach | 0.0066 | 0.000 | 17.474 | 0.000 | 0.006 | 0.007 |
Omnibus: | 89.179 | Durbin-Watson: | 1.786 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 28.345 |
Skew: | 0.529 | Prob(JB): | 7.00e-07 |
Kurtosis: | 1.938 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['thalach']
# Dependent
y = data['exang']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | exang | R-squared (uncentered): | 0.268 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.266 |
Method: | Least Squares | F-statistic: | 110.8 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 2.80e-22 |
Time: | 20:18:32 | Log-Likelihood: | -213.12 |
No. Observations: | 303 | AIC: | 428.2 |
Df Residuals: | 302 | BIC: | 431.9 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| thalach | 0.0020 | 0.000 | 10.526 | 0.000 | 0.002 | 0.002 |
Omnibus: | 15133.636 | Durbin-Watson: | 1.691 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 52.972 |
Skew: | 0.731 | Prob(JB): | 3.14e-12 |
Kurtosis: | 1.565 | Cond. No. | 1.00 |
From our R² analysis of the relationships between 'cp', 'exang', 'oldpeak', and 'thalach', all but two of the pairs had R² values over 25%, indicating statistically significant relationships between these variables that we should account for when calculating their correlation with the presence of heart disease. The reason we do this is so that we can more confidently say how much of the variation in heart disease is due to variation in our input variables. If we did not account for these interactions between the 'independent' variables, this would be much more uncertain. For example, if we only considered the correlation between chest pain and heart disease, we would not account for how much chest pain may in turn be explained by maximum heart rate, which our tests suggest may be as much as 50%. That would hurt our ability to truly understand the relationship between any one of these independent variables and the presence of heart disease. Thus, by considering all of these variables together with heart disease, we can more generally analyze how much of the variation in the presence of heart disease is due to all of these factors together, taking the interactions between the input variables into account as well.
import statsmodels.api as sm
# Independent
X = data[['cp', 'exang', 'oldpeak', 'thalach']]
# Dependent
y = data['target']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | target | R-squared (uncentered): | 0.722 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.719 |
Method: | Least Squares | F-statistic: | 194.5 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 6.80e-82 |
Time: | 20:18:32 | Log-Likelihood: | -143.71 |
No. Observations: | 303 | AIC: | 295.4 |
Df Residuals: | 299 | BIC: | 310.3 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| cp | 0.1292 | 0.024 | 5.360 | 0.000 | 0.082 | 0.177 |
| exang | -0.1928 | 0.052 | -3.698 | 0.000 | -0.295 | -0.090 |
| oldpeak | -0.1177 | 0.020 | -5.975 | 0.000 | -0.156 | -0.079 |
| thalach | 0.0040 | 0.000 | 14.678 | 0.000 | 0.003 | 0.005 |
Omnibus: | 12.000 | Durbin-Watson: | 0.831 |
---|---|---|---|
Prob(Omnibus): | 0.002 | Jarque-Bera (JB): | 12.388 |
Skew: | -0.469 | Prob(JB): | 0.00204 |
Kurtosis: | 2.679 | Cond. No. | 359. |
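As an aside, the interactions discussed above could also be modeled explicitly. Below is a minimal sketch using the statsmodels formula API; the specific formula (an interaction between 'cp' and 'thalach') and the intercept it adds are my own choices for illustration, not part of the analysis above.
import statsmodels.formula.api as smf
# 'cp * thalach' expands to cp + thalach + cp:thalach (their interaction term);
# unlike the fits above, the formula API also includes an intercept by default
interaction_model = smf.ols('target ~ cp * thalach + exang + oldpeak', data=data).fit()
interaction_model.summary()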
Finally, we would also like to consider the correlations between ca and target, and thal and target:
import statsmodels.api as sm
# Independent
X = data['ca']
# Dependent
y = data['target']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | target | R-squared (uncentered): | 0.046 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.043 |
Method: | Least Squares | F-statistic: | 14.48 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 0.000172 |
Time: | 20:18:32 | Log-Likelihood: | -330.77 |
No. Observations: | 303 | AIC: | 663.5 |
Df Residuals: | 302 | BIC: | 667.2 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ca | 0.1258 | 0.033 | 3.805 | 0.000 | 0.061 | 0.191 |
Omnibus: | 1664.957 | Durbin-Watson: | 0.051 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 43.452 |
Skew: | -0.207 | Prob(JB): | 3.67e-10 |
Kurtosis: | 1.192 | Cond. No. | 1.00 |
import statsmodels.api as sm
# Independent
X = data['thal']
# Dependent
y = data['target']
model = sm.OLS(y,X).fit()
model.summary()
Dep. Variable: | target | R-squared (uncentered): | 0.428 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.426 |
Method: | Least Squares | F-statistic: | 225.9 |
Date: | Mon, 17 May 2021 | Prob (F-statistic): | 1.66e-38 |
Time: | 20:18:32 | Log-Likelihood: | -253.25 |
No. Observations: | 303 | AIC: | 508.5 |
Df Residuals: | 302 | BIC: | 512.2 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| thal | 0.2017 | 0.013 | 15.030 | 0.000 | 0.175 | 0.228 |
Omnibus: | 1647.199 | Durbin-Watson: | 0.087 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 43.904 |
Skew: | -0.207 | Prob(JB): | 2.93e-10 |
Kurtosis: | 1.182 | Cond. No. | 1.00 |
Now that we have done a bit of analysis on factors that might help us predict the presence of heart disease, let's turn to several machine learning models to implement this prediction. Machine learning is, generally speaking, a term for models that learn from data and make decisions with minimal human interaction. For the purposes of 'teaching' the models, we provide a randomly selected subset of our data as "training data" and use the unselected data as "test data." This is to ensure that the model isn't "overfitted" to our specific training data, i.e. a model that works great for our sample but doesn't generalize well to other data sets. More info on overfitting here. For the purposes of our models, we'll use 80% of the sample as a training set and 20% as a test set.
One thing we must first consider in building our models is how to handle our categorical data. For the purposes of a regression model, we need to change each of our categorical variables into "dummy" variables. Though we may represent something like chest pain type as a number between 0 and 3, there is no mathematical reason why, for example, 2 * chest_pain_1 should equal chest_pain_2; yet our regression model will infer this sort of numerical relationship between the values unless we change the way the data is stored. We do this by adding a "dummy" column for each possible value of each categorical variable, containing a '1' or '0' to indicate whether that category applies. This is displayed below.
Furthermore, since each of our variables has different scales, and we primarily want to compare the variations between them, we should also normalize our data before training our models on it.
from sklearn.model_selection import train_test_split
data2 = data.copy()
# Make dummy variable columns
cp_dummies = pd.get_dummies(data2['cp'], prefix='cp')
thal_dummies = pd.get_dummies(data2['thal'], prefix='thal')
slope_dummies = pd.get_dummies(data2['slope'], prefix='slope')
dummy_columns = [data2, cp_dummies, thal_dummies, slope_dummies]
data_cp = pd.concat(dummy_columns, axis=1)
data2 = data_cp.drop(columns=['cp','thal','slope'])
# Determine input and output variables
# (keep the full dataframe as y so that 'target' and the original 'cp' column can be recovered for the test rows later)
y = data_cp
x = data2.copy().drop(['target'], axis=1)
# Normalize the data
x = (x - np.min(x)) / (np.max(x) - np.min(x)).values
# Split data into test set and train set
x_train, x_test, y_train_t, y_test_t = train_test_split(x, y, test_size = 0.2, random_state=0)
y_train = y_train_t['target'].values
y_test = y_test_t['target'].values
# Show new dummy-ized dataset
data2.head()
| | age | sex | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | ca | ... | cp_1 | cp_2 | cp_3 | thal_0 | thal_1 | thal_2 | thal_3 | slope_0 | slope_1 | slope_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 37 | 1 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 41 | 0 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 56 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 57 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 rows × 22 columns
First, we'll use a simple model called logistic regression. This model optimizes an equation that takes in a set of input variables and uses them to predict our output variable, attempting to minimize 'loss': the difference between what our model predicts the values to be and what they actually are. In this context, it will be a measure of how often we correctly identify whether or not a patient has heart disease.
from sklearn.linear_model import LogisticRegression
import seaborn as sns
model = LogisticRegression().fit(x_train, y_train)
y_predict = model.predict(x_test)
residuals = y_predict - y_test
plt.figure(figsize=(13,7))
reg_violin_plot = sns.violinplot(x=y_test_t['cp'], y=residuals, palette='rocket')
reg_violin_plot.set(xlabel="Chest pain type",
ylabel="Heart Disease logistic regression model residual",
title='Heart Disease Logistic Regression Model Residuals by Chest Pain Type')
plt.show()
Above is a violin plot showing the range of our residuals (the difference between actual and predicted values) for the logistic regression model, separated by the type of chest pain. The average residual is around zero, but there are a few interesting observations. First, this model has a number of outliers, which suggests that our logistic regression prediction could be better. Second, for patients who displayed chest pain of type 3 (asymptomatic chest pain, unrelated to disease), our model was much less accurate at predicting heart disease. Intuitively this makes some sense: we would expect patients who already demonstrate chest pain indicative of some sort of angina to be more clearly at risk of heart disease.
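To quantify that last observation, we could compute the accuracy within each chest pain type on the test set. This is a quick sketch, not part of the model comparison below; it reuses the untransformed 'cp' column kept in y_test_t.
# Accuracy of the logistic regression predictions within each chest pain type
per_cp = pd.DataFrame({'cp': y_test_t['cp'].values, 'correct': (y_predict == y_test)})
print(per_cp.groupby('cp')['correct'].mean())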
# Imports
from sklearn.metrics import accuracy_score
test_score = accuracy_score(y_test, y_predict) * 100
train_score = accuracy_score(y_train, model.predict(x_train)) * 100
# Analyze the accuracy of our model
results = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]], columns=['Model', 'Train Accuracy %', 'Test Accuracy %'])
results
| | Model | Train Accuracy % | Test Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 85.950413 | 86.885246 |
According to our accuracy score, our model had an 85.95% accuracy rate on the training set, and an 86.89% accuracy rate on the test set. This suggests that our model was fairly effective at predicting the presence of heart disease in patients. However, let's see if we can come up with an even better model, using some other machine learning models:
K-Nearest Neighbors is a classification model that tries to predict the value of an output based on similar "points" in the dataset "close-by" to it. While it's a bit difficult to visualize what this might look like in higher dimensions, in theory a multi-dimensional scatter plot of our input data will demonstrate some type of grouping that separates patients with heart disease from those without, and the k-nearest neighbors model will use a measure of "distance" to determine the class of each test input.
from sklearn.neighbors import KNeighborsClassifier
# Initialize and train the model
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
# Calculate residuals
y_predict = knn.predict(x_test)
residuals = y_predict - y_test
# Make the violin plot
plt.figure(figsize=(13,7))
reg_violin_plot = sns.violinplot(x=y_test_t['cp'], y=residuals, palette='rocket')
reg_violin_plot.set(xlabel="Chest pain type",
ylabel="Heart Disease k-nearest neighbors model residual",
title='Heart Disease K-Nearest Neighbors Model Residuals by Chest Pain Type')
plt.show()
# Calculate accuracy score
test_score = accuracy_score(y_test, y_predict) * 100
train_score = accuracy_score(y_train, knn.predict(x_train)) * 100
results = results.append(pd.DataFrame(data=[["K-nearest neighbors", train_score, test_score]], columns=['Model', 'Train Accuracy %', 'Test Accuracy %']), ignore_index=True)
results
| | Model | Train Accuracy % | Test Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 85.950413 | 86.885246 |
| 1 | K-nearest neighbors | 83.057851 | 85.245902 |
According to our accuracy score, the k-nearest neighbors model performed slightly worse than the logistic regression model, with an accuracy of 85.25% on the test set. One reason for this might be the large number of input variables in our dataset; with more input variables (i.e. more dimensions), "distance" becomes a less meaningful way to separate points.
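One knob worth checking is the number of neighbors, which scikit-learn defaults to 5. The sweep below is only a sketch, not part of the comparison above; the candidate values and the use of 5-fold cross-validation on the training set are arbitrary choices.
from sklearn.model_selection import cross_val_score
# Compare cross-validated training accuracy for a few neighbor counts
for k in [3, 5, 7, 9, 11, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), x_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")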
Next we will consider a decision tree model. A decision tree is made up of nodes branching out from a single root node; at each node, we decide which way to move down the tree based on the values of the input variables. In this way, the decision tree repeatedly "splits" the data by "asking questions" until it classifies the input as either having heart disease or not. For this decision tree, we choose a max_depth of 10, meaning the tree will grow at most 10 levels deep before deciding on a classification.
# Imports
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Build the decision tree model and train it
dtree = DecisionTreeClassifier(random_state=0,max_depth=10)
dtree.fit(x_train, y_train)
# Display the tree
fig, ax = plt.subplots(figsize=(30,12))
tree.plot_tree(dtree, fontsize=12, feature_names=list(x_train.columns), max_depth=3)
plt.show()
# Calculate the residuals
y_predict = dtree.predict(x_test)
residuals = y_predict - y_test
# Make a violin plot of the residuals
plt.figure(figsize=(13,7))
reg_violin_plot = sns.violinplot(x=y_test_t['cp'], y=residuals, palette='rocket')
reg_violin_plot.set(xlabel="Chest pain type",
ylabel="Heart Disease decision tree model residual",
title='Heart Disease Decision Tree Model Residuals by Chest Pain Type')
plt.show()
# Calculate the accuracy score
test_score = accuracy_score(y_test, y_predict) * 100
train_score = accuracy_score(y_train, dtree.predict(x_train)) * 100
results = results.append(pd.DataFrame(data=[["Decision Tree Classifier", train_score, test_score]], columns=['Model', 'Train Accuracy %', 'Test Accuracy %']), ignore_index=True)
results
| | Model | Train Accuracy % | Test Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 85.950413 | 86.885246 |
| 1 | K-nearest neighbors | 83.057851 | 85.245902 |
| 2 | Decision Tree Classifier | 100.000000 | 78.688525 |
In this case, our model achieved 100% accuracy on the training data but only about 78% accuracy on the test data. The residuals also seem to include more outliers. This may suggest that our model is overfitted to the training data, though it may also suggest that a decision tree is simply not the best choice of model in this case, since lowering the number of levels did not improve the accuracy.
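To check how the depth affects this, we could sweep max_depth directly and compare training and test accuracy. This is a sketch; the specific depths are arbitrary and this check is not part of the comparison table.
# Compare train vs. test accuracy as the tree is allowed to grow deeper
for depth in [2, 3, 4, 5, 7, 10]:
    t = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(x_train, y_train)
    print(f"max_depth={depth}: train={t.score(x_train, y_train):.3f}, test={t.score(x_test, y_test):.3f}")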
Next we will try another model, the support vector machine (SVM) classifier. This classification method focuses on finding a 'dividing' boundary that separates the classes. In particular, this model is effective at dealing with high-dimensional data, so we might expect it to perform better.
from sklearn.svm import SVC
# Build and train the SVM model
svm = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm.fit(x_train, y_train)
# Calculate the residuals
y_predict = svm.predict(x_test)
residuals = y_predict - y_test
# Make the violin plot of residuals
plt.figure(figsize=(13,7))
reg_violin_plot = sns.violinplot(x=y_test_t['cp'], y=residuals, palette='rocket')
reg_violin_plot.set(xlabel="Chest pain type",
ylabel="Heart Disease support vector machine model residual",
title='Heart Disease Support Vector Machine Model Residuals by Chest Pain Type')
plt.show()
# Calculate the accuracy score
test_score = accuracy_score(y_test, y_predict) * 100
train_score = accuracy_score(y_train, svm.predict(x_train)) * 100
results = results.append(pd.DataFrame(data=[["Support Vector Classifier", train_score, test_score]], columns=['Model', 'Train Accuracy %', 'Test Accuracy %']), ignore_index=True)
results
| | Model | Train Accuracy % | Test Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 85.950413 | 86.885246 |
| 1 | K-nearest neighbors | 83.057851 | 85.245902 |
| 2 | Decision Tree Classifier | 100.000000 | 78.688525 |
| 3 | Support Vector Classifier | 85.537190 | 86.885246 |
According to our accuracy score, the SVM model performed about as well as the logistic regression model, with a test accuracy of 86.89%. The residuals also look much the same as in the regression model. This similarity may be due to the relatively small size of the dataset (fewer than 500 patients).
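Note that the gamma and C values above were fixed by hand. A small grid search could show whether a different kernel width or regularization strength helps; the sketch below uses arbitrary candidate values and is not part of the comparison table.
from sklearn.model_selection import GridSearchCV
# Cross-validated search over a few candidate (C, gamma) pairs for the RBF kernel
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)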
Lastly, we'll try the random forest classifier. A random forest consists of a "forest" of decision trees, each built from randomly selected observations and features rather than the full input, with the results of all of the trees averaged at the end. In this way, it lessens the risk of overfitting that a single decision tree brings. For our forest, we'll choose an n_estimators value of 1000, which means we'll build 1000 decision trees and average their results. Generally speaking, increasing this parameter improves the result but increases the computation time.
from sklearn.ensemble import RandomForestClassifier
# Build the random forest model and train
rf = RandomForestClassifier(n_estimators=1000, random_state=18)
rf.fit(x_train, y_train)
# Calculate the residuals
y_predict = rf.predict(x_test)
residuals = y_predict - y_test
# Make violin plot of the residuals
plt.figure(figsize=(13,7))
reg_violin_plot = sns.violinplot(x=y_test_t['cp'], y=residuals, palette='rocket')
reg_violin_plot.set(xlabel="Chest pain type",
ylabel="Heart Disease random forest model residual",
title='Heart Disease Random Forest Model Residuals by Chest Pain Type')
plt.show()
# Calculate the accuracy score
test_score = accuracy_score(y_test, y_predict) * 100
train_score = accuracy_score(y_train, rf.predict(x_train)) * 100
results = results.append(pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]], columns=['Model', 'Train Accuracy %', 'Test Accuracy %']), ignore_index=True)
results
| | Model | Train Accuracy % | Test Accuracy % |
|---|---|---|---|
| 0 | Logistic Regression | 85.950413 | 86.885246 |
| 1 | K-nearest neighbors | 83.057851 | 85.245902 |
| 2 | Decision Tree Classifier | 100.000000 | 78.688525 |
| 3 | Support Vector Classifier | 85.537190 | 86.885246 |
| 4 | Random Forest Classifier | 100.000000 | 90.163934 |
The random forest classifier has our best accuracy, 90.16% on the test data. This can likely be attributed to the randomized nature of the forest and the large number of estimators used in building the model.
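Since the random forest performed best, its built-in feature importances give one rough view of which inputs the model leaned on. This is only a sketch: the importances are impurity-based, were not part of the analysis above, and should be read cautiously with correlated inputs like ours.
# Impurity-based importance of each (normalized, dummy-encoded) input column
importances = pd.Series(rf.feature_importances_, index=x_train.columns).sort_values(ascending=False)
print(importances.head(10))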
Predicting the presence of heart disease is a difficult task, reliant on many different factors that our analysis was not able to consider. For one, heart disease refers to a wide variety of diseases, and it is likely that many of the variables we considered in this project affect different sets of these diseases in different ways. There are also many commonly considered risk factors that this project did not take into account: some of the best-known indicators are smoking, diabetes, weight, diet, physical activity, and alcohol use, none of which we touched on in this project. Furthermore, our dataset considered only patients from four particular hospitals, and a small set of patients at that. We do not know the demographics of these individuals, so we cannot know for sure how skewed the data might be in that regard. To know more, we would have to find another dataset, make sure the data is comparable, and run our models on it.
However, we did find some interesting insights from our data: namely, at least for our dataset, the random forest classifier did a fairly accurate job of predicting heart disease purely from the factors we considered. We established a significant correlation between several of our variables and the presence of heart disease, and we noticed that it was easier to predict the presence of heart disease in patients whose chest pain was not asymptomatic.