Predicting Heart Disease

Data Science tutorial by Seth Gregory

Introduction

Heart disease is one of the deadliest categories of disease in the world. While it encompasses a wide variety of illnesses, it is the leading cause of death for men, women, and most demographic groups in the United States. Roughly 1 in 4 American deaths every year is from cardiovascular disease, and one person dies from it about every 36 seconds. So why try to predict heart disease? The more quickly and reliably we can catch it, the sooner it can be treated and monitored to prevent the risk of further injury or death. While heart disease strikes seemingly indiscriminately, there are often common factors among heart disease patients that may be used to predict its onset. Many of these are lifestyle choices, but in this project we will consider primarily numeric medical measurements and how they might indicate the presence of heart disease. Source

Gathering the Data

Luckily, there are many open-access online sources of data on heart disease patients at our disposal. Since I would like to focus our analysis on factors that are not directly tied to lifestyle, I chose a dataset from the UCI Machine Learning Repository, which collected it from four hospitals: Cleveland, Hungary, Switzerland, and Long Beach, VA.

The dataset was downloaded from Kaggle and the following file was obtained:

Pandas was used to place the data into a dataframe, to be used for our analysis.
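A minimal loading step might look like the following sketch, assuming the downloaded file is saved as heart.csv (the filename is an assumption; use whatever name the Kaggle download produced):

```python
import pandas as pd

# Load the heart disease data into a dataframe
# (assumes the Kaggle file was saved as "heart.csv" in the working directory)
df = pd.read_csv("heart.csv")
df.head()
```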

Describing the dataset

The dataset contains the following attributes, with descriptions provided by the source:

We'll use a dictionary to keep track of the associated variable names:
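A sketch of such a dictionary, based on the standard documentation of the UCI heart disease attributes (the variable name `descriptions` and the exact wording are assumptions, not taken from the original notebook):

```python
# Short descriptions for each column, per the UCI dataset documentation
descriptions = {
    "age":      "age in years",
    "sex":      "sex (1 = male, 0 = female)",
    "cp":       "chest pain type",
    "trestbps": "resting blood pressure (mm Hg)",
    "chol":     "serum cholesterol (mg/dl)",
    "fbs":      "fasting blood sugar > 120 mg/dl (1 = true, 0 = false)",
    "restecg":  "resting electrocardiographic results",
    "thalach":  "maximum heart rate achieved",
    "exang":    "exercise-induced angina (1 = yes, 0 = no)",
    "oldpeak":  "ST depression induced by exercise relative to rest",
    "slope":    "slope of the peak exercise ST segment",
    "ca":       "number of major vessels colored by fluoroscopy",
    "thal":     "thalassemia",
    "target":   "presence of heart disease (1 = disease, 0 = no disease)",
}
```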

Exploratory Analysis

We would like to create some sort of model to predict the presence of heart disease in a patient, based on some of the other values in our table. In the context of this data, this means we would like to train a model on some subset of the data that can accurately categorize any hypothetical new data point as having a 'target' value of either '1' or '0' based on the values of its other attributes.

First, however, we'd like to get a better handle on the data we have and see which attributes tend to correlate with one another. Ideally, we want to find attributes that correlate strongly with 'target' (i.e., that will be good indicators of the presence of heart disease). For our analysis, we have:

Independent Variables: 'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'

Dependent Variable: 'target' (the presence of heart disease)

However, this classification may prove a little misleading, as it does not capture the full relationship between the variables. More accurately, there may be relationships among the 'independent' variables that suggest they are not entirely independent from one another. In fact, we may even suspect this to be the case intuitively: for example, 'slope' and 'oldpeak' both relate to the ST segment of the ECG, we might expect a patient's age to affect their maximum heart rate, and so on. We will explore these possibilities below:

Data Histograms

We'll begin our exploratory analysis by plotting some basic histograms to get an idea of how patients with and without heart disease are distributed across each of the variables. It is important for us to distinguish between 'categorical' (discrete) and 'continuous' variables for the purposes of this analysis, a point we'll touch on again later in the project.
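A sketch of how such plots might be produced with seaborn, using the dataframe loaded earlier; the exact variables plotted and the styling are assumptions rather than the original notebook's code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Categorical variables: counts of patients with/without heart disease per category
for col in ["sex", "cp", "exang", "slope", "ca", "thal"]:
    sns.countplot(data=df, x=col, hue="target")
    plt.title(f"{col} by heart disease status")
    plt.show()

# Continuous variables: overlaid distributions for each target class
for col in ["age", "trestbps", "chol", "thalach", "oldpeak"]:
    sns.histplot(data=df, x=col, hue="target", kde=True)
    plt.title(f"{col} by heart disease status")
    plt.show()
```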

Categorical Variables

Interpreting the Histograms

In the histograms, we are looking for meaningful 'separations' in each graph between patients with and without heart disease, which would suggest that the variable in question has some correlation with the presence of heart disease. There are several observations:

First, we should note the demographic distribution of our data. We have roughly an equal number of subjects with and without heart disease, so our data is even in that regard. However, our data skews toward male patients over female patients, with a substantially greater percentage of male patients having heart disease than female patients. This would seem to suggest that male patients have a significantly greater rate of heart disease than female patients, but this could be merely a feature of our dataset.

As for patterns we notice: a significant majority of patients with heart disease seem to have the following attributes that distinguish them from patients without heart disease:

This would suggest that, based on our sample, the variables ('exang', 'oldpeak', 'thalach', 'thal', 'ca', 'slope') might be good indicators of heart disease in a patient. Before moving forward with our analysis, let's check and see if any of these variables correlate with one another by using a heatmap.

Correlation Matrix
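A sketch of how the heatmap could be generated with seaborn from the dataframe loaded earlier (figure size and color map are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between every pair of columns, shown as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```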

The correlation matrix corroborates some observations we already made: the variables that most correlate with 'target' are 'cp', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', and 'thal'. The correlation values aren't particularly large in absolute terms, but they are large relative to the rest of the dataset. So, we would like to examine the correlation between these variables and 'target' further with R2 tests.

First, however, we must take note of the correlation between the 'independent' variables: 'cp' and 'thal' do not correlate particularly strongly with variables other than 'target', but 'ca', 'thalach', 'exang', and 'slope' all have relatively high correlation with one another, equal to or even greater than their correlation with 'target', so we should consider examining these together with 'target'.

R2 Regression Tests

For further analysis, we would like to examine the strength of the relationship between each independent variable and the dependent variable. Using a significance level of 0.05, we will determine from each test whether we can claim a statistically significant relationship between the variables involved.

First, we will examine the relationship between each pair of ('cp', 'thalach', 'exang', 'oldpeak') to verify that we should consider these together in our test with 'target'.
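One way each pairwise test could be run is with an ordinary least squares fit from statsmodels, which reports both the R2 value and a p-value for the slope. The helper below is a hedged sketch (the original notebook's implementation may differ); the sections that follow apply the same idea to each pair:

```python
import statsmodels.formula.api as smf

def r2_test(y, x, data, alpha=0.05):
    """Fit y ~ x and report R-squared and whether the slope is significant."""
    model = smf.ols(f"{y} ~ {x}", data=data).fit()
    p_value = model.pvalues[x]
    print(f"{y} vs {x}: R2 = {model.rsquared:.3f}, "
          f"p = {p_value:.4f}, significant = {p_value < alpha}")
    return model

# Example: chest pain type vs. exercise-induced angina
r2_test("cp", "exang", df)
```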

Chest-pain Type (cp) vs. Exercise-induced angina (exang)

Analysis Summary: cp vs exang
Analysis Summary: cp vs oldpeak
Analysis Summary: exang vs oldpeak
Analysis Summary: oldpeak vs thalach

Maximum heart rate achieved (thalach) vs. Chest pain type (cp)

Analysis Summary: thalach vs cp
Analysis Summary: thalach vs exang

Multivariate Multiple Regression with cp, exang, oldpeak, thalach vs. Heart Disease (target)

From our R2 analysis of the relationships between cp, exang, oldpeak, and thalach, all but two of the pairs had R2 values over 25%, indicating statistically significant correlations among these variables that we should account for when calculating their correlation with the presence of heart disease. The reason we do this is so that we can more confidently say how much of the variation in heart disease is due to variation in our input variables. If we did not account for these interactions between the 'independent' variables, this would become more uncertain. For example, if we only considered the correlation between chest pain and heart disease, we would not account for how much chest pain may in turn be affected by maximum heart rate, which our tests suggested may be as high as 50%. This hurts our ability to truly understand the relationship between any one of these independent variables and the presence of heart disease. Thus, by considering all of these variables together with heart disease, and including the interaction terms between the input variables, we can more generally analyze how much of the variation in the presence of heart disease is due to all of these factors together.
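A sketch of how the combined model with interaction terms could be fit; the `**2` in the patsy formula expands to all main effects plus every pairwise interaction, which is an assumption about the set of terms the original analysis used:

```python
import statsmodels.formula.api as smf

# Main effects plus all pairwise interaction terms between the four inputs
formula = "target ~ (cp + exang + oldpeak + thalach) ** 2"
model = smf.ols(formula, data=df).fit()
print(f"R2 = {model.rsquared:.3f}")
print(model.summary())
```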

Analysis Summary: cp, exang, oldpeak, thalach vs target

Finally, we would also like to consider the correlations between ca and target, and thal and target:

Number of major vessels colored by fluoroscopy (ca) vs. Heart Disease (target)

Analysis Summary: ca vs target

Thalassemia (thal) vs. Heart Disease (target)

Analysis Summary: thal vs target

Machine Learning

Now that we have done a bit of analysis on factors that might help us predict the presence of heart disease, let's turn to using several machine learning models to implement this prediction. Machine learning is, generally speaking, a term used to describe a model that learns from data and makes decisions with minimal human interaction. For the purposes of 'teaching' the models, we provide a randomly selected subset of our data as "training data" and use the unselected data as "test data." This lets us check that the model isn't "overfitted" to our specific training data, i.e. a model that works great for our sample but doesn't generalize well to other data sets. More info on overfitting here. For our models, we'll use 80% of the sample as a training set and 20% as a test set, as sketched below.
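A minimal sketch of the 80/20 split with scikit-learn (the filename and random_state are assumptions; the categorical encoding and normalization described next are applied before the split in the full pipeline further below):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")  # the dataframe from earlier (filename assumed)

# Randomly hold out 20% of the rows as a test set; the remaining 80% is for training
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), "training rows,", len(test_df), "test rows")
```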

One thing we must first consider in building our model is how to handle our categorical data. For the purposes of a regression model, we need to convert each of our categorical variables into "dummy" variables. Though we may represent something like chest pain type as a number between 1 and 3, there is no mathematical reason why, for example, 2 * chest_pain_type_1 should equal chest_pain_type_2; yet our regression model will infer this sort of numerical relationship between our variables unless we change the way the data is stored. We do this by adding a "dummy" column for each possible value of each categorical variable, containing a '1' or '0' indicating whether that category applies. This is displayed below.
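A sketch of the encoding step using pandas; the exact list of columns treated as multi-valued categoricals is an assumption:

```python
import pandas as pd

# Expand each multi-valued categorical column into 0/1 indicator columns,
# e.g. 'cp' becomes cp_0, cp_1, cp_2, cp_3
categorical_cols = ["cp", "restecg", "slope", "ca", "thal"]
df_encoded = pd.get_dummies(df, columns=categorical_cols)
df_encoded.head()
```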

More on dummy variables

Furthermore, since our variables are on different scales, and we primarily want to compare the variation between them, we should also normalize our data before training our models on it.

Why it is important to normalize data
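Putting the pieces together, here is a sketch of a preprocessing pipeline the later models could consume: encoding, splitting, then scaling fitted on the training portion only. The filename, column list, and random_state are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")  # filename assumed

# Dummy-encode the multi-valued categorical columns, then separate inputs from target
df_encoded = pd.get_dummies(df, columns=["cp", "restecg", "slope", "ca", "thal"])
X = df_encoded.drop(columns=["target"])
y = df_encoded["target"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```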

Logistic Regression Model

First, we'll use a simple model called logistic regression. This model fits an equation that takes in a set of input variables and uses them to predict our output variable, attempting to minimize 'loss.' Loss refers to the difference between what our model predicts the values to be and what they actually are; in this context, it reflects how often we correctly identify whether or not a patient has heart disease.
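A minimal sketch of fitting and scoring this model with scikit-learn, reusing the X_train, X_test, y_train, and y_test arrays from the preprocessing sketch above (the solver settings are assumptions, not taken from the original notebook):

```python
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression on the scaled training data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

print("Training accuracy:", log_reg.score(X_train, y_train))
print("Test accuracy:    ", log_reg.score(X_test, y_test))
```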

Above is a violin plot showing the range of the residuals (loss) between our actual and predicted data using the regression model, separated by type of chest pain. The average value of the residuals tends to be about zero, but there are a few interesting observations. First, this model has a number of outliers, suggesting that our logistic regression prediction could be better. Second, for patients who displayed chest pain of type 3 (asymptomatic chest pain unrelated to disease), our model was much less accurate at correctly predicting heart disease. Intuitively this makes some degree of sense: we would expect patients who already demonstrate chest pain indicative of some sort of angina to be more at risk of heart disease, and therefore easier to classify correctly.

According to our accuracy score, our model had an 85.95% accuracy rate on the training set and an 86.89% accuracy rate on the test set. This suggests that our model was fairly effective at predicting the presence of heart disease in patients. However, let's see if we can do even better with some other machine learning models:

K-Nearest Neighbors

K-Nearest Neighbors is a classification model that tries to predict the value of an output based on similar "points" in the dataset "close-by" to it. While it's a bit difficult to visualize what this might look like in higher dimensions, in theory a multi-dimensional scatter plot of our input data will demonstrate some type of grouping that separates patients with heart disease from those without, and the k-nearest neighbors model will use a measure of "distance" to determine the class of each test input.

More info on K-nearest neighbors
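A sketch of the corresponding scikit-learn model, again reusing the preprocessed train/test arrays; the choice of k = 5 neighbors is an assumption (it is scikit-learn's default), not necessarily what was used here:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbors (assumed)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```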

According to our accuracy score, the k-nearest neighbors model performed slightly worse than the logistic regression model, with an accuracy of 85.25% on the test set. One reason for this discrepancy might be the large number of input variables in our dataset; with more input variables, it becomes harder for a "distance" measure to meaningfully separate points (the so-called curse of dimensionality).

Decision Tree Classifier

Next we will consider a decision tree model. A decision tree is made up of nodes originating from a single root node; at each node, the model decides which branch to move down based on the value of an input variable. In this way, the decision tree repeatedly "splits" the data by "asking questions" until it classifies the input as either having heart disease or not. For this decision tree, we choose a max_depth of 10, meaning the tree will split at most 10 levels deep to decide on a classification.
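A sketch of the decision tree fit, using the max_depth of 10 mentioned above and the preprocessed train/test arrays (random_state is an arbitrary choice for reproducibility):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))
print("Test accuracy:    ", tree.score(X_test, y_test))
```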

In this case, our model performed with 100% accuracy on the training data but only 78% accuracy on the test data. The residuals also seem to include more outliers. This may suggest that our model is overfitted to the training data, though it may also suggest that a decision tree is simply not the best choice of model in this case, since lowering the number of levels did not improve the accuracy.

Support Vector Machine Classification

Next we will try another model, Support Vector Machine (SVM) classification. This classification method focuses on finding a 'dividing' line (more generally, a hyperplane) that separates the classes in our data. In particular, this model is effective at dealing with high-dimensional data, so we might expect it to perform better.

Support Vector Machine
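A sketch of an SVM classifier fit on the same preprocessed data; the RBF kernel is scikit-learn's default and an assumption here, not necessarily the kernel used originally:

```python
from sklearn.svm import SVC

svm = SVC(kernel="rbf")  # kernel choice assumed (scikit-learn default)
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```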

According to our accuracy score, the SVM model performed about as well as the logistic regression model, with a test accuracy of 86.89%. The residuals also look much the same as in the regression model. The similarity to the logistic regression results may be due to the relatively small size of the dataset (fewer than 500 rows).

Random Forest Classifier

Lastly, we'll try the Random Forest Classifier model. A random forest classifier consists of a "forest" of decision trees, each built from randomly selected observations and features rather than the full input, with the results of all of the trees averaged at the end. In this way, it lessens the risk of overfitting that a single decision tree might bring. For our random forest, we'll choose an n_estimators value of 1000, which means we'll build 1000 decision trees and average their results. Generally speaking, increasing this parameter improves the result but increases the computation time.

More info on random forest classifiers
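A sketch of the random forest fit with the n_estimators value of 1000 described above, on the same preprocessed train/test arrays:

```python
from sklearn.ensemble import RandomForestClassifier

# 1000 decision trees, each trained on a random sample of rows and features
forest = RandomForestClassifier(n_estimators=1000, random_state=42)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```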

The random forest classifier gives our best accuracy, 90.16% on the test data. This can likely be attributed to the randomized nature of the forest and the large number of estimators used in building the model.

Conclusion

Predicting the presence of heart disease is a difficult task, reliant on many factors that our analysis was not able to consider. For one, heart disease refers to a wide variety of diseases, and it is likely that many of the variables we considered in this project affect different subsets of these diseases in different ways. There are also many commonly considered risk factors that this project did not take into account: some of the best-known indicators are smoking, diabetes, weight, diet, physical activity, and alcohol use, none of which we touched on here. Furthermore, our dataset covers only patients from four particular hospitals, and a small set of patients at that. We do not know the demographics of these individuals, so we cannot know how skewed the data might be in that regard. To know more, we would have to find another dataset, make sure the data is comparable, and run our models on it.

However, we did gain some interesting insights from our data: namely, at least for our dataset, the random forest classifier did a fairly accurate job of predicting heart disease purely from the factors we considered. We established a significant correlation between several of our variables and the presence of heart disease, and we noticed that it was easier to predict the presence of heart disease in patients whose chest pain was not asymptomatic.

Further Information

Predicting heart disease: