Predicting Stroke Patients (EDA)

I attended the healthcare management data challenge at UConn last year. It gave me a better understanding of the healthcare industry. I cannot show the results for confidentiality reasons, so here I picked a similar dataset from Kaggle to show what I coded in the data challenge. I hope you enjoy it!

Photo by National Cancer Institute on Unsplash

In this article, I will share data processing and exploratory data analysis (EDA) using Python.

Data source: Stroke Prediction Dataset from Kaggle

Let’s get started…

First, I import the required packages: numpy, pandas, matplotlib, seaborn, and sklearn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
# Load the Kaggle data. The file name is an assumption; point it at your own download.
df = pd.read_csv('healthcare-dataset-stroke-data.csv')

Next, find out whether there are any missing values in our dataset. We can see that “bmi” has 201 missing values.

df.isnull().sum()

So, how can we deal with blanks in the data? One can simply drop these records, or fill the blanks with the mean or the median. Here I will use a decision tree to predict the missing BMI from age and gender.

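# Pipeline: scale the features, then fit a decision tree regressor
DT_bmi_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('dt', DecisionTreeRegressor(random_state=42))
])

# Use age and gender to predict BMI
X = df[['age', 'gender', 'bmi']].copy()
# Encode gender as integers; int8 (not uint8) so that -1 for 'Other' is kept
X['gender'] = X['gender'].map({'Male': 0, 'Female': 1, 'Other': -1}).astype(np.int8)

# Split into rows with a missing BMI and rows with a known BMI
Missing = X[X.bmi.isna()]
X = X[~X.bmi.isna()]
Y = X.pop('bmi')

# Train on the known rows, then fill the gaps with the predictions
DT_bmi_pipe.fit(X, Y)
predicted_bmi = pd.Series(DT_bmi_pipe.predict(Missing[['age', 'gender']]), index=Missing.index)
df.loc[Missing.index, 'bmi'] = predicted_bmi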

print('Missing value: ', sum(df.isnull().sum()))

We’ve replaced all missing values. Now, we can move to the next step.
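Before moving on, here is a minimal sketch of the simpler options mentioned above (dropping the records, or filling with the mean or the median), for comparison. It runs on a tiny made-up DataFrame, not the Kaggle data, and the name toy is only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (not the Kaggle dataset): two of five BMI values are missing
toy = pd.DataFrame({
    "age": [25, 40, 55, 70, 80],
    "bmi": [22.0, np.nan, 30.0, np.nan, 41.0],
})

# Option 1: drop every record with a missing BMI
dropped = toy.dropna(subset=["bmi"])

# Option 2: fill the blanks with the column mean
mean_filled = toy["bmi"].fillna(toy["bmi"].mean())

# Option 3: fill the blanks with the column median
median_filled = toy["bmi"].fillna(toy["bmi"].median())

print(len(dropped))
print(mean_filled.tolist())
print(median_filled.tolist())
```

Dropping loses records, while the mean and median give every missing row the same value. The decision tree lets the filled value vary with age and gender instead.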

Exploratory Data Analysis (EDA)

First, let’s look at the numeric/continuous variable distribution.

# Pick the continuous variables
variables = [variable for variable in df.columns if variable not in ['id', 'stroke']]
conts = ['age', 'avg_glucose_level', 'bmi']

# Plot numeric variable distribution
background_color = "#fafafa"
fig = plt.figure(figsize=(12, 12), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(4, 3)
gs.update(wspace=0.1, hspace=0.4)

# Create one styled axis per continuous variable in the first grid row
axes = []
for col in range(0, 3):
    ax = fig.add_subplot(gs[0, col])
    ax.set_facecolor(background_color)
    ax.tick_params(axis='y', left=False)
    ax.get_yaxis().set_visible(False)
    for s in ["top", "right", "left"]:
        ax.spines[s].set_visible(False)
    axes.append(ax)
ax0, ax1, ax2 = axes

# Draw a filled density curve for each variable (fill=True replaces the deprecated shade=True)
for ax, variable in zip(axes, conts):
    sns.kdeplot(df[variable], ax=ax, color='#0f4c81', fill=True, linewidth=1.5, ec='black',
                alpha=0.9, zorder=3, legend=False)
    ax.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1, 5))

ax0.set_xlabel('Age')
ax1.set_xlabel('Avg. Glucose Levels')
ax2.set_xlabel('BMI')

ax0.text(-20, 0.022, 'Numeric Variable Distribution', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-20, 0.02, 'We see a positive skew in BMI and Glucose Level', fontsize=13, fontweight='light',
         fontfamily='serif')

plt.show()

Now, we’ve gained some understanding of the distribution of numeric variables, but we can add more information to this plot.

Let’s see how the numeric variable distributions differ between those who had a stroke and those who did not.

fig = plt.figure(figsize=(12, 12), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(4, 3)
gs.update(wspace=0.1, hspace=0.4)

# Create one styled axis per continuous variable in the first grid row
axes = []
for col in range(0, 3):
    ax = fig.add_subplot(gs[0, col])
    ax.set_facecolor(background_color)
    ax.tick_params(axis='y', left=False)
    ax.get_yaxis().set_visible(False)
    for side in ["top", "right", "left"]:
        ax.spines[side].set_visible(False)
    axes.append(ax)
ax0, ax1, ax2 = axes

# Split the data into stroke and no-stroke patients
s = df[df['stroke'] == 1]
ns = df[df['stroke'] == 0]

# Overlay the two density curves for each variable
for ax, feature in zip(axes, conts):
    sns.kdeplot(s[feature], ax=ax, color='#0f4c81', fill=True, linewidth=1.5, ec='black',
                alpha=0.9, zorder=3, legend=False)
    sns.kdeplot(ns[feature], ax=ax, color='#9bb7d4', fill=True, linewidth=1.5, ec='black',
                alpha=0.9, zorder=3, legend=False)
    ax.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1, 5))

ax0.set_xlabel('Age')
ax1.set_xlabel('Avg. Glucose Levels')
ax2.set_xlabel('BMI')

fig.legend(labels=['Stroke', 'No Stroke'])

ax0.text(-20, 0.056, 'Numeric Variables by Stroke & No Stroke', fontsize=20, fontweight='bold', fontfamily='serif')
ax0.text(-20, 0.05, 'Age looks to be a prominent factor - this will likely be a salient feature in our models',
         fontsize=13, fontweight='light', fontfamily='serif')

plt.show()

According to the plot, it seems clear that age is a big factor for stroke patients. The older you get, the more at risk you are.

The distributions of average glucose level and BMI do not show a big difference between stroke and no-stroke patients.
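A quick numeric check can back up what the density plots suggest. The sketch below compares the mean and median of each numeric variable by stroke status. It uses a small made-up DataFrame with the article's column names, so the numbers are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Synthetic toy data (not the Kaggle dataset), using the article's column names
toy = pd.DataFrame({
    "stroke":            [0, 0, 0, 0, 1, 1],
    "age":               [20, 35, 50, 45, 70, 80],
    "avg_glucose_level": [85.0, 95.0, 110.0, 90.0, 100.0, 120.0],
    "bmi":               [22.0, 27.0, 31.0, 28.0, 29.0, 27.0],
})
conts = ["age", "avg_glucose_level", "bmi"]

# Mean and median of each numeric variable, split by stroke status
summary = toy.groupby("stroke")[conts].agg(["mean", "median"])
print(summary)
```

On the real data, the same call would be `df.groupby('stroke')[conts].agg(['mean', 'median'])`.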

Let’s explore those variables further…

str_only = df[df['stroke'] == 1]
no_str_only = df[df['stroke'] == 0]

# Set up figure and axes
fig = plt.figure(figsize=(10, 16), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(4, 2)
gs.update(wspace=0.5, hspace=0.2)
ax0 = fig.add_subplot(gs[0, 0:2])
ax1 = fig.add_subplot(gs[1, 0:2])

ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)

# glucose

# Pass x by keyword: recent seaborn versions no longer accept it positionally
sns.regplot(x=no_str_only['age'], y=no_str_only['avg_glucose_level'],
            color='lightgray',
            logx=True,
            ax=ax0)

sns.regplot(x=str_only['age'], y=str_only['avg_glucose_level'],
            color='#0f4c81',
            logx=True, scatter_kws={'edgecolors': ['black'],
                                    'linewidth': 1},
            ax=ax0)

ax0.set(ylim=(0, None))
ax0.set_xlabel(" ", fontsize=12, fontfamily='serif')
ax0.set_ylabel("Avg. Glucose Level", fontsize=10, fontfamily='serif', loc='bottom')

ax0.tick_params(axis='x', bottom=False)
ax0.get_xaxis().set_visible(False)

for s in ['top', 'left', 'bottom']:
    ax0.spines[s].set_visible(False)

# bmi
sns.regplot(x=no_str_only['age'], y=no_str_only['bmi'],
            color='lightgray',
            logx=True,
            ax=ax1)

sns.regplot(x=str_only['age'], y=str_only['bmi'],
            color='#0f4c81', scatter_kws={'edgecolors': ['black'],
                                          'linewidth': 1},
            logx=True,
            ax=ax1)

ax1.set_xlabel("Age", fontsize=10, fontfamily='serif', loc='left')
ax1.set_ylabel("BMI", fontsize=10, fontfamily='serif', loc='bottom')

for s in ['top', 'left', 'right']:
    ax0.spines[s].set_visible(False)
    ax1.spines[s].set_visible(False)

ax0.text(-5, 350, 'Strokes by Age, Glucose Level, and BMI', fontsize=18, fontfamily='serif', fontweight='bold')
ax0.text(-5, 320, 'Age appears to be a very important factor', fontsize=14, fontfamily='serif')

ax0.tick_params(axis=u'both', which=u'both', length=0)
ax1.tick_params(axis=u'both', which=u'both', length=0)

# Add legend to the plot
fig.legend(labels=['Stroke','No Stroke'])

plt.show()

As we suspected, age is a critical factor, and there are also slight relationships with BMI and average glucose level.
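To put a rough number on these relationships, one option is to look at the correlation of age with the other two variables. This is a minimal sketch on made-up data, not the Kaggle dataset, and it uses the default Pearson correlation, which is my choice here rather than something from the original analysis.

```python
import pandas as pd

# Synthetic toy data (not the Kaggle dataset), using the article's column names
toy = pd.DataFrame({
    "age":               [20, 30, 40, 50, 60, 70],
    "avg_glucose_level": [80.0, 95.0, 90.0, 105.0, 100.0, 130.0],
    "bmi":               [21.0, 26.0, 24.0, 30.0, 27.0, 29.0],
})

# Pearson correlation of every column with age, leaving out age itself
corr_with_age = toy.corr()["age"].drop("age")
print(corr_with_age)
```

A value near 0 means almost no linear relationship, while a value near 1 or -1 means a strong one.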

Next, I will visualize how the risk of having a stroke rises as age increases.

fig = plt.figure(figsize=(10, 5), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(2, 1)
gs.update(wspace=0.11, hspace=0.5)
ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

df['age'] = df['age'].astype(int)

# Cumulative stroke rate: share of strokes among everyone younger than each age
rate = []
for i in range(df['age'].min(), df['age'].max()):
    rate.append(df[df['age'] < i]['stroke'].sum() / len(df[df['age'] < i]['stroke']))

sns.lineplot(data=rate, color='#0f4c81', ax=ax0)

for s in ["top", "right", "left"]:
    ax0.spines[s].set_visible(False)

ax0.tick_params(axis='both', which='major', labelsize=8)
ax0.tick_params(axis=u'both', which=u'both', length=0)

ax0.text(-3, 0.055, 'Risk Increase by Age', fontsize=18, fontfamily='serif', fontweight='bold')
ax0.text(-3, 0.047, 'As age increases, so too does the risk of having a stroke', fontsize=14, fontfamily='serif')

plt.show()

# Waffle chart of the share of stroke patients in the dataset
from pywaffle import Waffle

fig = plt.figure(figsize=(7, 2),dpi=150,facecolor=background_color,
FigureClass=Waffle,
rows=1,
values=[1, 19],
colors=['#0f4c81', "lightgray"],
characters='⬤',
font_size=20,vertical=True,
)

fig.text(0.035,0.78,'People Affected by a Stroke in our dataset',fontfamily='serif',fontsize=15,fontweight='bold')
fig.text(0.035,0.65,'This is around 1 in 20 people [249 out of 5000]',fontfamily='serif',fontsize=10)

plt.show()

You may notice the low risk values on the y-axis. This is because the dataset is imbalanced: it contains only 249 strokes out of roughly 5,000 records, around 1 in 20 people. In the next article, I will balance the dataset by oversampling with SMOTE.
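Two quick checks go well with this plot: the share of each class, and the stroke rate within separate age bands (the line plot above is cumulative). The sketch below is only an illustration on made-up data. The band edges and labels are my own assumptions, not part of the original analysis.

```python
import pandas as pd

# Synthetic toy data (not the Kaggle dataset)
toy = pd.DataFrame({
    "age":    [5, 15, 25, 35, 45, 55, 62, 68, 75, 82],
    "stroke": [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
})

# Share of each class: a heavily skewed split signals an imbalanced target
class_share = toy["stroke"].value_counts(normalize=True)

# Stroke rate within each age band (band edges are illustrative assumptions)
bands = pd.cut(toy["age"], bins=[0, 40, 60, 100], labels=["0-40", "41-60", "61+"])
rate_by_band = toy.groupby(bands, observed=True)["stroke"].mean()

print(class_share)
print(rate_by_band)
```

Because the target column is 0 or 1, the mean of each band is simply the fraction of people in that band who had a stroke.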

General Overview

We’ve assessed a few variables so far, and gained some powerful insights.

Last but not least, I’ll plot several variables in one place to spot any trends in our features.

You can see some other interesting insights from the categorical variables.

  • People who formerly smoked tend to have more strokes. However, among those who never smoked, there is no clear trend either way.
  • People who work for a private company tend to have more strokes.
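One way to check observations like these is a cross-tabulation of a categorical variable against the stroke flag. This is a minimal sketch on made-up data, not the code behind the plots. The column name smoking_status and its category labels are assumptions about how the Kaggle file is laid out.

```python
import pandas as pd

# Synthetic toy data (not the Kaggle dataset); column name and labels are assumed
toy = pd.DataFrame({
    "smoking_status": ["formerly smoked", "formerly smoked", "never smoked",
                       "never smoked", "smokes", "smokes"],
    "stroke": [1, 0, 0, 0, 1, 0],
})

# Row-normalised crosstab: within each smoking group, the share with and without a stroke
stroke_share = pd.crosstab(toy["smoking_status"], toy["stroke"], normalize="index")
print(stroke_share)
```

Normalising by row makes groups of different sizes comparable, which matters when one category holds most of the records.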

In conclusion, age is still the most important factor among these variables.

Indeed, anyone can have a stroke at any age. But you cannot deny that older adults are more at risk of developing high blood pressure, high cholesterol, heart disease, and diabetes, the common causes of stroke. Therefore, the best way to protect yourself and your family is to understand your risk and how to control it.

** To keep the article clear and focused on the insights, the code for the last plots is not shown here. However, in the next article on predicting stroke patients, I’ll share the code from my GitHub.**

Thank you for reading! See you next time:)

Data Analyst. Master’s in Business Analytics @UConn. More than 2 years experience in data management.