Exploratory Data Analysis (EDA) – Example on Health Care Research (Part-2)
Let us perform Exploratory Data Analysis on a healthcare dataset using Python. The dataset used for this example is the Stroke Prediction Dataset from Kaggle. We start by importing the libraries needed for EDA.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Then, we can read the data into a pandas DataFrame.
df=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
The dataset contains data on 5110 individuals with 12 features: id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, and stroke. Using the df.head() command, we can print the first five rows of the dataset.
df.head()
We can now conduct the EDA after importing the dataset. To get basic information on the dataset, i.e., the number of non-null values per column, the data types, and the memory usage, run the info() command:
df.info()
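info() reports non-null counts and data types, but not how many distinct values a column holds. A minimal sketch of how that could be checked alongside it, using a small made-up frame (`toy`, a hypothetical stand-in for df, with illustrative values only):

```python
import pandas as pd

# Hypothetical stand-in for df; the rows are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [67.0, 61.0, 80.0, 49.0],
    "hypertension": [0, 0, 1, 0],
    "gender": ["Male", "Female", "Male", "Female"],
})

# Distinct values per column: an identifier has as many as there are rows,
# while categorical columns have only a handful.
unique_counts = toy.nunique()
dtypes = toy.dtypes

# Show both side by side, one row per column of the data.
print(pd.DataFrame({"dtype": dtypes, "unique": unique_counts}))
```

A column whose unique count equals the number of rows behaves like an identifier, while a low count usually points to a categorical attribute, even when it is stored as an integer.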
The above output shows that the attribute 'id' has 5110 entries, one per record, and that the attributes (age, bmi, avg_glucose_level) are numerical. In contrast, the attributes (gender, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status, stroke) are categorical, although hypertension, heart_disease, and stroke are stored as 0/1 integers.
First, we will find the missing values. To print the total number of missing values, run the following command:
print("There are {} missing values in the data.".format(df.isna().sum().sum()))
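The total alone does not say which columns are affected. A per-column breakdown could look like the sketch below; `toy` is a hypothetical stand-in for df with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 0, 0],
})

# Count missing values in each column, then keep only columns with gaps.
missing_per_column = toy.isna().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```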
As missing values affect the outcome of the analysis, we will replace the null values of bmi with the mean of the bmi column and check again to ensure that all missing values in the dataset have been correctly replaced.
# Assign the result back; a chained inplace replace may not update df in recent pandas versions
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print("There are {} missing values in the data.".format(df.isna().sum().sum()))
It's important to search for outliers in the bmi variable and identify how many of them are associated with a stroke. Here, a bmi above 50 is treated as an outlier.
bmi_outliers = df[df['bmi'] > 50]
bmi_outliers['bmi'].shape
The dataset has 79 outliers in total. However, only one of them is associated with a stroke. Here, we will replace the bmi outliers with the mean.
bmi_mean = df["bmi"].mean()  # compute the mean once, before any value is replaced
df["bmi"] = df["bmi"].mask(df["bmi"] > 50, bmi_mean)
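The number of outliers that carry a stroke label has to be counted before the replacement step. A minimal sketch of such a check, assuming the same bmi > 50 threshold and using a made-up frame (`toy`, hypothetical) rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "bmi": [28.1, 55.2, 61.0, 33.4, 52.7],
    "stroke": [0, 1, 0, 0, 0],
})

# Rows with bmi above 50 are treated as outliers, as in the text.
toy_outliers = toy[toy["bmi"] > 50]
n_outliers = len(toy_outliers)

# stroke is a 0/1 label, so its sum is the number of outliers with a stroke.
n_outliers_with_stroke = int(toy_outliers["stroke"].sum())

print(f"{n_outliers} outliers, {n_outliers_with_stroke} with a stroke")
```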
We will use the matplotlib library to visualize the relationships between different variables in our dataset. The analysis is shown below with a few graphical visualizations and their interpretation.
Bar Chart (Gender Distribution)
The gender distribution in the dataset shows that 59% of the individuals are female and 41% are male, with a single record labeled 'Other'. To simplify the data, we can recode this single record as male.
df['gender']=df['gender'].replace('Other','Male')
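The gender shares can be computed with value_counts and drawn as a bar chart. The sketch below assumes a small made-up frame (`toy`, hypothetical) in place of df, and the axis labels and colors are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({"gender": ["Female", "Male", "Female", "Other", "Female"]})

# Recode the single 'Other' record, as in the text.
toy["gender"] = toy["gender"].replace("Other", "Male")

# Share of each gender in percent.
gender_share = toy["gender"].value_counts(normalize=True) * 100

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(gender_share.index), gender_share.values, color=["#AF4343", "#C6AA97"])
ax.set_ylabel("Share of records (%)")
ax.set_title("Gender Distribution")
```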
Now that the outliers and missing values have been handled, it's time to create some more graphs from the data to discover additional information. Let us create a pie chart for the distribution of the dataset's outcome variable, 'stroke'. We can use the following code:
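fig, ax = plt.subplots(1, 1, figsize=(6, 6))
labels = ["No Stroke", "Stroke"]
# value_counts() sorts by frequency, so the majority class (no stroke) comes first
values = df['stroke'].value_counts().tolist()
# explode needs one offset per wedge; the second value is an assumption, as the source list was cut off
ax.pie(x=values, labels=labels, autopct="%1.1f%%", shadow=True, startangle=45, explode=[0.01, 0.01], colors=['#AF4343', '#C6AA97'])
ax.set_title("Stroke", fontdict={'fontsize': 12}, fontweight='bold')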
Similarly, we can plot pie charts for the other variables in the dataset.
All of these charts reveal a lot of important information about the dataset, such as:
· Only about 5% of people in the dataset are labeled as having had a stroke.
· Less than 10% of people have hypertension.
· A bit more than 5% of people have heart disease.
· The dataset has a roughly equal distribution of residence types, with about 50% of the population coming from rural regions.
· Over 65% of individuals have ever been married, and 57% work in the private sector.
· 37% of people have never smoked.
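Percentages like these can also be read directly from the data without plotting. A minimal sketch, assuming a made-up frame (`toy`, hypothetical) with a few of the categorical columns; the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "hypertension": [0, 0, 1, 0, 0],
    "ever_married": ["Yes", "Yes", "No", "Yes", "No"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Urban"],
})

# Percentage share of each category, per column.
shares = {
    col: (toy[col].value_counts(normalize=True) * 100).round(1)
    for col in toy.columns
}

for col, pct in shares.items():
    print(f"{col}:\n{pct.to_string()}\n")
```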
Next, we will plot a correlation matrix that gives us a general understanding of the correlations between the input variables and the target variable.
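One possible way to draw such a matrix is a seaborn heatmap over the Pearson correlations of the numeric columns. The sketch below assumes a small made-up frame (`toy`, hypothetical) in place of df; the color map and figure size are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 35.0, 23.0],
    "hypertension": [0, 0, 1, 0, 0, 0],
    "avg_glucose_level": [228.7, 202.2, 105.9, 171.2, 90.1, 85.3],
    "bmi": [36.6, 28.9, 32.5, 34.4, 24.0, 22.1],
    "stroke": [1, 1, 1, 0, 0, 0],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Pearson correlation is only defined for numeric columns,
# so text columns such as gender are dropped first.
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Matrix")
```

Text columns would need to be encoded numerically before they could appear in the matrix.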
Example 2: EDA in Retail
In the retail industry, EDA can be performed on a dataset with columns such as product category, sales, price, discount, sales region, and orders, in order to understand sales patterns, improve inventory management, and predict future demand. You can follow the steps from the previous example to practice EDA on the Superstore Sales Dataset available on Kaggle.
Example 3: EDA in Electronic Medical Records
An important task for organizations in the healthcare domain is maintaining electronic medical records. These are digital records of the medical history of visiting patients, such as previous hospitalizations, administered medicines, allergies, and vaccinations. You can explore the Diabetes 130-US hospitals for years 1999-2008 Data Set from the UCI repository to practice EDA along the same lines as in the previous example.
Conclusion
In this article, we have covered the basics and importance of EDA, its role in data science, and its key objectives. We discussed the different types of EDA and its tools, along with a brief comparison of graphical vs. non-graphical EDA. We also walked through a step-by-step example of EDA for the healthcare field.
Some key takeaways from this article are:
- EDA is subjective as it summarizes the features and characteristics of a dataset. So, depending on the project, data scientists can choose from the various plots discussed in this article to explore the data before applying machine learning algorithms.
- Since the nature of EDA depends on the data, we can say that it is an approach instead of a defined process.
- EDA presents hidden insights from data through visualizations such as graphs and plots.
- Graphical and non-graphical statistical methods can be used to perform EDA.
- Univariate analysis is simpler than multivariate analysis.
- The success of any EDA will depend on the quality and quantity of data, the choice of tools and visualization, and its proper interpretation by a data scientist.
- EDA is crucial in AI-driven businesses such as retail, e-commerce, banking and finance, agriculture, healthcare, and so on.