Exploratory Data Analysis (EDA) – Example on Health Care Research (Part-2)

Let us perform Exploratory Data Analysis on a healthcare dataset using Python. The dataset used for this example is the Stroke Prediction Dataset from Kaggle. We start by importing the libraries needed for EDA.

import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 


Then, we can read in the data as a pandas data frame. 

df=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv") 


The dataset contains data on 5110 individuals, described by 12 columns: id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, and stroke. Using the df.head() command, we can print the first five rows of the dataset.

df.head() 


We can now conduct the EDA after importing the dataset. To get the basic information on the dataset, i.e., the number of null values, data types, and memory utilization, run the info() command: 

df.info() 


The dataset has 5110 rows, and the 'id' attribute holds a unique value for each of them. The attributes age, bmi, and avg_glucose_level are numerical, whereas gender, hypertension, heart_disease, ever_married, work_type, Residence_type, smoking_status, and stroke are categorical.
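
Note that info() reports non-null counts and data types but not the number of unique values. One way to check unique counts and to separate numeric from non-numeric columns is sketched below. The small frame named toy is made-up illustrative data, not rows from the Kaggle file.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the stroke dataset
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "hypertension": [0, 0, 1, 0],
    "bmi": [36.6, None, 32.5, 34.4],
})

# Number of distinct values per column (missing values are not counted)
unique_counts = toy.nunique()

# Columns that are not stored as numbers. Flags such as hypertension are
# stored as 0/1 integers even though they are categorical in meaning,
# so they do not show up here.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(unique_counts)
print(non_numeric_columns)
```

On the real data, the same two calls would be run on df instead of toy.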

First, we will find the missing values. To print the exact number of missing values, run the following command: 

print("There are {} missing values in the data.".format(df.isna().sum().sum())) 


As missing values affect the outcome during analysis, we will replace the null values of bmi with the mean of the bmi column and check again to ensure that all missing values in the dataset have been correctly replaced. 

df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print("There are {} missing values in the data.".format(df.isna().sum().sum())) 
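
The grand total does not say which column the missing values come from. A per-column count makes that visible, and the same sketch shows mean imputation end to end. The values below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [30.0, np.nan, 34.0, np.nan],
})

# Missing values per column rather than one grand total
missing_per_column = toy.isna().sum()

# Mean imputation: pandas skips NaN when computing the mean
bmi_mean = toy["bmi"].mean()
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(missing_per_column)
print(toy["bmi"].tolist())
```

Assigning the result back to the column, rather than modifying it in place, is a deliberate choice here: it avoids depending on how a given pandas version treats in-place changes on a selected column.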


It is important to look for outliers in the bmi variable and to check how many of them are associated with the outcome (stroke). Here we treat bmi values above 50 as outliers.

bmi_outliers=df[df['bmi']>50] 
bmi_outliers['bmi'].shape 
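
The lines above count the outliers but do not show how to count those that also have a stroke outcome. A minimal sketch of that second count follows, assuming stroke is coded as 0/1 as in the dataset description. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "bmi": [28.1, 55.7, 61.2, 33.0, 52.3],
    "stroke": [0, 1, 0, 0, 0],
})

threshold = 50  # same cut-off as used in the text
outliers = toy[toy["bmi"] > threshold]

# Total number of outlier rows
n_outliers = len(outliers)

# Outlier rows that also have a stroke; summing works because stroke is 0/1
n_outliers_with_stroke = int(outliers["stroke"].sum())

print(n_outliers, n_outliers_with_stroke)
```

On the real data, the same sum would be taken over bmi_outliers['stroke'].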


The dataset has 79 bmi outliers in total. However, only one of them is associated with a stroke. Here, we will replace the bmi outliers with the mean.

bmi_mean = df["bmi"].mean()
df.loc[df["bmi"] > 50, "bmi"] = bmi_mean


We will use the matplotlib library to visualize the relationship between different variables in our dataset. The analysis is shown below with a few graphical visualizations and their interpretation. 

Bar Chart (Gender Distribution) 

Looking at the gender distribution in the dataset, about 59% of the individuals are female and 41% are male, and a single record is labeled 'Other'. To simplify the data, we can recode this single record as 'Male'.

df['gender']=df['gender'].replace('Other','Male')
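
The bar chart itself can be drawn along the following lines. This is a sketch on made-up rows, and the colors, labels, and figure size are arbitrary choices rather than the ones used for the original figure.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only
toy = pd.DataFrame({"gender": ["Female", "Male", "Female", "Other", "Female"]})

# Same recoding as in the text: fold the single 'Other' record into 'Male'
toy["gender"] = toy["gender"].replace("Other", "Male")

# Share of each gender in percent
gender_share = toy["gender"].value_counts(normalize=True) * 100

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(gender_share.index, gender_share.values, color=["#AF4343", "#C6AA97"])
ax.set_ylabel("Share of individuals (%)")
ax.set_title("Gender distribution", fontweight="bold")
```

In a script, plt.show() would display the figure; in a notebook it is rendered automatically.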


Now that the outliers and missing values have been handled, it's time to create some more graphs from the data to discover additional information. Let us create a pie chart of the distribution of the outcome variable 'stroke'. We can use the following code:

fig, ax = plt.subplots(1,1, figsize = (6,6)) 
labels = ["No Stroke", "Stroke"] 
values = df['stroke'].value_counts().tolist() 
ax.pie(x=values, labels=labels, autopct="%1.1f%%", shadow=True, startangle=45,
       explode=[0.01, 0.01],  # second offset was cut off in the source; 0.01 is assumed
       colors=['#AF4343', '#C6AA97'])
ax.set_title("Stroke", fontdict={'fontsize': 12},fontweight ='bold') 


Similarly, we can plot pie charts for other variables in the dataset.  
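
One way to avoid repeating the plotting code for every variable is a small helper that is called once per column. The function name plot_pie and the toy rows below are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows for illustration only
toy = pd.DataFrame({
    "hypertension": [0, 0, 1, 0],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban"],
})

def plot_pie(frame, column, ax):
    """Draw a pie chart of the value counts of one column on the given axes."""
    counts = frame[column].value_counts()
    ax.pie(
        x=counts.values,
        labels=counts.index.astype(str).tolist(),
        autopct="%1.1f%%",
        startangle=45,
    )
    ax.set_title(column, fontweight="bold")
    return counts

columns = ["hypertension", "Residence_type"]
fig, axes = plt.subplots(1, len(columns), figsize=(10, 5))

# One pie chart per column, side by side
pie_counts = {}
for ax, column in zip(axes, columns):
    pie_counts[column] = plot_pie(toy, column, ax)
```

On the real data, the list of columns could be extended with heart_disease, ever_married, work_type, and smoking_status.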

All of these charts reveal a lot of important information about the dataset, such as: 

·        Only about 5% of people in the dataset have had a stroke. 

·        Less than 10% of people have hypertension. 

·        A bit more than 5% of people have heart disease. 

·        The dataset has an equal distribution of residence types, with 50% of the population coming from rural regions. 

·        Over 65% of individuals are married, and 57% work in the private sector. 37% of people don't smoke at all. 

Next, we will plot a correlation matrix that gives us a general understanding of the correlations between the input and the target variables.
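
A minimal sketch of such a correlation matrix is given below, drawn as a seaborn heatmap. It uses made-up rows, restricts the calculation to numeric columns, and relies on the default Pearson correlation. The color map and figure size are arbitrary choices.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up rows for illustration only
toy = pd.DataFrame({
    "age": [25.0, 40.0, 55.0, 70.0, 82.0],
    "avg_glucose_level": [85.0, 95.0, 110.0, 150.0, 210.0],
    "bmi": [22.0, 27.5, 31.0, 29.0, 26.0],
    "gender": ["Female", "Male", "Female", "Male", "Female"],
    "stroke": [0, 0, 0, 1, 1],
})

# Correlation is only defined for numeric columns, so text columns are dropped
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix", fontweight="bold")
```

Text columns such as gender would first need to be encoded as numbers if they are to appear in the matrix.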

Example 2: EDA in Retail 

In the retail industry, EDA can be performed on a dataset consisting of various columns such as product categories, sales, price, discounts, region of sales, orders, etc., for understanding sales patterns, improving inventory management, predicting future demands, etc. You can follow the steps mentioned in the previous example to practice EDA for a Superstore Sales Dataset available on Kaggle. 

Example 3: EDA in Electronic Medical Records

An important aspect for organizations in the healthcare domain is maintaining electronic medical records. These are digital records of the medical history of the visiting patients, such as any previous hospitalization, administered medicines, allergies or vaccinations, etc. You can explore the UCI repository Diabetes 130-US hospitals for years 1999-2008 Data Set for practicing EDA on similar lines as given in the previous example. 

Conclusion  

In this article, we covered the basics and importance of EDA, its role in data science, and its key objectives. We discussed the different types of EDA and its tools, along with a brief comparison of graphical vs. non-graphical EDA. We also walked through a step-by-step example of EDA in the healthcare field.  

Some key takeaways from this article are:

  • EDA is subjective as it summarizes the features and characteristics of a dataset. So, depending on the project, data scientists can choose from the various plots discussed in this article to explore the data before applying machine learning algorithms.  
  • Since the nature of EDA depends on the data, we can say that it is an approach instead of a defined process.  
  • EDA presents hidden insights from data through visualizations such as graphs and plots. 
  • Graphical and non-graphical statistical methods can be used to perform EDA.  
  • Univariate analysis is simpler than multivariate analysis. 
  • The success of any EDA will depend on the quality and quantity of data, the choice of tools and visualization, and its proper interpretation by a data scientist. 
  • EDA is crucial in AI-driven businesses such as retail, e-commerce, banking and finance, agriculture, healthcare, and so on.
