Data Wrangling Using Python (Part 1) - Identifying and handling missing values.


There's always a pattern that can be found in "chaos". 


Introduction.

A couple of years ago, I was approached by a recruiter because she found that my LinkedIn profile matched a particular job description she was recruiting for. She contacted me and asked me to email my CV to her, which I did, and then waited for a response from the company. The company was looking for someone to fill a Junior Data Analyst position.

These were her exact words from the company: "Daniel, although you have an impressive CV, the company says you do not have 'enough' experience in Data Visualization Techniques and Data Wrangling." I was okay with the response and took those words in good faith. They sent me on a journey of self-introspection and a search for courses, projects, and even the meaning of Data Wrangling, although I "knew" what Data Wrangling is and I have enough experience using Data Visualization tools like Power BI and Tableau, with projects to show for it. All of these are on my GitHub repository page.

Link to my Github page : https://github.com/Daniel-Kwasi-Kpeglo

Pre-requisites:

1. Data to be analyzed (very important). Data can be obtained from Kaggle for personal practice.

2. The Python programming language and its libraries, such as "NumPy", "Pandas", and "Matplotlib".


What is Data Wrangling? 

Data wrangling, sometimes referred to as data cleaning or data preprocessing, is a vital phase in the data science, data analytics, and business intelligence process in which raw data is cleaned, converted, and made ready for modeling and analysis. It ensures your data is suitable for analysis and modeling. Typical data manipulation techniques include handling missing values, standardizing data, and identifying and correcting data types.

In this series, we will be dealing with some of the processes involved in Data Wrangling. Our next episode will deal with other aspects of handling data before subjecting it to Exploratory Data Analysis or Data Science techniques.


1. Identifying and handling missing values

When data is collected and ready to be subjected to analytical techniques, the first thing an Analyst needs to do is clean the data by checking for missing values. Missing values can greatly disrupt the outcome of your analysis in a way that affects its intended purpose. No matter the technique you will be using to analyze the data, be it Regression (finding the relationship between attributes), Classification, Clustering, or even SVM (Support Vector Machines), etc., missing data must first be dealt with.

When data is read and loaded into the Jupyter Notebook, there might be omissions such as NaN (not a number), blank spaces, and "?", signifying that data is missing under an attribute. We use the syntax;

"df.head()"; to display the data read from the source.

NB: 

I hope we all know how to read data from other sources using the Python library, "Pandas"?

syntax: df = pd.read_(type of data source)
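For example, here is a minimal sketch of loading a CSV file with Pandas; the file name "automobile.csv" is just a placeholder for your own data source:

import pandas as pd

# read the CSV file into a dataframe (placeholder file name)
df = pd.read_csv("automobile.csv")

# display the first few rows to confirm the data loaded correctly
print(df.head())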

a. Dealing with missing values

Missing values occur when no data value is stored for a feature in an observation. Within a particular attribute (column) or row, an observable feature might have "?", 0, N/A, or just a blank cell. The Analyst must replace such values with "NaN". Below is the syntax;

.replace(A, B, inplace=True)

A is the "outgoing" feature or feature to be "replaced".

B is the "incoming" feature or feature to "replace".

The code in the Cell of the Jupyter Notebook will read;

df.replace("?", np.nan, Inplace=True)

df.head(5);  Syntax to read the first 5 rows of the dataframe.

df.tail(5); Syntax to read the last 5 rows of the dataframe.
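Putting it all together, a minimal, self-contained sketch (assuming, as above, that missing entries in your source are marked with "?" and that the file name is a placeholder):

import numpy as np
import pandas as pd

df = pd.read_csv("automobile.csv")   # placeholder file name

# replace the "?" placeholder with NumPy's NaN so Pandas recognizes the missing values
df.replace("?", np.nan, inplace=True)

print(df.head(5))   # first 5 rows
print(df.tail(5))   # last 5 rows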

b. Evaluate for missing values.

Once the placeholder markers have been converted to NaN, Pandas can detect them. You can use the following functions to identify missing values. Each gives a "Boolean (yes or no, True or False)" outcome for every value in the dataframe.

.isnull()

.notnull()

The syntax will read;

df.isnull()

Since we are working with missing data, we can write the outcome as;

missing_data = df.isnull()

missing_data.head(5); this displays the first 5 rows of the data frame.

This gives us a Boolean indicating "whether" each value is in fact a missing value. "True" in this sense means the value is indeed a missing value, and "False" means it is not.
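As a quick illustration, here is a toy dataframe (made up purely for demonstration) and the Boolean mask it produces:

import numpy as np
import pandas as pd

# a tiny, hypothetical dataframe with two missing values
toy = pd.DataFrame({"price": [13495, np.nan, 16500],
                    "doors": ["two", "four", np.nan]})

print(toy.isnull())
# output (approximately):
#    price  doors
# 0  False  False
# 1   True  False
# 2  False   True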

c. Count missing values in each column

To count the "number" of missing values in a dataset or dataframe, we use the syntax; ".value_counts()"

One may wonder why we must do this. Assuming you want to determine a dependent variable (outcome or response) based on certain indicators or independent variables, knowing how much data is missing lets us decide whether to drop or replace the missing values in a row or column without affecting the outcome (this will be explained in detail later). To count the number of missing values, we write a loop in Python;

for column in missing_data.columns.values.tolist():

    print(column)

    print(missing_data[column].value_counts())

    print(" ")

The loop above gives us the number of missing values in each attribute or column.
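Pandas also offers a more concise alternative to the loop above:

# sums the True values of the Boolean mask, giving the number of missing entries per column
print(df.isnull().sum())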

d. Dealing with Missing Data.

These are the main ways to deal with the missing data identified above.

i. Drop the data

    a. Drop the entire row 

    b. Drop the entire column

ii. Replace missing values

    a. Replace with mean

    b. Replace with frequency

    c. Replace it with any other function.

Columns should only be removed in their entirety if the majority of their entries are blank; if no column in the data set is that empty, do not drop entire rows or columns. While you have some flexibility in replacing data, certain approaches could make more sense than others.

1. Dropping rows or columns (see the sketch below);

drop(); specifying the row or column to be dropped (axis = 0 for rows, axis = 1 for columns). The columns can also be specified in a list.

df.drop(["column1", "column2"], axis=1); this is suitable for labels.

2. The mean (avg) of the column can be calculated by using the syntax;

.mean()

For instance, the average for a particular column can be;

avg_column = df["column"].astype(data type).mean(axis = 0)

print("avg of column:", avg_column)

After calculating the mean of the column, we use the ".replace()" syntax to replace the missing values in the dataframe with it.

syntax; df["column1"].replace(np.nan, column1, inplace = True)

You can use the syntax "print(df)" to see the dataframe with the new set of mean values. 
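For example, a minimal sketch assuming a numeric column named "price" (a placeholder column name) that contains NaN values:

# compute the mean, converting to float first because the "?" placeholders were read in as text
avg_price = df["price"].astype("float").mean(axis=0)
print("average price:", avg_price)

# replace the missing values with the mean
df["price"].replace(np.nan, avg_price, inplace=True)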

For the most common value or feature (frequency), we can use the syntax;

idxmax()

syntax: df["column"].value_counts().idxmax()

The replacement procedure is similar to the one used for the mean value.

syntax: df["column"].replace(np.nan, "four", inplace=True)

You can use the syntax "print(df)" to see the dataframe with the new set of frequency values.
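A fuller sketch, assuming a categorical column named "num-of-doors" (a placeholder column name):

# find the most frequent value in the column
most_common = df["num-of-doors"].value_counts().idxmax()
print("most common value:", most_common)

# replace the missing values with the most frequent value
df["num-of-doors"].replace(np.nan, most_common, inplace=True)

print(df)   # inspect the dataframe with the new set of values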


In our next episode, we will be discussing Data Normalization, Data standardization, and Binning.




References.

1. https://cognitiveclass.ai/






Daniel Kwasi Kpeglo

Physicist| SQL| Python| HTML| CSS| JavaScript| Looker| LookML| BigQuery| Power BI| Tableau| GCCP

Business page: https://dannyconsult.netlify.app/

