Before Analysis, Get Used to Cleaning Data First

Data cleaning is a mandatory activity that must be done when analyzing data. No matter how good you are at processing data, if the data you are analyzing is still messy and not clean, you can be sure that your data analysis results will be poor.

On this occasion, I will discuss what data cleaning is, what needs to be considered, and what steps need to be taken for data cleaning.

What is data cleaning?

Data cleaning is the process of correcting and identifying the data we are going to analyze. The data cleaning process can include handling missing values, duplicate data, and inconsistent formats. The purpose of data cleaning is to ensure that the data used is of high quality so that it produces the best possible analysis.

What you need to pay attention to

If you are new to data science, you will generally experience confusion about what parameters need to be considered in order for data to be considered clean or valid. Here are some things you need to do when cleaning data.

Data quality

Pay attention to the following data quality parameters so that the data you analyze is maximized.

Completeness: Does your data have missing values? Is the data you need there?
Relevance: Is the data you have useful for your analysis? If the data being analyzed is not useful, it will only make the process more difficult.
Consistency: Does the data you use have the same format? For example, is the weight of all products measured in kilograms, or are other units of measurement used?
Connectivity: Is the existing data interconnected? And can it be easily combined if necessary?
Accuracy: Is all data in its proper place? For example, does the product ID number match the product in reference?

Types of data errors

There are several types of data errors that commonly occur. Identify the errors so that you know what you need to do.

A missing value is a condition where there is empty data in a dataset.
Duplicate data, there are two identical data points in the dataset.
Outlier, an outlier is data that is different from the data in the dataset.
Inconsistent formats (e.g., weight units are in kg and lb)
There is unnecessary data (for example: if we want to analyze sentiment, symbols such as 😄💀@!#$ are not needed).

Data cleaning steps

After understanding what needs to be considered during data cleaning, it is now time to learn about the data cleaning process.

Data exploration

The first step is to explore the data, which involves checking whether the data you are using has missing values, duplicates, inconsistencies in units, or characters that you do not need. You can use Excel, R, or Python for this.

Handling missing values

Remove: If there are not too many missing values in the dataset, they can be removed.
Fill in: Use this method to fill in blank data.
- Use the mean, median, or mode for numeric data types.
- Use interpolation, also known as prediction, for time series data.
- Use default data for categorical data. (For example, in the gender column, if there is data with the name Ahmad Yusuf, you can label it as M).

Handling duplicate data

The first step you need to take is to identify whether duplicate data is allowed in the column or not. For example, in the gender column, you will definitely find a lot of duplicate data. Then, if there is unnecessary duplicate data, you need to delete it.

Handling outliers

Identify: Use statistical methods such as standard deviation, or use box plot visualizations to make it easier.
Handling: For outliers, you can delete, replace, or learn more according to your business requirements.

Data consistency

Standardize data formats: Ensure uniform formats, for example:
- Weight using the same unit, for example, only using kg without the lb unit
- For example, a person’s name using all lowercase letters
Data normalization: For example, if the data contains identical values but in different formats. For example, if the gender column contains “male” and “man,” you must change them to the same format.

Documentation

Document every step of the data cleaning process. This is necessary because if at any point we need to open the code that was created, we will understand how the coding works.

Mistakes should be avoided.

Deleting too much data: Do not delete too much data, especially empty data, because it can still be filled with the mean, median, or mode. Deleting too much data will affect the quality of your analysis.
Don’t validate data: Do not forget to check whether the data used is fully necessary for analysis.
Forget to document: Don’t forget to always record the changes you make to the data you handle. This will simplify the analysis development process.

Case study

To gain a better understanding of how to clean data, let’s try to learn together using the following dataset.

The data I used came from Kaggle. Generally, data from Kaggle is clean, meaning it does not contain duplicates or missing values. Since the data is clean, I will modify it according to my requirements. For example, I want to create machine learning that predicts chocolate sales at a certain time.

The data I need is the data in the “Amount” column and also the data in the “Date” column. To be able to analyze the Amount column, it is necessary to remove unnecessary attributes such as the $ symbol and also the comma.

# Convert "Amount" to Numeric
df['Amount'] = df['Amount'].replace('[\$,]', '', regex=True).astype(float)
df["Amount"].head()

Then, in the Date column, I will change the date format from 04/01/2022 to 2022-01-04. Here is the code.

# Date formating
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

The following is a clean data table that meets the requirements. Basically, data cleaning cannot be determined in specific stages because each dataset and problem to be answered has its own uniqueness. Always make sure you clean the data according to the context of the problem you want to answer.

Also read: Statistics in data science

References

Data Cleaning: Langkah Fundamental Sebelum Analisis dan Modeling Data

Data Cleaning in ML

Chocolate Sales Intelligence: Revenue Optimization