Are You Sure Your Data Is Completely Clean? Use This Checklist to Help.
The very first time I analyzed data was in a course for a degree I am working on. We used a tool called WEKA that makes it incredibly easy to run machine learning algorithms. You just upload the data, pick the features you want to use, and choose an algorithm. (Of course, it takes a lot more than that to actually get a useful model.) Because it was so easy, I started experimenting with running models on the data we were given before even looking at the data directly. I was amazed at how quickly you could get results. At the time, I didn’t realize how meaningless those results were, since I hadn’t cleaned or processed the data in any way. When I finally got around to looking more closely at the data, it turned out there were a lot of issues that needed to be addressed. I spent most of the rest of the course addressing them before I returned to WEKA to build the models for my end-of-course project.
Most of the resources the school provided, and the ones I found myself, focused on how machine learning algorithms work. Beyond noting that data cleaning was crucial and time-consuming, they offered little guidance on how to actually clean and process data. As a result, I cleaned and processed things as I noticed issues, relying largely on instinct about what seemed to make sense. (As I learned more about data analysis later, I was pleased to discover that almost all of what I did is considered valid practice.) However, this ad-hoc approach wasn’t the most efficient way to go about it.
I’ve seen statistics saying data professionals spend anywhere from 45 to 80 percent of their time on a project cleaning and processing data. The time I spent on that project was probably closer to 90%. Data cleaning and processing is a very time-intensive part of data analysis, and one that is worth spending the time on and doing well. However, it also makes sense to do it systematically. To help with this, I created a checklist.
The checklist applies to data in “record” form. Record data consists of many instances (“objects”), each with the same set of attributes, and is usually stored in a spreadsheet, CSV file, or database. Data in other forms, such as text data or data from web scraping, will need different cleaning and processing methods. Below I have written a bit more about what each item on the checklist means and why it is important.
Types of Errors to Check For:
Missing Values: These will need to be dealt with in some way. Here are some things to consider:
- Is the reason why they are missing informative? Sometimes data is missing for a specific reason. If this is the case, finding out why it is missing could provide valuable insights, while simply removing it could negatively affect the analysis. For example, you might have a lot of missing data for the age attribute. Perhaps this is because most people over 60 don’t enter their age for fear of age discrimination. Removing all instances without an age could hurt your analysis because you would lose insights into this segment of the population. For this reason, it can be worth checking whether there is something distinctive about the instances with missing data. One way of doing this is to fill in the missing data with a standard value (for example, “MISSING”) and see if there are any unique patterns related to it.
- Is the missing data vital to the analysis? If you are trying to predict profits and 75% of the profits are missing, you will need to decide if you can proceed with less data or if you need to collect more. If 75% of the salespeople’s names are missing instead, you might still be able to predict profits without this information.
- If you decide to remove missing values, is it better to remove the instances where the values are missing or the attribute with the missing values? This is likely to depend on how important you think the attribute is and how many missing values there are.
- Should you fill in the missing values instead of removing them? This also depends on the number of missing values. If there are just a few, it might be valid to fill them in. However, filling them in would greatly affect the results if a large percentage of the data is missing. If you choose to fill in the missing values, you will then need to figure out the best way to do so for your data (a few common options are listed here and sketched in code after this list):
- Use the average of the nearest neighbors. For example, in a recent analysis, there were a lot of missing values for an attribute called “percent_use”. In this data set, there was another value called “engagement_index”. Both values were different ways of measuring how much users were engaging with websites. I looked for instances that had “engagement_index” scores that were very close to the “engagement_index” scores of the instances with the missing “percent_use” values. I then took the average of the “percent_use” for these “neighbors” and used it to fill in the missing values.
- Use the most common value. This works best for categorical data.
- Fill in random values weighted by the distribution of that particular attribute.
- Use the mean or median of the attribute. This only works for quantitative data. It can be a good idea to experiment to see which works best. If there are outliers, the median is usually the safer choice because it is less affected by extreme values.
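Here is a minimal sketch of a few of these fill-in strategies using pandas; the DataFrame and column names are hypothetical, and the right choice still depends on your data:

```python
import pandas as pd

# Hypothetical example data with some missing values.
df = pd.DataFrame({
    "age": [34, None, 61, 29, None],
    "city": ["Boston", "Denver", None, "Boston", "Boston"],
    "percent_use": [0.4, 0.9, None, 0.3, 0.7],
})

# Flag missing values first to look for patterns before deciding what to do.
df["age_missing"] = df["age"].isna()

# Fill a categorical column with its most common value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fill a numeric column with its median (less affected by outliers than the mean).
df["percent_use"] = df["percent_use"].fillna(df["percent_use"].median())

# Or drop any rows still missing a value you consider essential.
df = df.dropna(subset=["age"])
```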
Data Entry errors:
- Spelling mistakes: These can impact the results by creating extra “categories”. They can often be checked automatically. Another way to check is to look at the unique values in a field, where something like “James” and “Jaems” would be easy to spot (see the sketch after this list).
- Inconsistent units: Using different units of measure for the same attribute. For example, if some measurements are recorded in feet and others in meters.
- Data relationship errors: Sometimes, the variables for one instance won’t make sense in relation to each other. For example, if someone was born in 2002, but their age was recorded as 2 in 2022, there is likely to be a data entry error. You would then need to determine if the person was more likely to be 20 or born in 2020. This might be possible based on the other attributes.
- Invalid data: These are values that make no sense for the given attribute. For example, birth_date: 2120 or city: 1985.
- Out of range values: Some attributes have a specific range of valid values. For example, the day of the month should never be less than 1 or greater than 31. Check to make sure that all values fall within the acceptable range.
- Inconsistent recording of data: For example, some values might be written in upper case and others in lower case (New York and new York), or attribute names might be written using different styles like camelCase and not_camel_case. To avoid this issue, it is often appropriate to convert all text to upper or lower case. It is also a good idea to check that all data matches the proper format for a column. For example, maybe all state names should be recorded as abbreviations.
- Extra white spaces: These should be removed as they will make the same value appear as different values if one of them has extra whitespace.
- Encoding issues: Make sure the string encoding is suitable for the data you have. Do all of the characters display correctly? If not, try UTF-8 encoding.
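A few of these checks and fixes can be sketched quickly in pandas; the column and values below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"city": [" New York", "new york", "NEW YORK ", "Boston", "Bostn"]})

# Look at the unique values to spot misspellings and inconsistent casing.
print(df["city"].unique())

# Strip extra whitespace and standardize the case.
df["city"] = df["city"].str.strip().str.lower()

# Check that values match an expected format, such as a list of valid names.
valid_cities = {"new york", "boston"}
print(df[~df["city"].isin(valid_cities)])  # "bostn" is left for manual correction
```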
Duplicated data: Duplicates should be removed as they will affect the results. This is easy with exact duplicates, but there can also be near duplicates. For example, the same person might appear in a database with two slightly different names: “James M Brown” and “James Matthew Brown”. These should be scrutinized to determine whether they are actually two different people or the same person.
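A minimal sketch of both cases, assuming a pandas DataFrame with a hypothetical name column; exact duplicates can be dropped automatically, while a string-similarity score only flags near duplicates for manual review:

```python
import difflib
import pandas as pd

df = pd.DataFrame({"name": ["James M Brown", "James Matthew Brown", "Ann Lee", "Ann Lee"]})

# Exact duplicates can simply be dropped.
df = df.drop_duplicates()

# Near duplicates need human judgment; a similarity ratio can flag candidates.
names = df["name"].tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        ratio = difflib.SequenceMatcher(None, names[i], names[j]).ratio()
        if ratio > 0.8:
            print(f"Possible duplicate: {names[i]!r} vs {names[j]!r} ({ratio:.2f})")
```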
Data types: Make sure all data is formatted correctly for analysis. Ensure that dates are converted to date_time values. Numbers saved as strings should be converted to numbers. Currencies should be converted to numerical form if they are being stored as strings.
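For example, with pandas the common type conversions might look like the sketch below (the column names and formats are just assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2022-01-15", "2022-02-03"],
    "quantity": ["3", "7"],
    "price": ["$1,200.50", "$89.99"],
})

# Convert date strings to datetime values.
df["order_date"] = pd.to_datetime(df["order_date"])

# Convert numbers stored as strings to numeric types.
df["quantity"] = pd.to_numeric(df["quantity"])

# Strip currency symbols and thousands separators before converting prices.
df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True))

print(df.dtypes)
```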
Outliers: Outliers can be caused by errors, or they might be genuine data that is just unusual. If a value is an error, it should obviously be corrected. If it is legitimate, the decision is more complicated. Outliers can skew the results and make models less effective. On the other hand, excluding them might make the results less reliable by removing an important data point. Experimenting with different options regarding outliers can be a good approach.
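One simple way to flag candidate outliers for review is the 1.5 × IQR rule; this is just one common option, not the only one:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

# Values outside the 1.5 * IQR fences are flagged for manual review.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # only 95 should be flagged here
```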
Irrelevant Data: More is not always better with data analysis. In fact, using data with many attributes can make it harder to get meaningful results. For this reason, it is best to remove variables that are not relevant to your current analysis. A step beyond this is to use Feature Selection or Feature Engineering to choose the best features for the analysis you are planning to do. (However, this is probably best done after the rest of the cleaning is completed.)
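As one possible illustration (not something prescribed here), scikit-learn's SelectKBest can score features against the target and keep only the strongest ones:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the two features with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
```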
Highly correlated data: If two attributes are highly correlated, including both can negatively affect your results. It is better to remove one of them. A library such as pandas-profiling in Python can help you quickly check for variables that are highly correlated.
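If you prefer to check by hand, pandas' built-in correlation matrix is enough for a quick look (the columns here are hypothetical, with height recorded twice in different units):

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 180, 165, 175],
    "height_in": [66.9, 70.9, 65.0, 68.9],
    "weight_kg": [68, 80, 60, 75],
})

# Pairwise correlations; off-diagonal values close to 1 or -1 suggest
# redundant attributes, one of which can usually be dropped.
print(df.corr())
```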
Other considerations:
Timeliness: Does the data you have still reflect the current situation? If it is not recent enough, the results might not be valid.
Relevancy: Is the data relevant to the problem you are trying to solve?
Completeness: Do you have all the data you need to answer your question?
Data distribution: How is your data distributed? The distribution can affect which analyses are possible and how accurate some machine learning models are. Some algorithms and statistical tests assume the data has a normal distribution.
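A quick way to check this assumption is a normality test such as Shapiro-Wilk from SciPy; the data below is synthetic, just to show the call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=200)  # synthetic, roughly normal data

# Shapiro-Wilk test: a small p-value suggests the data is not normally distributed.
statistic, p_value = stats.shapiro(data)
print(f"p-value: {p_value:.3f}")
```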
Imbalance: Are there any variables that are highly imbalanced in the data? Highly imbalanced data can affect the performance of some analysis or machine learning models. For example, a variable might have 90% or more of the data in one category and a tiny percentage in another. In this case, you would need to do additional pre-processing to address this before creating any models from the data.
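Checking for imbalance can be as simple as looking at category proportions; the target variable below is made up:

```python
import pandas as pd

labels = pd.Series(["no_churn"] * 92 + ["churn"] * 8)  # hypothetical target variable

# Proportion of each category; a heavy skew like this signals class imbalance
# that may need resampling or class weighting before modeling.
print(labels.value_counts(normalize=True))
```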
Metadata: Do you have enough information about your data to ensure reliability? Is there information about what each variable means, the units used, the valid values, and how the data was collected?