Wednesday, June 17, 2020

Data Cleansing Tips using Excel


When you import data from different sources, you would like to do some initial assessment of the data and ensure it is cleansed. This is a preparatory step before deciding further statistical analysis. Here, I describe 6 important functions that are quite useful in data cleansing and understanding. 

Empty data

How do you remove the empty rows?  These can be a nuisance when you would like to do further operations on the data set. When you plan to apply regression and other modelling techniques, the empty cells can play truant.  

#1 – Find the empty cells of data. In the Toolbar Ribbon, go to “Find & Select” / “Go to Special”. Choose “Blanks”. It will highlight the blank rows.  

#2 – Fill or delete the empty cells. You can fill them up with any text. Once such cells are highlight (as a result of the earlier step), type the text you want and press CTRL + ENTER. All the empty cells will be filled up with whatever you have typed. If not, you may have to do this for every such empty cell. IF you want to delete, just right click and choose the option of “delete rows.” 

Duplicate data

#1 – Check for duplicate rows - Some transactions could be repeated accidentally. We need to ensure that the data set contains no duplicates.  Check for existence of such rows of data by using conditional formatting. Under that option, choose “Duplicate values” under “Highlight cell rules”. Of course, you can select any desired formatting. 

#2 – Delete the duplicate rows. For this, choose the relevant option under “Data”. You need to select which column on which this removal of duplicates will apply. 

Cleaning data

It is likely that some special characters might creep into the data when you are getting it from multiple sources and that too after some conversion. The data might come from application resulting in such special characters. 

Simply use the CLEAN function. For example, if the text is in cell B4, type in B5, CLEAN(B4). 

Remove unwanted spaces

Even after doing all these, there could be multiple spaces in between texts or at the beginning or at end. These are not removed by CLEAN function. For this, Excel gives us a beautiful function called TRIM. What does this do? This removes all multiple spaces and just keeps one space in between the words. Moreover, if there any leading spaces at the start of the data, that goes out. The same treatment is applied for the trailing spaces after the last word. 

Bring order

Especially for text fields, it may not be in a proper case. Upper and lower cases may have been used randomly. We have some beautiful functions to change the case. UPPER converts the text to full upper case; LOWER() changes to full lower case and PROPER() capitalizes the first alphabet of each word. Excel diligently applies this to texts embedded in numbers too. For example, a text in a cell “567BudGet” becomes 567Budget after PROPER() is applied. Unfortunately, in excel, we don’t have straightforward option for changing case or converting to Sentence case (that capitalizes only the first letter of the first word). For this, however, macros are available. 

Conditional Formatting 

I am sure you have tried using conditional formatting feature. I am a fan of it and use it all the time to focus on key data/rows. Few features you can apply on the data set (after cleansing it) to get a glimpse of outliers / exceptions are: 

  • Highlight cell values (that are greater or less than the average). For this, you can enter a custom formula using AVERAGE. 
  • Highlight the entire row values (meeting certain conditions). For this select the entire rows before defining your formula. 

You may want to do some selective data updates or cleansing on such rows before attempting further modelling. 

Good luck.