Data Cleaning in Excel: IF statements and more
Jaimi Dowdell, IRE/NICAR
Often, government agencies try to make records look pretty by inserting headers, footers and other formatting to line up
data into neat columns and rows. While these formatting tricks may be visually pleasing, they dont always allow us to
easily handle data like we would if they were in standard columns and rows. A reporter encountered just this problem
when trying to handle a list of candidates running for office in Iowa. Take a look at Figure 1 below or go ahead and open
up the file Candidates.xlsx. Figure 2 shows the same data in a much easier to use format.
Figure 1
Figure 2
See the difference?
In this exercise, youll learn how to take the Candidates.xlsx file and make it useable ending up with what you see above.
But first, I want you to notice all of the problems with the format originally provided that could hinder things like simple
sorts, filters and more in Excel.
Here are two key issues:
1) There is a header “abovewhere the data rows begin. It includes the date, time and other information that isn’t
part of the actual data.
Figure 3
2) The information for each candidate isnt on one line. Rather, the address and other contact information are listed
on two rows. Also, the office is listed above a group of candidates. Any sorting at all in this current file would
separate some of the candidates information from their record.
Figure 4
Our goal is to get each candidate’s information on one line.
Step 1: Get rid of the header information and make it look like a normal spreadsheet.
Click on the View tab in Excel and look to the left-hand side of the page. The file is shown in Page layout view.
Change that to "normal." That will get rid of the headers and make it look more like a common spreadsheet. See Figure 5.
Figure 5
Step 2: Unmerge cells.
Highlight the entire sheet and right click somewhere on the page. In the menu that appears select Format cells. Find the
Alignment tab. In the middle of that window you should see the options for wrap text and merge text. They are both
filled in. Uncheck both of these as shown in Figure 6 and click OK.
Figure 6
Step 3: Bring the candidate addresses to one line.
Now, we need to bring the information for each candidate up on one line. The first thing we'll move up is the second part
of the address. Put your cursor in cell F3 and type in "Address2" to label our new column. Now, look at the pattern of the
data. The records always start with the office and party at the top followed by candidate names. There are blanks between
where the extra information about each candidate flows down. If you very literally look at it, you could say, "Whenever I
see a name in the A column, I know that I need to move other information up a level." For example, I'm going to use this
logic in an IF statement to tell Excel that if there is a value in cell A4 that I know the secondary address will need to move
up alongside that candidate's record in the F column.
Here's how I write that statement in cell F4:
=IF(A4>0,B5,"")
What this literally says is "If the contents in cell A4 are greater than nothing (i.e., there is a value) then put a copy of
whatever is in cell B5 in cell F4. If not, (if there isnt a name in A4), then leave the new Address2 column blank.
Copy that formula down throughout the document and you should see that the pattern works. Youll notice that we have
some additional information moved up a line that we dont need. Dont worry; well clean that up later.
Figure 7
Step 4: Move the email address up alongside the address
Label cell G3 as email.” You’ll move each candidate’s email next to the address we just moved using the same
pattern/method as we used in Step 3.
In cell G4 type this formula:
=IF(A4>0,D5,"")
Notice that its exactly the same as the previous formula only this time were selecting the email value for the second
argument. Copy the formula down throughout the document and you now have each candidates address and contact
information on one line.
Step 5: Cleaning up
Now that you have all of the candidate information on one line, we can clean things up a bit. First, highlight and copy the
entire worksheet. Click on the sheet 2 tab to get a fresh sheet. There, right click in cell A1 and select "paste special" then
"values." That'll get rid of the formulas, but you still have them in the first sheet just in case you want to double check your
work or if you see a problem later. It also gets rid of unnecessary formatting such as the shading found in the office and
party headings.
Step 6: The home stretch
Now that you've pasted the info without the formulas and formatting we can finish it up. Follow these steps to take you
(almost) home:
1) Delete the first blank row of the file in sheet 2 so that "U.S. Senate - Republican Party" is now in cell A1.
(Remember, to delete a row just right click on the row number at the left and select delete).
2) Select columns A through G and turn on the filter. The reason I want you to highlight the entire columns and not
just a chunk of data is that I want to make sure you don't miss out on any records below that you can't see. Look
at the filter options for column A. Uncheck the box next to Select all to uncheck everything then go all the way to
the bottom of the list and select just the blanks. You can do this because you know that if there isn't a candidate
name or party listed in column A, we now no longer need that row's information. With the filter on, delete all of the
rows that are blank in column A. Next, turn the filter off. That should leave you with 705 rows of information.
3) Turn the filters back on. Look under the column A filter again. This time uncheck everything except for
Candidates Name. This is our header information that is repeated over and over again. Delete the header
information for every row except the first one so that at the end youre left with labels for your columns. Well deal
with the party and office information in Step 7.
4) But first, notice that column C is completely blank. Double check this with a filter just to make sure there arent
some records you cant see. There arent, so you can delete that column. This should leave you with data in
columns A through F and 434 total rows of data.
Step 7: Add office and party information for each candidate
The last thing you need to do is add the party and office information for each candidate so we make sure we keep the
proper details with each candidate. Just like you did before, look at the patterns.
What jumped out at me is the fact that when a party/office is listed columns B, C and D are always blank. See Figure 8
where Ive highlighted rows with this information for you to better see the patterns.
Figure 8
Youll use this pattern to write one last IF statement to add the data for all of the candidates.
Label cell G2 as something like "Party_Office".
Our first office/party information, "U.S. Senate - Republican Party, is listed in cell A1. To make things a bit easier, copy
that and paste it into cell G3 which is the record for Sam Clovis.
Now is where the patterns come in. We know that for the next record, row 4, that if cell B4 is blank, then A4 is the party
information for that candidate and others until we reach another blank in column B. If B4 is filled in, we can assume that
weve moved on to the next office/party and that he candidate's party is the same party as the record above.
We write this statement like this in cell G4:
=IF(B4=0,A4,G3)
Figure 9
Now copy the formula down and double check to make sure the pattern worked throughout.
After you've looked at it all, you can copy and paste values special one last time. The last thing youll want to do is get
rid of the rows that just list the party/office. Turn on your filters and under Column B select the blanks. Delete those rows.
Delete the extra row at the top.
Youre done! You should end up with 161 candidates listed and one header row.