Combining data sets

Recently at work we ran into a situation where my coworker had extracted the names of all of the current members from a spreadsheet, and asigned access codes to them. Afterwards, we realized that we also needed the email addresses, so we had to make sure each person had both the right code and the right email address.

They sent me a copy of the full member list and the one with the codes, I extracted what I thought was all the current members from the full one. However, I ended up with 2 more names than the code list had. Clearly, there was a mis-match somewhere.

So I took the sheet with the code list, which had the names in column A and the codes in column B, and added my list of names and email addresses in columns F and G. Then I named the columns NAME1, CODES, NAME2, and EMAIL. After that I went to column D and entered the following formula:

=IF(NAME1=NAME2, “”, “check”)

In other words, if the name in the codes set is the same as the name in the email set for that row, the IF() statement returned a blank. However, if they weren’t the same the word “check” appeared.

Since both columns of names had been sorted alphabetically, this was essentially the same as just putting the two lists side by side and visually inspecting them to make sure they matched. However, because the IF() statement did the checking for me, all I had to do was scroll down the list until I saw a “check” in column C. In fact, since every name after the inserted name would be off by one row, all the following rows would be marked “check”. This made it really easy to spot where the problem was… I could either scroll down or, if that seemed too tedious, I could find cells with the value of “check”. This is one instance where it really mattered whether I was searching the formulas or the results, since every formula had “check” in it, but only the important results had it.

Once I’d located the first problem I could look at it and decide how to handle it. Since there were only two problems, this didn’t take long. In each case I just selected all of the cells in NAMES2 and EMAILS that were below the unwanted name, plus another blank pair of cells at the bottom, then copied it, went up one row to the problematic name, and pasted it.

This approach has a number of potential ways it could get messed up. Here are some of the most likely problems, what caused them, and how to fix them.

Problem: Last name and email appear twice.
Cause: You didn’t select the blank cells below the last entry.
Solution: Delete the 2nd entry.

Problem: Error message appears in the IF() column.
Cause: You used cut & paste, instead of copy & paste. (This is more likely to happen if you use relative references, e.g. a1=f1, instead of named columns.)
Solution: Copy a cell higher in the IF() column that isn’t showing an error, and paste it in the cell(s) with the error message.

Problem: There’s one more email address than there are names.
Cause: You only selected the name column when you were copying & pasting.
Solution: If it’s obvious where the extra email address is, you can remove it the same way you removed the name. If not, you’ll have to use ctrl-z or some other Undo, and do the whole copy/paste process again.

If you’re not comfortable using copy & paste in this way, here’s an alternate method: Simply change the unwanted name to “zzzz” or some such, and then re-sort the data. Unless you have some exceedingly unusual names the “zzzz” will sort to the bottom of the list, and you can just delete it. If you prefer this method it’s very important that you have a blank column between this data set and everything else — otherwise, you’ll be sorting everything at once, and they’ll never match up.

Once the numbers of names matched, everything should have been fine (and, in fact, was). However, to confirm this I used countif() to count the number of times “check” appeared in column D. Even if the numbers had matched from the beginning I would have done this, since a quick double-check never hurts and is generally good practice. I could have just manually scrolled down and visually looked for cells with “check”, but scrolling through 800+ rows can get tedious very quickly and there’s always the chance that I’d accidentally miss one. So I used countif() instead, and when the result was zero I knew for certain that everything was right.

After that it was just a matter of deleting all the unnecessary columns: the 1st blank column, the formulas, the 2nd blank column, and the 2nd column of names (i.e. C through F). This put the email addresses right up next to the codes and names, with no extraneous information to confuse matters.