Removing duplicates from a list

Sometimes you have a long list of repeated items, and you just want one of each. Or possibly you have a long list of items that are supposed to be unique, and you want to make sure they are. Here’s a technique I use often to do exactly that, and like many others it depends on a simple IF statement:

=IF(b2=b1,”delete”,””)

What this does is check to see whether the value in cell B2 is the same as the value in the cell right above it, B1. If it is, then it says “delete”, and if it doesn’t it stays blank. You can put anything you want in as long as it makes sense to you. For example, it could be =IF(b2=b1,”duplicate”,”first”), so the first time a value occurs in the list it says “first”, and all other times it says “duplicate”.

Here’s the actual procedure for using this statement, assuming the data you want to check is in column B, and the formula is in column A.

  1. Add headers. If your columns don’t already have headers, just insert a row at the top, and add column headers. For the purposes of this explanation I’ll assume that they’re “Duplicates” and “Data” respectively.
  2. Sort both columns by the data column. Be sure to specify that you have headers when sorting, or else your headers will vanish into the data somewhere.
  3. Enter the IF formula. Enter =IF(b2=b1,”delete”,””) in cell A2.
    Note that you can’t use column names for this since there’s no way to refer to the row above the current one that way. You could, in theory, enter the formula =IF(Data=b1,”delete”,””), but that makes it harder to read so I don’t recommend it.
  4. Copy the IF formula down the rest of the column. I like to do this by selecting a cell in Data column, hitting ctrl-down arrow to zip to the last entry, moving back over to the formula column, and then selecting all the blank cells above that. If you also select the cell with the original header, Open Office will give you a warning asking if you want to overwrite the data. If you also select the header, which is in row 1, you’ll get a missing reference error. This is because you’re using relative references, and so the formula in A1 would be =IF(b1=b0,”delete”,””), except b0 isn’t a valid reference.
  5. Copy the entire formula column and paste it as values over itself. This means that even if you re-sort the data, the “delete” cells will still be marked “delete”, and the ones you want to keep would still be blank.
  6. Sort both columns by the formula column. This is why I have a blank for the values I want to keep, instead of saying “keep”. Otherwise, the default is to sort alphabetically in ascending order, which would put the keeps below the deletes, which makes the next step slightly harder. You could also fix this by specifying that you want to sort in descending order, but I never remember to do that until I’ve already done it the wrong way once.
  7. Delete all the rows marked “delete”. If you’ve sorted it so that the deleted values are after the ones you’re keeping, this is very easy. I like to select the formulas column and search in the selected area for “delete”, then just select everything below that by hitting shift-ctrl-down arrow and delete the rows.

And that’s it. I now have a complete list of every unique value from the original list, with no duplicates.