Working with big data sets: using different functions

By Walter Siegmund (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia CommonsAnother thing you can do to keep big data sets from bogging your calculations down is to change the functions you use to do the calculations.

To continue the boat analogy, changing the functions you use is like changing the materials you use to build the boat. An inflatable raft is going to be much lighter and easier to move than a wooden one, and either one is going to be much lighter than an ironclad ship. Similarly, there are some functions that have a higher calculation overhead than others, and some which can have dramatically different overhead costs depending on how they’re used.

In general, if you search for posts about optimizing efficiency in spreadsheets you’re going to get what seems like a lot of conflicting information. In one post it’ll tell you to avoid indirect() because it’s very high-cost, and in another post it’ll say to use indirect() to allow for a dynamically allocated countif() range. Fortunately, that sort of stuff is for very precise fine-tuning of efficiency, and I’ll leave that sort of thing for others to deal with. There’s a lot of options you can use to improve efficiency without getting into that level of precision.

For example, it’s universally agreed that having a countif() that’s larger than it needs to be is very wasteful. This makes sense — if you’ve only got 10 rows of data, but your countif() spans 100 rows, then that’s 90 rows it has to check for no good reason. But if you know that your data set will grow over time you can try to update your countif() every single time (very inefficient of your own time, and also prone to mistakes), you can force it to update itself by inserting the new data into the middle of the old data (this works if you never pass it on to anyone else and you never forget to do that yourself), you can make the range larger than you ever expect it to get (which as we just discussed is very inefficient), or you can use some other function that doesn’t require a pre-defined range.

In this sort of situation, I find that incremental formulas work better for me than =countif(). Instead of a b1 having the formula =countif(a1:a100,”x”) I’d do =if(a1=”x”,b2+1,b2). You do have to copy the formula down to the new rows every time you add more data, but if you don’t then you can see at a glance that you haven’t done so — it’s not hidden in a formula that you have to actively check. As a result, you always have exactly as many if() calculations as you need, and no more.

Instead of countifs() I’d use a pair of columns: one would concatenate all the pieces I wanted to check, and then an incremental if() to count the number of matches. For example, instead of having =countifs(a1:a100,”John”,b1:b100,”Doe”) in column C, I’d have =concatenate(a1,” “,b1) in c1 and =if(c1=”John Doe”,d2+1,d2) in d1.

You can do this trick with =sumif(), too. Instead of =sumif(A1:A100,B1:B100,”x”), the formula would be =if(B1=”x”,A1+C2,C2).

All of this isn’t to say that =if() solves all problems, or anything like that! Taking the example from last week, if you wanted to convert a numeral (1, 2, 3, etc.) to a word (one, two, three, etc.) there’s a much more efficient way than any of the ones I showed: =choose(). Instead of any of the extremely long formulas I explained last week you could just do =choose(A1,”one”,”two”,”three”,”four”,”five”,”six”,”seven”,”eight”,”nine”,”ten”)

Of course, =choose() only works if A1 is something the spreadsheet recognizes as a number. If you were trying to convert from the words to the numerals you couldn’t use =choose(), but you could use =vlookup(). You’d just set up a table with the words in one column and the numerals in the next column, and use a formula like =vlookup(A1,table_numbers,2,0).