Working with big data sets: streamlining formulas

By Caseman (eigen werk, Pentax K10D + Tamron 18-250 mm) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia CommonsIf you’re working with data sets that are big enough to make your computer slow down or risk locking up when processing them, one of the things you can do to help ameliorate the problem is streamlining your formulas.

Just like with streamlining a boat will let it move faster through the water because there’s less resistance, streamlining a formula will let it execute faster, because it has (on average) fewer steps.

As I’ve discussed before, there’s almost always more than one way to do things in a spreadsheet. Most of the times I would encourage you to just use whatever method(s) you’re most comfortable with, but when it comes to formulas to process large data sets it can make a big difference, since some functions are definitely more memory intensive than others, and some methods of using them are less efficient, and therefore more intensive than others.

For example, suppose you wanted to check whether the value in cell A1 is between 1 and 10, and return the appropriate word (one, two, etc.). Here’s one way you could do that:

Ascending order:

=if(A1=1,”one”, if(A1=2,”two”, if(A1=3,”three”, if(A1=4,”four”, if(A1=5,”five”, if(A1=6,”six”, if(A1=7,”seven”, if(A1=8,”eight”, if(A1=9,”nine”, if(A1=10,”ten”, “Not between 1 and 10”))))))))))

This is certainly the most straightforward way, but is it the most efficient?

Calculating efficiency
Before you can evaluate whether there’s a more efficient way to write this formula, you need a way to figure out efficiency. Fortunately, when we’re working with all the same function, that’s pretty easy — the more functions a formula has to execute on average, the less efficient it is. When a spreadsheet executes an IF() statement, it does the TRUE result, or it does the FALSE result, but not both. So in this example, if A1 was 1, then the very first IF() would return true, the formula would return “one”, and the execution would be done have only 1 IF() check. However, if A1 was -3, then it would have to check ten IF() statements before it was done.

The average number of executions is equal to the sum of the required checks for each result times the odds of that result. So if the odds of getting a number that wasn’t between 1 and 10 were exactly equal to the odds of getting one within the 1-10 range, then the average number of IF() checks would be 1*1/11+2*1/11+3*1/11+4*1/11+5*1/11+6*1/11+7*1/11+8*1/11+9*1/11+10*1/11+10*1/11 = (1+2+3+4+5+6+7+8+9+10+10)/11=5.9. However, if A1 was any integer between 1 and 100, then the average would be (1+2+3+4+5+6+7+8+9+10+10*90)/100 = 955/100 = 9.55. If it were between 1 and 1000, the average would be 9.955.

As you can see, the data that you’re evaluating can have a big effect on the efficiency of the function.

Alternative structures
Fixed options
If you knew for a fact that A1 would be an integer in the 1-10 range, you could improve the efficiency of the original formula slightly by dropping the last if(), for an average of 5.4.

=if(A1=1,”one”, if(A1=2,”two”, if(A1=3,”three”, if(A1=4,”four”, if(A1=5,”five”, if(A1=6,”six”, if(A1=7,”seven”, if(A1=8,”eight”, if(A1=9,”nine”, “ten”)))))))))

If you do something like this, however, you want to be very certain that you’ve covered all of the possible values. It would, for example, work well if you’re trying to convert letter grades to numbers, in order to calculate GPA. Whether you’re restricting it to plain letters (A, B, C, D, F) or allow +/-, you still have a finite number and there’s no possibility of something outside of those values. You could even use it to covert from numbers to letters grades, because there’s a finite number of ranges even though there’s an arguably infinite number of possible numbers in each range. But if there’s any chance whatsoever that you’ve got something outside of the expected possibilities — say, for example, the data was entered by hand with no limits on the values that could be entered, or perhaps the data was scanned in and interpretted by OCR — you absolutely want to keep a last if() which covers such potential errors, even though it decreases efficiency a tiny bit.

Likelihood order
In the 1-100 scenario, you can drastically reduce the average number of executions with one simple change: move the most likely option to the beginning. For example:

=if(A1>10,”Not between 1 and 10″,if(A1=1,”one”, if(A1=2,”two”, if(A1=3,”three”, if(A1=4,”four”, if(A1=5,”five”, if(A1=6,”six”, if(A1=7,”seven”, if(A1=8,”eight”, if(A1=9,”nine”, “ten”))))))))))

This would make the average (1*90+2+3+4+5+6+7+8+9+10+10)/100 = 154/100 = 1.54 — much better! In the 1-1000 case it would be even better… (1*990+2+3…+10+10)/1000 = 1054/1000 = 1.054. This depends, however, on knowing that one choice is significantly more likely than the others.

Divide & conquer
Lastly, there’s an approach that you can use whenever you have ordered data, which is to say data which has a natural order, wherein one thing can be recognized as being less than another. It probably has many names, but I’m going to call it the divide and conquer strategy. Instead of testing them in order (one and then two and then three…), you could use something like:

=if(A1>5, if(A1>7, if(A1>8,if(A1=9,”nine”,”ten”),”eight”), if(A1=6,”six”,”seven”)), if(A1>2,if(A1>3, if(A1=4,”four”,”five”),”three”), if(A1=1,”one”,”two”)))

If you’re having trouble following that logic, here’s a quick decision tree to help illustrate it. Start at the top and follow the branches down, and the number of blue nodes you have to pass through to get to a particular green one is the number of checks the formula has to go through to get that answer.
10-choice DecisionTree

(And yes, I built that in a spreadsheet using font formatting, borders, and background coloring, since I didn’t like the results of trying to use any of the “smart art” options.)

Looked at this way, the number of checks you’d need to do in order to determine that the result is “one” is 3 (>5, >2, and =1), which is notably worse than the 1 you’d need in the in-order approach. However, the number of steps you’d need to find “ten” is only 4 (>5, >7, >8, and =9). The average would be (3+3+3+4+4+3+3+3+4+4)/10 = 34/10 = 3.4

That might not seem like a huge improvement over 5.4, but the larger the data set gets, the more of an improvement you get. If you had 16 items, the ordered approach would have an average efficiency of 8.5, but this approach would only be 4. At 32 items it would be 16.5 and 5, and at 64 it would, hypothetically, be 32.5 and 6.

I say hypothetically because pretty much any spreadsheet program is going to have a limit on how deeply you can nest if() statements. If you actually have to have an if statement that’s larger that that limit then dividing it up is the only way to do it.