How big is big?

Big data is a popular concept, and problem, in the information world.  It’s things like data about every item bought by every person in every store of a large chain, every day of the year for multiple years.  That’s the sort of data that a major company needs to be able to track and mine, and special programs are needed to process and visualize this sort of thing.  In one such example, a chromogram was used to visualize the Wikipedia activity for just one city over the course of four months.

But for the quantity of data most people and individual businesses need to track, spreadsheets are enough.  Which is not to say there aren’t limitations or special tricks, because there definitely are.

How much data a spreadsheet can handle depends on the program, the computer it’s running on, and what you’re trying to do with it.  For example, Excel 2007 running on my laptop can handle a 50,000-line spreadsheet with multiple countif() statements, which are fairly memory intensive.  But a 700-line spreadsheet with multiple countifs() statements and a few vlookup() statements slowed to a crawl every time it recalculated, partly because countifs() is much more memory intensive than countif().  That same spreadsheet worked just fine on my desktop.
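To see why these functions get expensive, it helps to look at what they do per cell.  This is an illustrative sketch in Python, not Excel’s actual engine: each countif()-style formula rescans its entire criteria range on every recalculation, and countifs() adds another full-column comparison for every extra criterion, so a column of such formulas costs roughly rows × range size × criteria comparisons each time the sheet recalculates.

```python
# Illustrative sketch (not Excel's real implementation) of why
# COUNTIF-family formulas are costly: each one scans its whole range.

def countif(column, criterion):
    """Single-criterion count, like =COUNTIF(range, criterion)."""
    return sum(1 for value in column if value == criterion)

def countifs(columns, criteria):
    """Multi-criterion count, like =COUNTIFS(r1, c1, r2, c2, ...).
    Every extra criteria range adds another comparison per row."""
    return sum(
        1
        for row in zip(*columns)
        if all(value == criterion for value, criterion in zip(row, criteria))
    )

# Toy data standing in for two spreadsheet columns.
region = ["east", "west", "east", "east", "west"]
product = ["ink", "ink", "pen", "ink", "pen"]

print(countif(region, "east"))                       # 3
print(countifs([region, product], ["east", "ink"]))  # 2
```

A whole column of these formulas repeats that scan once per row, which is why a few hundred countifs() cells can feel slower than tens of thousands of countif() cells.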

Google Docs slows down much earlier.  As far as I can tell it’s primarily geared towards small-scale data storage and basic calculations, not intense statistical analysis of large data sets.  As a general rule, if I have to do a lot of calculations, especially if I’ll be experimenting and building new formulas based on the results of the previous ones, I find it’s worth it to download the data, do the calculations in Excel or OpenOffice, and then upload the results.
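The same download-and-compute-locally workflow can be scripted.  Here is a minimal sketch using only Python’s standard library: read the exported CSV once, tally a column in a single pass, and write a summary file you can upload back.  The file names ("export.csv", "summary.csv") and the column name are hypothetical stand-ins for whatever your export actually contains.

```python
# Minimal sketch of the download -> compute locally -> upload workflow.
# One pass over the file replaces a whole column of COUNTIF formulas.
import csv
from collections import Counter

def summarize(in_path, out_path, column):
    """Count how often each value appears in `column` of a CSV export,
    then write the tallies to a new CSV ready to upload."""
    with open(in_path, newline="") as f:
        counts = Counter(row[column] for row in csv.DictReader(f))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column, "count"])
        for value, count in counts.most_common():
            writer.writerow([value, count])
    return counts
```

Because Counter builds its tallies in one pass, the cost grows with the number of rows rather than rows squared, which is the main reason the local round trip pays off.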

If you have to work with formulas and data sets that are so large they’re slowing down your computer, there are a number of tricks you can use to speed things up, ranging from changing your program’s default settings, to reducing the amount of data analyzed, to changing your approach to the calculations.  I’ll discuss these approaches in more detail starting next week.