Project: Finding files

Note: I planning to follow up last week’s post about the len() function by discussing another text function: find(). Things happened, however, including a friend in the hospital, so I’ll put that off until next week.

Instead I’m going to share a project that I recently did which relied heavily on the find() function, and somewhat on the len() function. Sorry about the order — think of it as an illustration-in-advance, and I’ll explain the details of how the function actually works next week.

Last week I had the need to take a good look at all the files in a website, and determine which ones were static and which ones were updated, and what schedule they were updated on.

Windows has a decent search function for fairly simple searches, but as far as I can tell (and this may just be my lack of skill) it completely breaks down when you want a really complex search. And then, of course, there’s the matter of getting the results of the search into a format that it can be given to someone else.

In this particular case, the website had over 3000 files. Some of the subdirectories needed to be checked, and others could be ignored. Certain types of files could be ignored. These are limitations that the Windows search function doesn’t handle well, if at all.

So of course I turn to my favorite tool for crunching large amounts of data: the spreadsheet. In the end, after going through this process I had dropped my list of 3000+ files to exactly 136. That’s few enough to go through by hand for which to add to my list. Although I could have just tried to generate the list off the top of my head, I almost certainly would have missed something, and even if I hadn’t I wouldn’t have been able to be absolutely sure I hadn’t. This way, I am.

I won’t say the process below is easy or simple because I know most people would look at it and laugh at me if I did, but it is straightforward and you can do it with just the skills I’ve discussed already.

Step 1: Get a list of all files. I use cmd.exe for this. Just run it, then enter cd.. until you’re at the root directory (usually c:), then enter “cd xxx” (no quotes) to navigate through the subdirectories to the main one for the website, where xxx is the name of a subdirectory of the current directory. Then enter “dir/s >files.txt” (again, no quotes). This will give you a list of all the files in the directory and all its subdirectories, and it will save it to a file in the current directory named “files.txt”. You can then open that file in Notepad, copy everything, and paste it into the spreadsheet. I recommend pasting it into column C.

Step 2: Add column headers. Insert a row at the top. Label column A “Order”. Label column B “Directory”. Label column C “Files”.

Step 3: Record the order. In cell A2, enter “1”. In cell A3, enter “=A2+1”. Copy cell A3. Paste it down enough rows to reach the bottom row of column C. Copy all of column A. Paste as values into column A.

Step 4: Populate directory names. The directory appears before the list of all the files in that directory. Since you’ll be sorting you need the directory on the same line as each file. Here’s how to do this.

Step 4.1: Enter “=B1” in cell B2.

Step 4.2: Copy B2 down the length of the Directory column.

Step 4.3: Sort by Files. Sort the data by the Files column.

Step 4.4: Search for “Directory of”. Ctrl-f, and you should find a section of the Files column that’s all “Directory of c:whateverdirectory”

Step 4.5: Copy all of the Directory entries to column B. Highlight all of the “Directory of” entries in the Files column. Copy them. Move left to the Directory column. Paste them.

Step 4.6: Sort by Order. This puts it back in the original order. Now all of the cells in the Directory column will have the same value as the one above them, except for the rows that have the directories in them.

Step 4.7: Convert the Directory column to values. Copy the column, paste it over itself as values. This removes the formulas.

Step 5: Remove all non-file lines. Only some of the lines contain files, and you no longer need the rest. You can remove them using the following steps.

Step 5.1 Sort by Files. Sort the data by the Files column.

Step 5.2 Find the start of the actual files. You can scroll down to do this, but an easier way is to search for “/”. In DOS, directories are separated by “”, but dates are separated by “/”. Since each file line has the date it was last modified, and no other line has dates, that’ll take you to the first file.

Step 5.3 Delete all non-files above the first file. Highlight everything starting at the row above the first file, and go up to (but not including) the column header.

Step 5.4 Delete all blank lines below the last file. Skip down to the last file (ctrl-down arrow, while in the Files column). Select all the rows below that (Ctrl-shift-down). Delete the rows (or, while holding shift, go left to column A, and then hit “delete”).

Step 6: Delete all the subdirectory “files”. DOS lists all the subdirectories in a directory as if they were files. However, it’s easy to filter them out and remove them.

Step 6.1: Label column D “temp”. Type “temp” in D1.

Step 6.2: Find all rows with “<DIR>”. In D2 enter =find(“<DIR>”,c2), and copy the formula down to the rest of the rows. Each cell in that column should now have a number or some sort of error.

Step 6.3: Sort by “temp”.

Step 6.4: Delete all the rows with a number in “temp”.

Now you have all the files, and nothing else. It’s time to filter out what you don’t want. Here are some examples.

Example 1: Filtering out file types. In my case I had a bunch of .psd files that shouldn’t have been there, and certainly wouldn’t need to be updated on a regular basis. So I used the same technique as step 6, except with the formula =find(“.psd”,c2). I could also do the same thing with .pdf or any other file type.

Example 2: Filtering out older files. The date at the start of the file line is the date the file was last modified. Since our longest tenure cycle is 4 years, I might decide that anything that hasn’t been modified in the last 5 years is definitely not something that gets regularly updated. The intuitive solution might be to just sort by Files, since the date is first, but 01/01/2013 will be sorted before 12/31/1990, so that won’t work. However, I can use =mid(c2,7,4) to extract the 4 digits of the year. Then I can sort by that, and delete any row that’s 2007 or below.

Example 3: Filtering out unwanted directories. If I sort by Directory then I can page down and glance through the directories to if there are any that I don’t need to consider. My test directory for experimenting with php? That can go – just highlight it and delete the rows. The directories for past annual meetings? Those aren’t getting updated anymore.

And so on.

Once I got my list of 136, I could still use one more little trick to save me time in checking the files. By extracting the file name with =right(c2,len(c2)-39), and the directory path with =right(b2,len(b2)-14), and concatenating them together with =concatenate(directorypath,””,filename), I could quickly and easily pull them up in a browser to make sure I knew exactly what they were.