Managing Terabytes - on line and off line

I have previously written about mining terabytes of data with a desktop machine, and about some of the strategies that work for me.

Managing that data - cataloging, indexing, retrieving - is also a big issue for me. Data mining not only works with big datasets, it generates big datasets too, not all of which are interim, ephemeral or easily reproduced. For example, I have a 120 GB hard drive dedicated to the Netflix data and various interim files, some of which have taken several processor-days to build: I do not wish to go through the necessary steps again.

And, apart from a few hundred CDs of archived data and programs and a large and growing collection of DVDs (data and images), I have various cloned or archived hard drives which can also be regarded as “off line” media.

So, what is the answer to keeping track of it all?

There are serious conceptual and analytics issues here. It’s not much use designing a system which can retrieve whatever you want as long as you can remember pretty much exactly what it is you want.

Sure, wildcard searches help (but are not explicitly catered for in common indexing/retrieval solutions, eg Google/Copernic). But I want to let my visual cortex, not just my memory, do the work: I want browsing and visualization and spatial layout.

Anyway, be that as it may, I have been on a hunt for decent cataloging and indexing software, and I thought I would share my findings with you.

First off, cataloging is not full content indexing

Different concepts. Cataloging means recording some basic information about a file - its name and size and date, obviously, its context/location and maybe some other stuff that can be grabbed from the file at little cost (eg EXIF data for some images).
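To make the distinction concrete, here is a minimal Python sketch of what a cataloging pass might look like: it walks a mounted disc or folder and records only cheap metadata, never the file contents. The function name `catalog_tree`, the record fields and the `volume_label` argument are my own assumptions for illustration, not the format of any particular cataloging product.

```python
import os
from datetime import datetime, timezone


def catalog_tree(root, volume_label):
    """Return one metadata record per file under root (contents are never read)."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            records.append({
                "volume": volume_label,                       # hypothetical label, e.g. written on the disc
                "path": os.path.relpath(full_path, root),     # location within the volume
                "name": name,
                "size": info.st_size,
                "modified": modified.isoformat(),
            })
    return records
```

Because only names, sizes and dates are touched, a pass like this stays fast even on slow optical media, which is exactly why cataloging is so much cheaper than full content indexing.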

Full content indexing is what you think Google and Copernic do, and they do, to a degree. How they do it is via IFilters (see my posts “Text Mining and Ifilters” and “Detecting Media Bias with Text Mining”).

And full indexing probably isn’t really full, and you might not need it to be anyway, but it would be nice to be able to control it, no?

Google themselves say, about Google Desktop, that the full content of long documents is not indexed. Hmm. What if the term I want is in the citations at the end of a long PDF? That is not uncommon.

Actually Google Desktop does not index PDFs at all.

And there is lots of stuff I don’t want indexed, sometimes. Like databases. Or correlation matrices. Or some XML files. Depending upon their location, perhaps.

Mostly, the current offerings don’t give you fine grained control over what is indexed.
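The kind of control I have in mind could be as simple as an ordered list of wildcard rules, where the first match decides whether a file's content gets indexed. The sketch below is only an illustration in Python: the `RULES` list, the folder names in it and the `should_index` function are all hypothetical, and no existing desktop search tool is implied to work this way.

```python
import fnmatch

# Hypothetical rules: (lowercase wildcard pattern, index the content?).
# The first pattern that matches wins, so specific rules go before general ones.
RULES = [
    ("*/correlations/*", False),   # correlation matrices, wherever that folder lives
    ("*.mdb", False),              # databases
    ("*/feeds/*.xml", False),      # XML files, but only in this location
    ("*.xml", True),               # other XML is fine
    ("*.pdf", True),
]


def should_index(path, rules=RULES, default=True):
    """Decide whether a file's content should be indexed, based on its path."""
    # Normalise separators and case so one rule works for D:\Foo and d:/foo alike.
    normalized = path.replace("\\", "/").lower()
    for pattern, decision in rules:
        if fnmatch.fnmatchcase(normalized, pattern):
            return decision
    return default
```

Note that in `fnmatch` a `*` also matches path separators, which is what lets a rule depend on a file's location as well as its extension.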

Deep cataloging and deep indexing are not often done

Deep, as in searching within compressed files (zip, rar etc).

I found out that Google Desktop does not index inside zips AT ALL, which is a pity, since many of my CDs are zipped backups. Nor does it index PDFs. There is an article by Neil Rubenking here, explaining its limitations.

So, if Google does not go there, does anything else? And how deep does it go into nested zips?

Volumes versus Drives

Copernic is one candidate, but I don’t think you can force it to index a CD, then put another one in, then another: it only seems to be aware of drives, not volumes.

Google Desktop search (limitations already noted, no indexing inside zips) has a “CD/DVD Spindle Search” plugin, which sounds OK if you can live with missing any data that happens to reside in a zip. The plugin is .NET with interop (so it will only work if you are running .NET) and works OK, but apparently has problems with very long filenames and of course is subject to the limitations of Google Desktop.

The answer?

I decided I could live with cataloging my offline media as long as I could search, browse and export the catalog. I put CDWinder through its paces, and it does indeed look inside zip and rar files, but unfortunately not inside nested zips (ie only to a depth of 1). This is presumably because it would have to actually unzip the outer zip to get at the nested zip (and so on), instead of just grabbing the header block. The same reasoning applies to its handling of thumbnails: it works fine for surface images, but not for those inside zips.
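To show why nesting is the expensive part, here is a rough Python sketch using the standard `zipfile` module. Listing the members of a zip only needs its directory block, but descending into a nested zip means decompressing that member first. This is my own illustration, not how CDWinder works internally; the function name `list_zip_members`, the `!/` path separator and the `max_depth` parameter are assumptions.

```python
import io
import zipfile


def list_zip_members(source, prefix="", max_depth=3):
    """Yield (path, uncompressed_size) for each file, descending into nested zips.

    source may be a filename or a seekable file-like object.
    max_depth is the number of nesting levels to descend (0 = top level only).
    """
    with zipfile.ZipFile(source) as archive:
        # infolist() comes from the zip's central directory: no decompression needed.
        for info in archive.infolist():
            if info.is_dir():
                continue
            member_path = prefix + info.filename
            yield member_path, info.file_size
            if max_depth > 0 and info.filename.lower().endswith(".zip"):
                # The costly step: the nested zip must be fully decompressed
                # (here into memory) before its own directory can be read.
                data = archive.read(info)
                try:
                    yield from list_zip_members(
                        io.BytesIO(data), member_path + "!/", max_depth - 1
                    )
                except zipfile.BadZipFile:
                    pass  # named .zip but not a valid archive: list it, do not descend
```

Holding each nested archive in memory keeps the sketch short, but for multi-gigabyte backups a real tool would want to spill to a temporary file instead.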

CDWinder also has some nifty batch features to help catalog your backlog of CDs/DVDs.

I also had a quick look at datacatch but did not proceed.

Conclusion

Managing terabytes is indeed an analytics issue. There is a lot of room for more advanced and statistically/data mining oriented approaches here.
