Archive for May, 2007

Managing Terabytes - on line and off line

I have previously written about mining terabytes of data with a desktop machine.. some of the strategies that work for me.

Managing that data - cataloging, indexing, retrieving - is also a big issue for me. Data mining not only works with big datasets, it generates big datasets too.. not all of which are interim or ephemeral or easily reproduced. For example, I have a 120 GB hard drive dedicated to the Netflix data and various interim files, some of which have taken several processor-days to build: I do not wish to go through the necessary steps again.

And, apart from a few hundred CDs of archived data and programs and a large and growing collection of DVDs (data and images), I have various cloned or archived hard drives which can also be regarded as “off line” media.

So, what is the answer to keeping track of it all?

There are serious conceptual and analytics issues here. It’s not much use designing a system which can retrieve whatever you want as long as you can remember pretty much exactly what it is you want.

Sure, wildcard searches help (but are not explicitly catered for in common indexing/retrieval solutions - eg Google/Copernic). But I want to let my visual cortex, not just my memory, do the work .. I want browsing and visualization and spatial layout.

Anyway, be that as it may, I have been on a hunt for decent cataloging and indexing software, and I thought I would share my findings with you.

First off, cataloging is not full content indexing

Different concepts. Cataloging means recording some basic information about a file - its name and size and date, obviously, its context/location and maybe some other stuff that can be grabbed from the file at little cost (eg EXIF data for some images).
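As a rough illustration of cataloging in this sense, here is a minimal Python sketch that walks a directory tree and records only the cheap metadata (name, location, size, date) without opening any file. The function name `catalog_tree` and the record field names are my own inventions for the example, not taken from any of the products discussed here.

```python
import os
import time


def catalog_tree(root):
    """Yield one catalog record (a dict) per file under root.

    Only file-system metadata is collected; file contents are never read,
    which is what separates cataloging from full content indexing.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)  # metadata lookup only
            yield {
                "name": name,
                # location relative to the cataloged root (hypothetical field name)
                "location": os.path.relpath(dirpath, root),
                "size_bytes": info.st_size,
                "modified": time.strftime(
                    "%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime)
                ),
            }
```

Calling `list(catalog_tree("D:/"))` with a CD in the drive would give a list of records that could be written out as CSV; anything richer (EXIF data, say) would need a per-file-type reader on top of this.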

Full content indexing is what you think Google and Copernic do, and they do, to a degree. How they do it is via Ifilters (see my posts Text Mining and Ifilters and Detecting Media Bias with Text Mining).

And full indexing probably isn’t really full, and you might not need it anyway, but it would be nice to be able to control it, no?

Google themselves say, about Google desktop, that the full content of long documents is not indexed. Hmm. What if the term I want is in the citations at the end of a long PDF.. not uncommon.

Actually Google desktop does not index PDFs at all.

And there is lots of stuff I don’t want indexed, sometimes. Like databases. Or correlation matrices. Or some XML files. Depending upon their location, perhaps.

Mostly, the current offerings don’t give you fine grained control over what is indexed.

Deep cataloging and deep indexing are not often done

Deep, as in searching within compressed files (zip, rar etc).

I found out that Google Desktop does not index inside zips AT ALL, which is a pity, since many of my CDs are zipped backups. Nor, as noted above, does it index PDFs. There is an article by Neil Rubenking here, explaining its limitations.

So, if Google does not go there, does anything else? And how deep does it go into nested zips?

Volumes versus Drives

Copernic .. but I don’t think you can force it to index a CD, then put another one in, then another .. it only seems to be aware of drives, not volumes.

Google desktop search (limitations already noted, no indexing inside zips) has a “CD/DVD Spindle Search” plugin which sounds OK if you can live with missing any data that happens to reside in a zip. The plugin is .NET with interop (so will not work if you are not running .NET) and works OK, but apparently has problems with very long filenames and of course is subject to the limitations of Google Desktop.

The answer?

I decided I could live with cataloging my offline media as long as I could search and browse and export the catalog. I put CDWinder through its paces, and it does indeed look inside zip and rar files but unfortunately not nested zips (ie only a depth of 1).. this is presumably because to do so it would have to actually unzip the zip to get at the nested zip (and so on), instead of just grabbing the header block. The same reasoning applies to its handling of thumbnails .. works fine for surface images, but not those inside zips.
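The depth-1 limitation can be illustrated with a small Python sketch using the standard `zipfile` module. This is only my guess at the general mechanism, not a description of how CDWinder actually works: listing the entries of a zip needs only the archive's directory, whereas listing a nested zip means decompressing the inner archive first. The function name `list_zip_entries` and its `max_depth` parameter are hypothetical.

```python
import io
import zipfile


def list_zip_entries(source, max_depth=1, _depth=1, _prefix=""):
    """Return entry names in a zip, descending into nested zips up to max_depth.

    source may be a file path or a seekable file-like object.
    """
    names = []
    with zipfile.ZipFile(source) as archive:
        # infolist() reads the archive's directory; nothing is decompressed here
        for info in archive.infolist():
            full_name = _prefix + info.filename
            names.append(full_name)
            is_nested_zip = info.filename.lower().endswith(".zip")
            if is_nested_zip and _depth < max_depth:
                # The inner archive's directory is only reachable after
                # decompressing the inner zip, which is the expensive step.
                inner_bytes = io.BytesIO(archive.read(info))
                names.extend(
                    list_zip_entries(
                        inner_bytes, max_depth, _depth + 1, full_name + "/"
                    )
                )
    return names
```

With `max_depth=1` the nested archive shows up only as a single entry, which matches the behaviour described above; raising `max_depth` trades cataloging time and memory for a deeper listing.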

CDWinder also has some nifty batch features to help catalog your backlog of CDs/DVDs.

I also had a quick look at datacatch but did not proceed.

Conclusion

Managing terabytes is indeed an analytics issue. There is a lot of room for more advanced and statistically/data mining oriented approaches here.

Taming the Google Monster–we have PageRank 5! Cool, I guess.

After more than 6 months of PR 0, it is nice to get a PR of 5, all of a sudden.

But why? And why such a big jump?

As a statistician/analyst/modeller, I am somewhat interested in the behaviour of the Google Page Rank system.

Not reverse engineering the black box in order to manipulate it which, besides being arguably against the Terms of Service, is probably a futile endeavor (think, active wealthy antagonist, in whose interests it is to change the rules to frustrate the guys in the black hats).

Just a starting model that is reasonably consistent with my, and others’, experience.

What we think we know - about PR and incoming links

OK, we know that – ostensibly – Page Rank (and remember I am not talking here about Search Engine Result Position, just PR) is a function of the number and PR of incoming links. Some complex function, and if you read the docs it sounds like some form of SVD of the entire web links matrix, which would take some time to compute : if so, it would explain why lots of page ranks are seriously out of date and as Google suggests “for entertainment value”.

It would also explain why my PR jumped from 0 straight to 5, if the recomputation is done every three months or so. Which actually sounds sorta dumb – surely some approximate model could be fitted.

But irrespective of the algorithm, PR is held to be some function of the number of incoming links and their “popularity” as measured by PR.

But the anecdotal evidence does not support this all that well. Get onto any SEO forum eg WebMasterWorld and you will find plenty of stories of sudden and unexplained page rank jumps (often negative), and PR being stubbornly resistant to any increase in the quantity and quality of incoming links. Stories too where an examination of the links of two sites does not bear out the basic proposition.

In my case, I had a PR of 0. There are few incoming links; those that I have were mostly PR 0 (one PR 4), and some of them are now PR 4. This runs counter to the notion that you have to get links from sites with higher PR than yours .. I have a 5, some now have a 4. Did they get that from me, or I from them?

And no, I am not listed on dmoz (I tried) or in any other catalog of note. You can look at my inbound links if you want with (yahoo) linkdomain:www.dsanalytics.com -site:dsanalytics.com : not very impressive.

Is PR 5, like, truly WOW!?
. . . so I must have done something right, right?

A page rank of 5 is, as I understand it, “fairly high”. It is supposedly distributed according to some sort of power law / is on a log (base 10) scale .. so, roughly speaking I should have about 10 times as many incoming links as those with PR 4.

I rather doubt it.

So what’s the theory, and where’s the evidence?

Ian Rogers’ paper The Google PageRank Algorithm and How It Works supports the notion of a logarithmic scale and also offers some insights (and code to approximate the PR eigenanalysis) about how the internal structure of the site might affect PR of each of the pages.

Well, I have not tried it (looks a bit too simple to be true), and I have done absolutely nothing to the site structure that might make it look better to Google. Nor have I signed on to the Google Analytics project to see how Googlebot sees my site.

There is another description of the PR algorithm at Network Workbench - effectively a power method of extracting eigenvectors, which might be of interest.
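For readers who have not met the power method, here is a minimal Python sketch of it applied to a tiny link graph. It follows the published PageRank formulation with a damping factor of 0.85; whatever Google actually runs is unknown to me, and the function name `pagerank`, the handling of pages with no outgoing links and the stopping rule are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess
    for _ in range(max_iter):
        # every page gets the "random jump" share first
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # assumption: a page with no outgoing links spreads its rank evenly
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once the ranks have settled
            break
    return rank
```

For example, `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})` ranks `c` highest, since two pages link to it, and `b` lowest, since nothing links to it. The values are raw probabilities; the 0 to 10 toolbar figure would be some further (supposedly logarithmic) rescaling of them.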

There is a bit about the Theory of Networks at Linked: The New Science of Networks by Albert-Laszlo Barabasi which supports the log base 10 notion


On the web, the top websites have ten times more links than the next set, 100 times more links than the third set, and 1,000 times more links than the fourth set. Google’s Page Ranking technology is based on log distribution. A website with Google PageRank 5 (PR5) is ten times bigger than a website with PR4, 100x a PR3, 1,000X a PR2, and 10,000X a PR1 website.

and has an interesting comment on tipping points and phase transitions.

ScienceBits has done some actual analysis of 120 websites and plotted PR as a function of log(Number of Incoming Links) … which looks linear enough ( if noisy ), so there is some support for the notion that PR is approximately linear in the log of the number of incoming links.

A compatriot of mine, Dr Alex, a pretty smart fella who wrote the best Javascript editor around (imho) and who also happens to be an International Chess Master with a PhD in Computer and Information Science and an Internet entrepreneur – a polymath from my ancestral home of Adelaide, yet! – has his C Point site “rated 6 out of 10 by Google (aka PR=6) it is in the top 68,000 sites world-wide by traffic - attracting 54,000 visitors per month”.

So, that’s another data point .. PR6 equates to some small top-end fraction of the world’s sites, and so PR really maybe is logarithmic, and so my humble PR5 should mean I am in the top 680,000 sites worldwide and get 5,400 visitors per month. Well, I have no reason to doubt his data but I am still iffy about the extrapolation.
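The extrapolation is just a factor of ten per PR step, which a few lines of Python make explicit. This is a sketch of the arithmetic only, resting entirely on the assumed log base 10 scale; the function name `scale_by_pr_steps` and its `shrinks_with_pr` flag are made up for the example.

```python
def scale_by_pr_steps(value, from_pr, to_pr, base=10, shrinks_with_pr=False):
    """Scale a quantity from one PR level to another on an assumed log scale.

    Quantities such as visitors grow by a factor of base per PR step.
    Quantities such as "top N sites" cutoffs shrink instead
    (pass shrinks_with_pr=True).
    """
    steps = to_pr - from_pr
    if shrinks_with_pr:
        steps = -steps
    return value * base ** steps


# Dr Alex's PR 6 figures, taken from the quote above
visitors_at_pr5 = scale_by_pr_steps(54000, from_pr=6, to_pr=5)
cutoff_at_pr5 = scale_by_pr_steps(68000, from_pr=6, to_pr=5, shrinks_with_pr=True)
```

These give the 5,400 visitors and top-680,000 figures; the code adds nothing to their credibility, it just shows that everything hangs on the choice of `base`.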

So, there is plenty of speculation out there, and some softish data.

Convinced?

I am still not convinced that the theory of PR and the limited data stack up.

Pleased that Google gave me a PR 5 ( I guess: I rather doubt that it will translate into more visitors ).

But not convinced that PR does not have a very large “random” component.

And perhaps a little suspicious that the Google PR algorithm, which supposedly used to be just a static measure of link popularity (computed at crawl time) is evolving towards something that includes “content quality” and “freshness” so we no longer have such a clear cut distinction between relevance (to a query) and static measures of “popularity”. Or maybe the static measures have become niche-ified, and PR is PR relative to a niche.

Who knows? These evolving antagonist problems are intrinsically very interesting, though.

And I guess the web dataset is the biggest game in town, bigger than Netflix – sorta hard to resist. I’d quite like to build my own semi-supervised crawler/spider, but perhaps not tonight Josephine.

In conclusion: Google tamed?

So I guess maybe I don’t have Google “tamed” .. just caught it on one of its good days, more likely.
