Archive for March, 2007

Text Mining and Ifilters

This is a small blog about working with text, particularly text in proprietary formats. If you are not interested in what you can do with text in any form (imprisoned inside PDF, on the web, in blogs, in emails ...), then this blog is not for you.

I am not going to talk much about analyzing the text, searching for patterns or classifying text chunks: that is for another day. Weka has some tools for textual analysis, as does R.

My interest at this stage is in extracting the text from documents, and doing it fast. Obviously we have to extract the text before we can analyze it or classify the documents by content. We might want to do this to help direct search (for example, consider a corporate library of thousands of documents, or records of help desk queries, or textual comments from a survey, or customer comments in a database, or email messages, or even blog content).

Some document formats are harder than others to decode and get the text from. PDF springs to mind (try opening a PDF file in a text editor: you won’t see much recognizable text there), but most formats present difficulties.
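The text editor experiment can be roughly quantified. The following is a minimal Python sketch, not a PDF decoder: it measures how much of a byte buffer consists of runs of printable ASCII, much as the Unix `strings` tool would report. The sample data and the function names are made up for illustration. The compressed sample stands in for a PDF content stream, on the assumption (true of most PDF files, but not checked here against a real one) that the page text is stored zlib/Flate-compressed.

```python
import re
import zlib

# Runs of 4 or more printable ASCII bytes, similar to what `strings` reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_runs(data: bytes) -> list[str]:
    """Return every run of at least 4 printable ASCII characters."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def printable_fraction(data: bytes) -> float:
    """Fraction of the buffer covered by printable runs (0.0 for empty input)."""
    if not data:
        return 0.0
    readable = sum(len(run) for run in PRINTABLE_RUN.findall(data))
    return readable / len(data)


# Hypothetical sample text; the compressed copy stands in for a PDF stream.
plain = b"Quarterly help desk report: printer queue stalls after update."
compressed = zlib.compress(plain * 20)

if __name__ == "__main__":
    print("plain text :", printable_fraction(plain))
    print("compressed :", printable_fraction(compressed))
    print("runs found :", printable_runs(compressed))
```

A low fraction only says the bytes are not directly readable; it says nothing about how hard the format is to decode once you know its rules.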

Decoding Proprietary Formats

Now there are a few ways you can go if you want to decode some proprietary text format.

  1. The owner of the format may have published an API (Application Programming Interface) specification, or even just a file format description. If so, you can write some custom code to extract the text (data) you are after. This will usually be fairly fast, and is one approach we are taking with PDF files, decoding them directly from a Delphi app.
  2. The application that built the files may be a COM server, in which case you can use OLE. This will work for Microsoft Office applications (Word, Excel etc.) and some others. It is not likely to be very fast, but you may be able to extract more metadata (style information, for example) than with other approaches.
  3. Or you can use Ifilters: that is, write a program that accesses an existing (free or commercial) Ifilter for a particular format, with the Ifilter doing the work of text decoding.

About Ifilters

Ifilters are a Microsoft initiative, designed so that the Indexing Service (the engine that indexes “everything” on your hard disk so you can do “fast” searches) can understand proprietary formats. For example, Adobe publishes PDFFilt.dll for this purpose: this is supposed to plug in to the Indexing Service so that you can now find things in PDF files that you thought you could have found before.

For a bit of light relief you may care to read the Knowledgebase article “Using the ‘A word or phrase in the file’ search criterion may not work”, noting that

  1. “filter components may ignore some text” and
  2. “…ignores text that is contained in comments” and
  3. “this problem may occur even if you specified the file name or type…”.

No (printable) comment.

Now, you don’t need to use the Indexing Service to use Ifilters.

You can write a custom app to call those Ifilters and extract your text. So you could say that a proprietary service (Indexing) from Microsoft has forced third-party file format vendors to quasi open-source their file formats.

So this is what we have been doing for PDF files: using Ifilters from Delphi (although it could equally be C# or VB.NET), as well as going direct via the PDF API.
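For readers who have not seen the calling pattern, here is a rough sketch in Python rather than Delphi. It does not load a real Ifilter. `FakeFilter` is a hypothetical in-memory stand-in, and its `get_chunk` and `get_text` methods only imitate the shape of the COM methods `IFilter::GetChunk` and `IFilter::GetText`; the `Status` values stand in for the `FILTER_E_END_OF_CHUNKS` and `FILTER_E_NO_MORE_TEXT` return codes. What the sketch is meant to show is the nested loop: ask for the next chunk, then drain that chunk's text one buffer at a time.

```python
from enum import Enum, auto


class Status(Enum):
    OK = auto()
    END_OF_CHUNKS = auto()  # stands in for FILTER_E_END_OF_CHUNKS
    NO_MORE_TEXT = auto()   # stands in for FILTER_E_NO_MORE_TEXT


class FakeFilter:
    """Hypothetical stand-in for a loaded Ifilter (no COM involved)."""

    def __init__(self, chunks, buffer_size=8):
        self._chunks = list(chunks)
        self._buffer_size = buffer_size  # max characters returned per call
        self._index = -1
        self._offset = 0

    def get_chunk(self):
        """Advance to the next chunk, as IFilter::GetChunk would."""
        self._index += 1
        self._offset = 0
        if self._index >= len(self._chunks):
            return Status.END_OF_CHUNKS
        return Status.OK

    def get_text(self):
        """Return the next buffer of the current chunk's text."""
        text = self._chunks[self._index]
        if self._offset >= len(text):
            return Status.NO_MORE_TEXT, ""
        piece = text[self._offset:self._offset + self._buffer_size]
        self._offset += len(piece)
        return Status.OK, piece


def extract_text(filt):
    """Drain every chunk of a filter and join the chunks with newlines."""
    parts = []
    while filt.get_chunk() is Status.OK:
        pieces = []
        while True:
            status, piece = filt.get_text()
            if status is Status.NO_MORE_TEXT:
                break
            pieces.append(piece)
        parts.append("".join(pieces))
    return "\n".join(parts)


if __name__ == "__main__":
    demo = FakeFilter(["Title of the document", "Body text here."])
    print(extract_text(demo))
```

A real client would also have to locate and load the right filter DLL for the file type, call `IFilter::Init`, and deal with chunks that carry property values rather than text. All of that is left out here.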

We are also investigating how well the filters supplied with XP work in practice when called from a Win32 app:

• Mimefilt.dll: Filters Multipurpose Internet Mail Extension (MIME) files

• Nlhtml.dll: Filters HTML 3.0 or earlier files

• Offfilt.dll: Filters Microsoft Office files (Microsoft Word, Microsoft Excel, and Microsoft PowerPoint)

• Query.dll: Filters plain text files (default filter) and binary files (null filter)

The commercial XML Ifilter from QuiLogic is also of interest.

For more sources of Ifilters, including how to develop with them, look here:

  1. XML IFilter for indexing XML files
  2. Using IFilter in C# - The Code Project - C# Programming
  3. Channel9 Wiki: DesktopSearchIFilters
  4. Citeknet free Ifilters
  5. IFilter dot org - IFilters for Microsoft Search Technologies
  6. Document Management iFilters
  7. Andrew Cencini : Part 3: Testing Full-Text IFilters
  8. Sorting It All Out : I coffee, therefore IFilter (or, Language-specific processing #1)
  9. Add File Types To MSN Desktop Search InsideMicrosoft - part of the Blog News Channel
  10. Adobe PDF IFilter returns unexpected search results (5.0 on Windows XP) - Support Knowledgebase

Depending upon how our investigations go, we might put up another blog on the findings.

As always, you can contact me if you have an interest in the area (via the contact form; no email addresses are published here because of address-mining bots and spam), or leave a comment below.

Does SAS TextMiner use Ifilters?

SAS Text Miner (http://www.sas.com/technologies/analytics/datamining/textminer/brochure.pdf) claims “Universal data access”:

“With access to numerous forms of textual data, including Adobe Portable Document Format (PDF), extended ASCII Text, HTML and Microsoft Word, users can extract, transform, and load their textual data into a SAS data set for text mining.”

Sounds like Ifilter usage, maybe, but it rather depends on the platform (Ifilters are Win32 only).

What does Data Analytics Cost?

According to a reasonably recent Computerworld article, “HCF gets a helping hand from predictive analytics”, the cost of deploying an in-house, Clementine-based fraudulent claim detection system was around $500,000:

The initial deployment of Clementine cost HCF about $400,000. The required ongoing support, training and licensing costs amount to $100,000 a year, Shearman said.
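The arithmetic behind the “around $500,000” figure can be spelled out in a few lines of Python. The two constants come from the quote; the function name is made up, and the assumption that the ongoing cost starts in the first year and stays flat is mine, not the article's.

```python
# Figures quoted in the Computerworld article.
INITIAL_DEPLOYMENT = 400_000  # one-off deployment cost
ANNUAL_ONGOING = 100_000      # support, training and licensing, per year


def cumulative_cost(years: int) -> int:
    """Total spend after `years` full years, assuming a flat annual cost."""
    return INITIAL_DEPLOYMENT + ANNUAL_ONGOING * years


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(f"after {years} year(s): ${cumulative_cost(years):,}")
```

On those assumptions the first-year total is $500,000, and the recurring costs overtake the initial deployment cost once the system has been running for more than four years.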

That seems like a large amount for the software alone, particularly considering R and Weka are free. So, what is it that is special about Clementine?

And what SHOULD a company budget for their “predictive analytics”?

Well, that is not entirely a stupid question even taking into account the huge variability in requirements, data availability and quality and yada yada yada…

Maybe we can establish some credible lower bounds on “what data analytics cost”?

This reminds me of an article I read on clickz.com, “How Much Does a Web Page Cost?”, in which the figure for a “simple” page works out at around $3,000.

Now, granted it looks like he has padded it a bit here and there (but to my mind he did not allow enough programmer time), but the assumptions and process don’t look grossly wrong.

Granted too that top-down estimation of the (means of the) cost components followed by simple summation is naive and likely to result in an upward bias in the total. Still, he has a point when he says:

The next time you encounter someone with sticker shock when handing him an estimate, don’t dismiss him. Walk him through what needs to be done and see if he still thinks that page can be whipped out in a couple of hours. It’s an educational opportunity that shouldn’t be missed.

I’d really like some input on this, but I’ll stick my neck out and say that I think it is VERY unlikely that anything worthwhile in analytics could be done for under about $25,000 in consulting fees (exclusive of software costs, if any). A more likely range for an “average” project would be $50,000 to $75,000.

Anyone?
