<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.1" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Text Mining and Ifilters</title>
	<link>http://dsanalytics.com/dsblog/text-mining-and-ifilters_96</link>
	<description>Data Analytics- the art and science of analyzing data</description>
	<pubDate>Sun, 06 Jul 2008 04:24:36 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.1</generator>

	<item>
		<title>by: Neil</title>
		<link>http://dsanalytics.com/dsblog/text-mining-and-ifilters_96#comment-522</link>
		<pubDate>Thu, 24 Jan 2008 17:11:27 +0000</pubDate>
		<guid>http://dsanalytics.com/dsblog/text-mining-and-ifilters_96#comment-522</guid>
					<description>Just stumbled across your blog entry... V interesting.  I'm currently working on a project to get data out of PDFs - the (somewhat roundabout) method we've been persevering with is to control Acrobat from OLE via the JavaScript bridge.  We tell Acrobat to save the document in XML format and then we work on the text that we can pull out of the XML.

It's far from ideal, but it seems to work, and as we're dealing with highly structured reports it works quite well.  The XML process does lead to some peculiar conversions sometimes, though!  As a result it needs another layer of text processing / validation to make sure things get extracted properly.  OLE isn't super-fast, as you mentioned, but is sufficient for our needs.

Cheers,
Neil</description>
		<content:encoded><![CDATA[<p>Just stumbled across your blog entry&#8230; V interesting.  I&#8217;m currently working on a project to get data out of PDFs - the (somewhat roundabout) method we&#8217;ve been persevering with is to control Acrobat from OLE via the JavaScript bridge.  We tell Acrobat to save the document in XML format and then we work on the text that we can pull out of the XML.</p>
<p>It&#8217;s far from ideal, but it seems to work, and as we&#8217;re dealing with highly structured reports it works quite well.  The XML process does lead to some peculiar conversions sometimes, though!  As a result it needs another layer of text processing / validation to make sure things get extracted properly.  OLE isn&#8217;t super-fast, as you mentioned, but is sufficient for our needs.</p>
<p>Cheers,<br />
Neil
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: &#187; BlogMining Thoughts [ Data Sciences Analytics ]</title>
		<link>http://dsanalytics.com/dsblog/text-mining-and-ifilters_96#comment-35</link>
		<pubDate>Mon, 07 May 2007 07:04:38 +0000</pubDate>
		<guid>http://dsanalytics.com/dsblog/text-mining-and-ifilters_96#comment-35</guid>
					<description>[...] I have posted on somewhat similar topics before - specifically &#8220;Text Mining and Ifilters&#8221; and Mining the Spirit of the Times – The Zeitgeist and The Buzz, but I figure I had a little more to say specifically on the topic of blogmining. [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] I have posted on somewhat similar topics before - specifically &#8220;Text Mining and Ifilters&#8221; and Mining the Spirit of the Times – The Zeitgeist and The Buzz, but I figure I had a little more to say specifically on the topic of blogmining. [&#8230;]
</p>
]]></content:encoded>
				</item>
</channel>
</rss>
