Archive for March, 2007

The state of the art in Keyphrase Extraction?

I’d like some input from anyone who has an interest in “keyphrase extraction”, which, amongst its many applications, includes the automated or semi-automated generation of keyphrases for web pages. (I am not particularly interested at the moment in “keyphrase assignment”, i.e. deciding which of a fixed set of keyphrases should be attached to a given document.)

KEA - Keyphrase Extraction Algorithm

The first app that springs to mind is KEA, but after experimenting with it I thought it really only appropriate for the task of keyphrase assignment (keyphrase indexing).

It does require quite a lot of hand work in preparing a training set, with keyphrases being assigned to those documents by “experts”. And I don’t find the assignment algorithm particularly compelling either: essentially it works by applying the learned model (which is Naive Bayes, with features reflecting the position of a term in the document and its relative frequency).
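As a rough illustration of the two kinds of feature mentioned above (this is not KEA's code, and the exact definitions here are assumptions), a candidate phrase can be scored by a TF-IDF style relative frequency and by how far into the document it first appears. A Naive Bayes model would then be trained on such values for the expert-assigned phrases versus all other candidates.

```python
import math
import re


def candidate_features(document, phrase, corpus):
    """Return (tfidf, first_occurrence) for one candidate phrase.

    Hypothetical helper: the names and the exact formulas are illustrative
    assumptions, not KEA's implementation.
    """
    words = re.findall(r"[a-z0-9]+", document.lower())
    target = phrase.lower().split()
    n = len(target)
    # Word offsets at which the phrase occurs in this document.
    positions = [i for i in range(len(words) - n + 1) if words[i:i + n] == target]
    if not positions:
        return 0.0, 1.0
    # Relative frequency of the phrase within this document.
    tf = len(positions) / len(words)
    # Crude document frequency: substring match over the reference corpus.
    docs_with = sum(1 for d in corpus if phrase.lower() in d.lower())
    # Smoothed inverse document frequency.
    idf = math.log((1 + len(corpus)) / (1 + docs_with))
    # Position of the first occurrence, as a fraction of document length.
    first = positions[0] / len(words)
    return tf * idf, first
```

Phrases that are frequent in one document, rare elsewhere, and introduced early would score as likely keyphrases under such a model.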

(On the plus side, the code is in Java and easy enough to use, and you do not need the full WEKA framework.)

Extractor

Then there is Extractor, a commercial offering from Peter Turney, although you can use it online at ExtractorLive. It apparently requires no training (at least not by the end user), and I seem to recall that there is a genetic algorithm at work.

I was not greatly impressed by that. When I fed it Amara’s Wavelet Page, I got back (as keyphrases) only

wavelet — signals — wavelet transform — wavelet digest — mathematics — representing — bibliography

which does not seem like a lot to me from such a content rich page.

SEO keyword analysis at seokeywordanalysis.com

By contrast, Andy Hoskinson’s Keyword Analysis Tool, applied to the same URL (Amara’s page), gave me a much richer set:

Keyphrase Frequency
wavelet analysis 10
Home Page 5
Wavelet Digest 4
Wavelet Transform 4
frequency analysis 3
Classic version 3
HTML version 3
Executive version 3
subscribe wd 3
unsubscribe wd 3
Wavelet IDR Center 3
wavelet methods 3
Practical Guide 3
Signal Processing 3
Wavelet compression 3
Rehmi Post 3
Wavelet Transform Theory 3
Wavelet Patents 2
Sound Fun 2
NuHAG Gabor Server 2
Discovering Wavelets 2
using wavelets 2
Fourier analysis 2
partial differential equations 2
full digest 2
mail address 2
relevant command 2
body text 2
Fourier transforms 2
Fast Fourier Transform 2
related code links 2
Gabor analysis 2
Edward Aboufadel 2
Grand Valley State University 2
Daubechies wavelets 2
wavelet information 2
produced popular science article 2
Little Wave 2
Big Future 2
Really Friendly Guide 2
Wim Sweldens 2
wavelets site 2
Wavelet Wading Pool 2
Alex Nicolaou 2
Jacques Lewalle 2
wavelet transforms 2
LaTeX2HTML version 2
3D Progressive Transmission Using Wavelets 2
Benj Lipchak 2
computer graphics 2
web pages 2
Wavelet Packet Transform 2
Gentle Introduction 2
Andrey Kiselev 2
Wavelets versus Fourier Course 2
short presentation 2
Wavelets Transformationskodierung 2
Salzburg University 2
software etc 2
Wavelet Papers 2
et al 2
Amara Graps 2
Wavelet Page 1
Wavelet Overview 1
Wavelet Blogsphere 1
Fourier Trivia 1
IDR Center 1
Beginners Bibliography 1
WWW Introductions 1
WWW Sites 1
page contains 1
idea behind wavelets 1
processing data 1
representing data 1
Joseph Fourier 1
Wavelet algorithms process data 1
notice small 1
wavelets interesting 1
approximating sharp 1
approximating functions 1
approximating data 1
sharp discontinuities 1
wavelet prototype 1
prototype wavelet 1
frequency version 1
using coefficients 1
wavelet functions 1
performed using 1
corresponding wavelet coefficients 1
best wavelets 1
coefficients below 1
makes wavelets 1
data compression 1
band coding 1
Wavelets Paper 1
Computational Sciences 1
Online version 1
avelet Digest 1
free newsletter 1
once every 1
concerning wavelets 1
Wavelet Digest comes 1
concise edition 1
HTML mail 1
full articles 1
one edition 1
submission process 1
new Wavelet Digest 1
particular questions concerning 1
back issues 1
avelet Blogsphere 1
new blog 1
find wavelet idea 1
link below 1
Google blog search page 1
Wavelet Blog search 1
avelet Patents 1
via tutorial articles 1
special issues 1
method back 1
Fourier series 1
Fourier domains 1
corresponding domain 1
every domain 1
Joseph Fourier history 1
Fourier Theory 1
Fast Fourier Transform tutorial 1
Harmonic Analysis 1
new wavelet 1
national center 1
download publications 1
Gabor Server 1
Numerical Harmonic Analysis Group 1
ideas related 1
Web site 1
image processing using 1
Haar wavelets 1
wavelets code 1
web sites 1
interesting wavelet 1
related news 1
contact Edward Aboufadel 1
avelet Software 1
wavelet software 1
avelet Introductions 1
wavelet Web 1
based tutorials 1
last years several good tutorials 1
beginners tutorials 1
suggest looking 1
no particular 1
National Academy 1
Dana Mackenzie written 1
ll find 1
no equations 1
know why 1
good introduction 1
Physics World 1
above article 1
authors site 1
digital signal engineer 1
continuous wavelet transform 1
State University 1
WWW tutorial 1
introductory wavelet material 1
best graphics 1
based tutorial 1
image compression 1
Experimental Data 1
site appears 1
Experimental Data Tutorial 1
new analysis tool 1
continuous wavelet transform code 1
continuous wavelet transform demonstration 1
using IDL 1
Wavelets Course given 1
surface material comes 1

Many of these appear to be “useful”.
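How the tool works internally is not stated here. One plausible way to get a listing of this shape (an assumption, not a description of the tool) is to count runs of consecutive non-stopwords, never letting a phrase cross punctuation. The stopword list and the `max_words` limit below are made up for the sketch.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "with",
}


def candidate_phrases(text, max_words=4):
    """Count multi-word runs of non-stopwords as candidate keyphrases."""
    phrases = Counter()
    # Split at punctuation so that a phrase never spans a sentence break.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        run = []
        # The trailing None is a sentinel that flushes the final run.
        for word in chunk.split() + [None]:
            if word is None or word in STOPWORDS:
                if 2 <= len(run) <= max_words:
                    phrases[" ".join(run)] += 1
                run = []
            else:
                run.append(word)
    return phrases
```

Calling `candidate_phrases(page_text).most_common()` gives a phrase and frequency listing sorted like the one above. With such a short stopword list, noise like “once every” gets through as well.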

PhraseRate - an HTML Keyphrase Extractor

PhraseRate, by Keith Humphreys, looks very interesting; you can find the paper here.

PhraseRate is interactive and designed to assist human classifiers. It uses a novel keyphrase extraction heuristic for web pages which requires no training, but instead is based on the assumption that most well written webpages “suggest” keyphrases based on their internal structure.

The paper is easy reading and I found the ideas quite compelling. Unfortunately I was not able to locate a compiled software implementation, although the paper specifically states that there is one and it is GPL’d.

The closest I got was

libiViaMetadata

http://ivia.ucr.edu/manuals/libiViaMetadata/current/

A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at

http://ivia.ucr.edu/projects/PhraseRate/

There are some interesting examples there, but finally we need to get to libiViaMetadata:

This library depends on the libiViaCore-5.1.x library. Many people who install this software will do so in conjunction with iVia, DataFountains or the Nalanda iVia Focused Crawler.

So, it appears that PhraseRate has been subsumed under this much larger C++ library in a very large project, and it is all *NIX, too.

Pity. A bit too large a port to take on in my spare time.

Some of it sounds fascinating, too, particularly the Data Fountains and Focussed Crawling concepts:

DataFountains is a tool for discovering and describing Internet resources about a particular topic. After signing on the user is guided through a series of Web pages that generate information describing a particular topic.

When the topic is defined, the Nalanda iVia Focused Crawler is used to crawl the Web for new Web pages about the topic. Finally, metadata is generated for each Web page in the result retained from the focused crawl.

If anyone knows of other systems that perform the unsupervised or semi-supervised “keyphrase extraction” task well and run on a Win32 platform, I’d be interested in your comments.

Can Text Mining detect Media Bias? How hard can it be?

Chris Lloyd, over at Fishing-In-The-Bay, recently posted a fascinating article, “Measuring Media Slant”.

It is a very interesting application of text classification. In a nutshell, the idea is to train a text classifier on politicians’ published speeches so that it predicts their party affiliation.

Then apply that same classifier to text extracted from different newspapers and see how the newspapers are classified, as if they were politicians. From that, with a little bit of a leap of faith, we may be entitled to say (or at least privately believe) that this or that media outlet is left or right leaning or, more bluntly, biased.
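To make the two steps concrete, here is a minimal sketch that swaps in a plain multinomial Naive Bayes classifier with add-one smoothing. That choice is an assumption for illustration and is not claimed to be the method of the cited article. The party labels and the toy speeches are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(speeches):
    """speeches: list of (party, text) pairs. Returns a simple model tuple."""
    word_counts = {}
    doc_counts = Counter()
    for party, text in speeches:
        doc_counts[party] += 1
        word_counts.setdefault(party, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def log_scores(model, text):
    """Log posterior (up to a constant) of each party for the given text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for party, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[party] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        scores[party] = score
    return scores


def classify(model, text):
    scores = log_scores(model, text)
    return max(scores, key=scores.get)


# Invented toy data: step 1 trains on speeches labelled by party.
toy_speeches = [
    ("party_a", "workers unions fair wages public health"),
    ("party_a", "public schools workers rights"),
    ("party_b", "tax cuts small business free enterprise"),
    ("party_b", "free markets lower tax business"),
]
model = train(toy_speeches)

# Step 2 treats a newspaper passage as if it were a speech.
editorial = "The editorial called for lower tax on small business"
label = classify(model, editorial)
```

The gap between the per-party scores from `log_scores` is probably more informative than the hard label, since it gives a degree of lean rather than a verdict.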

Now, of course, similarity of language usage, of dialect, does not in and of itself mean that two persons (politician and newspaper editor) share the same political opinions, let alone that the editor is deliberately or unconsciously slanting news (perhaps by differential rates of inclusion or omission, different positioning, different article lengths, or rewrites according to policy-dictated language standards). But it is certainly a fascinating application of text classification.

There might even be an immediate application: a sort of “Media Watch” across all the media, automatically monitoring content. But it would be hard to build, thankfully.

I’d like to make a few comments about the theoretical and technical aspects of replicating the cited work in an Australian context.

And they mostly relate to the practicalities of getting an honest dataset together in the first instance.

Like most statisticians and data miners, I’d rather I had the data in my sweaty paws right now, so I could get to work doing the interesting stuff like keyphrase extraction, and ponder on how I should sample them such that they are in some sense “universally indicative” of political stance.

But, unfortunately, there is a lot of work in getting this text database together.

I won’t even talk about mining the newspapers; I’ll just outline what is involved in mining Hansard, the full record of all business in Australia’s House of Representatives.

And I will confine my attention to just extracting the text, none of the “metadata” (perhaps material that might relate to the nature of the debate, the type - maiden or otherwise - of the speech itself, the standing of the member etc) that could enrich the entire exercise.

Mining Hansard - the options

OK, when we look at Hansard we have a couple of options

  1. Download them as PDF. The download page (well, for 2007) is here, and you will note that it is nicely structured by month and day, giving rise to the immediate thought that we could scrape that “index” page and develop a sampling scheme, something we are going to need to do if we have that humungous amount of data.

    Note that the data in PDF format only spans 1996 to 2007, whereas the HTML files are more extensive, dating from 1981.

    Of course we have no idea if there have been format changes over the years.

  2. Download them as HTML. If we go to the index page for the HTML files then we can see that they are nicely organized by year, and we can drill down to a particular date and thence to a particular speech.

    This is very nice. We have structure, we can distinguish between procedural and policy matters (i.e. speeches), and we can probably use the actual markup to infer various things to help us decode.

So, the HTML option looks very nice and I would probably go that way. I think the best approach would be to build a trainable custom browser that returns a set of “rich features”.

But what if we had to decode PDF?

Well, there are issues, well known issues. PDF is a graphical format and it is not easy to extract the text from an arbitrary document. I believe that it is best to go with an IFilter approach or a commercial (usually expensive) library.

For non-commercial, “batch oriented” extraction of plain text, the best tools I have found are listed below. Be warned: these tools are not perfect and can lose some information, because reverse engineering from graphics to structured text is a really hard problem.

  • XPDF
    Xpdf is an open source viewer for Portable Document Format (PDF) files. The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

    Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler.

  • PDFToHTML
    Pdftohtml is a tool based on the Xpdf package which translates PDF documents into HTML and XML format.
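For the batch side of this, a small wrapper around the `pdftotext` command line tool from Xpdf might look like the sketch below. It assumes `pdftotext` is installed and on the PATH. The folder layout and the function names are hypothetical. The `-layout` option asks pdftotext to keep the physical page layout, which may or may not help with two-column pages like Hansard.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the command line for converting one PDF to a .txt file beside it."""
    pdf_path = Path(pdf_path)
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    cmd += [str(pdf_path), str(pdf_path.with_suffix(".txt"))]
    return cmd


def convert_all(folder):
    """Convert every PDF in a folder, stopping at the first failure."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(pdftotext_command(pdf), check=True)
```

Pdftohtml can be driven the same way; its `-xml` option produces XML with position attributes of the kind shown next.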

Here is some XML generated by a PDF converter

<text top="1040" left="457" width="293" height="14" font="0">was estimated back in 2001 that there were </text>
</page>
<page number="25" position="absolute" top="0" left="0" height="1263" width="892">
<text top="190" left="149" width="175" height="14" font="0">Monday, 12 February 2007 </text>
<text top="190" left="353" width="226" height="14" font="0">HOUSE OF REPRESENTATIVES </text>
<text top="190" left="735" width="12" height="14" font="0">7 </text>
<text top="1078" left="408" width="82" height="14" font="0">CHAMBER </text>
<text top="235" left="149" width="292" height="14" font="0">26,676 homeless people in NSW, and in my </text>
<text top="253" left="149" width="292" height="14" font="0">own region an estimated 1,530 people faced </text>
<text top="271" left="149" width="286" height="14" font="0">homelessness every night. Most of the home-</text>
<text top="289" left="149" width="286" height="14" font="0">lessness in the Illawarra region is in fact hid-</text>
<text top="307" left="149" width="292" height="14" font="0">den, with nearly 46 per cent of the recorded </text>
<text top="325" left="149" width="292" height="14" font="0">homeless population living temporarily with </text>

Note the line breaks and hyphenation. More post processing (sigh)
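A sketch of that post-processing, using the `top` and `left` attributes in the XML: drop the running header and footer by vertical position, read the left column before the right one, and rejoin words hyphenated across lines. The three pixel thresholds are guesses read off the one sample page above, so they would need checking against more pages.

```python
import xml.etree.ElementTree as ET

# Assumed thresholds, taken from the single sample page.
HEADER_BOTTOM = 200   # page header sits at top=190
FOOTER_TOP = 1060     # "CHAMBER" footer sits at top=1078
COLUMN_SPLIT = 450    # left column starts at 149, right column at 457


def page_text(page):
    """Return the body text of one <page> element as a single string."""
    lines = []
    for el in page.iter("text"):
        top, left = int(el.get("top")), int(el.get("left"))
        if HEADER_BOTTOM < top < FOOTER_TOP:
            column = 0 if left < COLUMN_SPLIT else 1
            lines.append((column, top, "".join(el.itertext()).strip()))
    lines.sort()  # left column first, then top to bottom within a column
    out = ""
    for _, _, line in lines:
        if not line:
            continue
        if out.endswith("-"):
            # Line-end hyphen: glue the two halves of the word together.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

One known weakness: a genuine hyphen that happens to fall at the end of a line (as in “Ex-POW”) is removed too, so a dictionary check on the joined word would be a sensible refinement.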

And here is some HTML resulting from a PDF conversion

80 <br>
HOUSE OF REPRESENTATIVES <br>
Monday, 12 February 2007 <br>
personal actions and supporting the needy;  ings of the Ballarat Botanical Gardens, the <br>the family--the most important band of  POW memorial uses the basic idea of a jour-<br>threads in the social fabric of life; mar-<br>
ney. The creator, Peter Blizzard, started a <br>
riage--the core or centre of the family; chil-<br>
pathway, long and straight, interspersed with <br>
dren--our future--as it is important to  shapes like railway sleepers--a reference to <br>praise, uplift and invest in them; education--<br>
the Burma railway. Running parallel to the <br>
investing in our children; community--<br>

Hmm, problems there. Somehow the lines got out of synch.

And here is some text generated from the PDF

responsibility--personal freedom and choice; reward for effort; free enterprise and social equality--balanced with responsibility for

CHAMBER

HOUSE OF REPRESENTATIVES

Monday, 12 February 2007

personal actions and supporting the needy; the family--the most important band of threads in the social fabric of life; marriage--the core or centre of the family; children--our future--as it is important to praise, uplift and invest in them; education-- investing in our children; community-- living out the fruits of the spirit, caring for each other and those less fortunate, mateship and volunteerism; and, of course, faith-- remembering how small we are in the universe, contrasted with the innate value of every human on earth. I believe we live in a time when each day should be a time of thanksgiving--when we remember the tough times and what our forebears fought for; when we remember the good fortune and the bounty available to us; when we remember our responsibilities and our debt to others. I think it is vital that we always ensure that we never take our fortune for granted and that we all strive to maintain it. We must always recognise who helped achieve this great result: senior Australians. And we must remember that Australia must always be the most renowned of all the lands. Australian Ex-Prisoners of War Memorial Ms KING (Ballarat) (5.44 pm)--I rise to speak in this debate to give voice to the grievance of hundreds of Australian exprisoners of war and their families at the government's continued refusal to have the Ex-POW Memorial located in Ballarat declared a national memorial. The government's sheer obstructiveness on this 

Well, it is readable and processable, but note how much structural information is lost: it is hard to detect the end of one speech and the beginning of another, content is broken across page headers, etc.
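Some of that structure can be guessed back with pattern matching. The sketch below strips the running header lines and splits the text wherever something shaped like “Ms KING (Ballarat) (5.44 pm)--” appears. Both patterns are assumptions built from the single extract above; other speech openings (ministers, the Speaker, interjections) would need their own rules.

```python
import re

# Assumed shape of a speech opening: title, SURNAME, (electorate), (time)--
SPEECH_START = re.compile(
    r"\b(?P<title>Mr|Mrs|Ms|Dr) (?P<name>[A-Z][A-Z'-]+(?: [A-Z][A-Z'-]+)*) "
    r"\((?P<electorate>[^)]+)\) \((?P<time>\d{1,2}\.\d{2} [ap]m)\)--"
)
# Running headers that pdftotext leaves on lines of their own.
RUNNING_HEADER = re.compile(
    r"^(CHAMBER|HOUSE OF REPRESENTATIVES|\w+day, \d{1,2} \w+ \d{4})\s*$",
    re.MULTILINE,
)


def split_speeches(text):
    """Split extracted Hansard text into one dict per detected speech."""
    text = RUNNING_HEADER.sub("", text)
    text = re.sub(r"\s+", " ", text)
    starts = list(SPEECH_START.finditer(text))
    speeches = []
    # Text before the first detected opening (the tail of an earlier
    # speech) is dropped.
    for this, nxt in zip(starts, starts[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        speeches.append({
            "speaker": f"{this['title']} {this['name']}",
            "electorate": this["electorate"],
            "time": this["time"],
            "text": text[this.end():end].strip(),
        })
    return speeches
```

Note that the debate title (“Australian Ex-Prisoners of War Memorial” in the extract) ends up glued to the end of the previous speech, which is exactly the kind of loss the HTML version avoids.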

MORE postprocessing needed. Sigh

So, if you have the choice in Text Mining, is HTML better than PDF as a source?

Yup. A lot of the PDF documents you see on “official” websites are actually PDF generated from HTML, or at least both are sourced from a common structured text base (maybe fields in a database, maybe XML).

That means that the further back you can get towards the semantic content, the better off you are in terms of extracting the underlying informational structure. Because HTML is a markup language it is often used as semantic markup, and in a consistent manner across a set of documents.

That means that you can glean intent and importance from the markup.
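A minimal sketch of that idea, using Python's standard `html.parser`: each word is scored by the most “important” tag it sits inside. The tag weights are invented for illustration and would need tuning (or learning) on real pages.

```python
from collections import Counter
from html.parser import HTMLParser

# Made-up weights: how much a word counts inside each tag (default is 1).
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "b": 2, "strong": 2, "a": 2}


class WeightedTerms(HTMLParser):
    """Accumulate word scores weighted by the enclosing markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.scores[word] += weight
```

Feed a page with `parser = WeightedTerms(); parser.feed(html)` and read `parser.scores.most_common()`. None of this is possible once the same content has been flattened into PDF.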
