The state of the art in Keyphrase Extraction?
I’d like some input from anyone who has an interest in “keyphrase extraction”, which, among its many applications, includes the automated or semi-automated generation of keyphrases for web pages. (I am not particularly interested at the moment in “keyphrase assignment”, i.e. deciding which of a fixed set of keyphrases should be attached to a given document.)
KEA - Keyphrase Extraction Algorithm
The first app that springs to mind is KEA, but after experimenting with it I thought it really only appropriate for the task of keyphrase assignment (keyphrase indexing).
It does require quite a lot of hand work in preparing a training set, with keyphrases being assigned to those documents by “experts”. And I don’t find the assignment algorithm particularly compelling either: essentially it works by applying the learned model (Naive Bayes, with features reflecting the position of a term in the document and its relative frequency).
(On the plus side, the code is in Java and easy enough to use, and you do not need the full WEKA framework.)
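To make those two features concrete, here is a minimal Python sketch of how they might be computed for candidate phrases. It is not KEA's code: the function names, the simple n-gram candidate generation and the normalisation by document length are all my own assumptions, and the Naive Bayes step that would consume these features is left out.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def candidate_features(text, max_len=3):
    """Map each n-gram (up to max_len words) to a pair
    (relative_frequency, first_position), both normalised by document length.

    Hypothetical helper for illustration only; not taken from KEA.
    """
    words = tokenize(text)
    total = len(words)
    counts = {}
    first = {}
    for n in range(1, max_len + 1):
        for i in range(total - n + 1):
            phrase = " ".join(words[i:i + n])
            counts[phrase] = counts.get(phrase, 0) + 1
            # Remember only the earliest position at which the phrase starts.
            first.setdefault(phrase, i)
    return {p: (counts[p] / total, first[p] / total) for p in counts}

features = candidate_features("wavelet transform and wavelet analysis")
```

A trained classifier would then be asked, for each candidate, how likely a phrase with that frequency and that first position is to be a keyphrase.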
Extractor
Then there is Extractor, a commercial offering from Peter Turney, although you can use it online at ExtractorLive. It apparently requires no training (at least not for the end user) and I seem to recall that there is some Genetic Algorithm at work.
I was not greatly impressed by that. When I fed it Amara’s Wavelet Page, I got back (as keyphrases) only
wavelet — signals — wavelet transform — wavelet digest — mathematics — representing — bibliography
which does not seem like a lot to me from such a content rich page.
seo keyword analysis .. at seokeywordanalysis.com
By contrast, Andy Hoskinson’s Keyword Analysis Tool applied to the same URL (Amara’s page) gave me a much richer set
Keyphrase Frequency
wavelet analysis 10
Home Page 5
Wavelet Digest 4
Wavelet Transform 4
frequency analysis 3
Classic version 3
HTML version 3
Executive version 3
subscribe wd 3
unsubscribe wd 3
Wavelet IDR Center 3
wavelet methods 3
Practical Guide 3
Signal Processing 3
Wavelet compression 3
Rehmi Post 3
Wavelet Transform Theory 3
Wavelet Patents 2
Sound Fun 2
NuHAG Gabor Server 2
Discovering Wavelets 2
using wavelets 2
Fourier analysis 2
partial differential equations 2
full digest 2
mail address 2
relevant command 2
body text 2
Fourier transforms 2
Fast Fourier Transform 2
related code links 2
Gabor analysis 2
Edward Aboufadel 2
Grand Valley State University 2
Daubechies wavelets 2
wavelet information 2
produced popular science article 2
Little Wave 2
Big Future 2
Really Friendly Guide 2
Wim Sweldens 2
wavelets site 2
Wavelet Wading Pool 2
Alex Nicolaou 2
Jacques Lewalle 2
wavelet transforms 2
LaTeX2HTML version 2
3D Progressive Transmission Using Wavelets 2
Benj Lipchak 2
computer graphics 2
web pages 2
Wavelet Packet Transform 2
Gentle Introduction 2
Andrey Kiselev 2
Wavelets versus Fourier Course 2
short presentation 2
Wavelets Transformationskodierung 2
Salzburg University 2
software etc 2
Wavelet Papers 2
et al 2
Amara Graps 2
Wavelet Page 1
Wavelet Overview 1
Wavelet Blogsphere 1
Fourier Trivia 1
IDR Center 1
Beginners Bibliography 1
WWW Introductions 1
WWW Sites 1
page contains 1
idea behind wavelets 1
processing data 1
representing data 1
Joseph Fourier 1
Wavelet algorithms process data 1
notice small 1
wavelets interesting 1
approximating sharp 1
approximating functions 1
approximating data 1
sharp discontinuities 1
wavelet prototype 1
prototype wavelet 1
frequency version 1
using coefficients 1
wavelet functions 1
performed using 1
corresponding wavelet coefficients 1
best wavelets 1
coefficients below 1
makes wavelets 1
data compression 1
band coding 1
Wavelets Paper 1
Computational Sciences 1
Online version 1
avelet Digest 1
free newsletter 1
once every 1
concerning wavelets 1
Wavelet Digest comes 1
concise edition 1
HTML mail 1
full articles 1
one edition 1
submission process 1
new Wavelet Digest 1
particular questions concerning 1
back issues 1
avelet Blogsphere 1
new blog 1
find wavelet idea 1
link below 1
Google blog search page 1
Wavelet Blog search 1
avelet Patents 1
via tutorial articles 1
special issues 1
method back 1
Fourier series 1
Fourier domains 1
corresponding domain 1
every domain 1
Joseph Fourier history 1
Fourier Theory 1
Fast Fourier Transform tutorial 1
Harmonic Analysis 1
new wavelet 1
national center 1
download publications 1
Gabor Server 1
Numerical Harmonic Analysis Group 1
ideas related 1
Web site 1
image processing using 1
Haar wavelets 1
wavelets code 1
web sites 1
interesting wavelet 1
related news 1
contact Edward Aboufadel 1
avelet Software 1
wavelet software 1
avelet Introductions 1
wavelet Web 1
based tutorials 1
last years several good tutorials 1
beginners tutorials 1
suggest looking 1
no particular 1
National Academy 1
Dana Mackenzie written 1
ll find 1
no equations 1
know why 1
good introduction 1
Physics World 1
above article 1
authors site 1
digital signal engineer 1
continuous wavelet transform 1
State University 1
WWW tutorial 1
introductory wavelet material 1
best graphics 1
based tutorial 1
image compression 1
Experimental Data 1
site appears 1
Experimental Data Tutorial 1
new analysis tool 1
continuous wavelet transform code 1
continuous wavelet transform demonstration 1
using IDL 1
Wavelets Course given 1
surface material comes 1
many of which appear to be “useful”
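I don't know how that tool works internally, but output of this general shape can be produced by a very simple approach: break the text at punctuation and stopwords, then count the multi-word runs that remain. The sketch below assumes exactly that. The stopword list and the function name are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real tool would use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "on", "with", "by", "it", "this", "that"}

def phrase_counts(text, min_words=2):
    """Count runs of consecutive non-stopwords that are at least min_words long."""
    counts = Counter()
    # Punctuation ends a phrase, so split the text into fragments first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        run = []
        # The trailing None flushes whatever run is open at the end of the fragment.
        for word in fragment.split() + [None]:
            if word is None or word in STOPWORDS:
                if len(run) >= min_words:
                    counts[" ".join(run)] += 1
                run = []
            else:
                run.append(word)
    return counts

sample = ("Wavelet analysis is the basis of the wavelet transform. "
          "Wavelet analysis and Fourier analysis are related.")
counts = phrase_counts(sample)
```

Calling `counts.most_common()` then gives a phrase and frequency listing in the same style as the one above.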
PhraseRate - an HTML Keyphrase Extractor
by Keith Humphreys looks very interesting - you can find the paper here
PhraseRate is interactive and designed to assist human classifiers. It uses a novel keyphrase extraction heuristic for web pages which requires no training, but instead is based on the assumption that most well written webpages “suggest” keyphrases based on their internal structure.
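To give a feel for what “suggested by internal structure” could mean in practice, here is a rough Python sketch that scores words more highly when they appear inside titles, headings or emphasis tags. This is not the PhraseRate heuristic itself: the tag weights, the class name and the choice to score single words rather than phrases are all assumptions made purely for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration (not from the paper).
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "strong": 2.0}

class StructureScorer(HTMLParser):
    """Accumulate a score per word, weighted by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag and anything left open inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Text takes the highest weight among the tags currently enclosing it.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.scores[word] += weight

def score_words(html):
    """Parse an HTML string and return the structure-weighted word scores."""
    scorer = StructureScorer()
    scorer.feed(html)
    scorer.close()
    return scorer.scores

page = ("<html><head><title>Wavelet Page</title></head><body>"
        "<h1>Wavelet analysis</h1><p>A page about wavelet methods.</p>"
        "</body></html>")
scores = score_words(page)
```

No training data is involved: the page's own markup decides which words count for more, which is the part of the idea I found compelling.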
The paper is easy reading and I found the ideas quite compelling. Unfortunately I was not able to locate a compiled software implementation, although the paper specifically states that there is one and it is GPL’d.
The closest I got was
libiViaMetadata
http://ivia.ucr.edu/manuals/libiViaMetadata/current/
A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at
http://ivia.ucr.edu/projects/PhraseRate/
There are some interesting examples there, but finally we need to get to libiViaMetadata.
This library depends on the libiViaCore-5.1.x library. Many people who install this software will do so in conjunction with iVia, DataFountains or the Nalanda iVia Focused Crawler.
So, it appears that PhraseRate has been subsumed under this much larger C++ library in a very large project - it is all *NIX, too.
Pity. A bit too large a port to take on in my spare time.
Some of it sounds fascinating, too .. particularly the Data Fountains and Focussed Crawling concepts
DataFountains is a tool for discovering and describing Internet resources about a particular topic. After signing on the user is guided through a series of Web pages that generate information describing a particular topic. When the topic is defined, the Nalanda iVia Focused Crawler is used to crawl the Web for new Web pages about the topic. Finally, metadata is generated for each Web page in the result retained from the focused crawl.
If anyone knows of other systems that perform the unsupervised or semi-supervised “keyphrase extraction” task well and run on a Win32 platform, I’d be interested in your comments.