Can Text Mining detect Media Bias? How hard can it be?
Chris Lloyd, over at Fishing-In-The-Bay, recently posted a fascinating article, “Measuring Media Slant”.
It is a very interesting application of text classification: in a nutshell, the idea is to train a text classifier on politicians' published speeches so that it predicts their party affiliation.
Then apply that same classifier to text extracted from different newspapers and see how the newspapers are classified, as if they were politicians. From that, with a little leap of faith, we may be entitled to say (or at least privately believe) that this or that media outlet is left or right leaning or, more bluntly, biased.
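To make the idea concrete, here is a minimal sketch. It swaps in a tiny multinomial naive Bayes classifier written from scratch; it is not the method used in the cited article, and the "speeches", party labels and newspaper snippets are invented placeholders, not real data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude whitespace tokenizer; real work would use phrases, stemming, etc.
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy "speeches" with placeholder party labels.
speeches = [
    "tax relief for working families",
    "workers rights and fair wages",
    "tax relief and small business",
    "fair wages for workers",
]
parties = ["party_a", "party_b", "party_a", "party_b"]

model = NaiveBayes().fit(speeches, parties)

# Step two: classify newspaper text as if it were a politician's speech.
print(model.predict("editorial calls for tax relief"))
print(model.predict("unions demand fair wages"))
```

The point of the sketch is the two-step shape: train on text whose party label is known, then score text whose "party" is only a metaphor.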
Of course, similarity of language or dialect does not in itself mean that two people (a politician and a newspaper editor) share the same political opinions, let alone that the editor is deliberately or unconsciously slanting the news (perhaps through differential rates of inclusion or omission, different positioning, different article lengths, or rewrites to policy-dictated language standards). Still, it is certainly a fascinating application of text classification.
There might even be an immediate application - a sort of “Media Watch” across all the media, automatically monitoring content. But it would be hard to build, thankfully.
I’d like to make a few comments about the theoretical and technical aspects of replicating the cited work in an Australian context.
And they mostly relate to the practicalities of getting an honest dataset together in the first instance.
Like most statisticians and data miners, I’d rather have the data in my sweaty paws right now, so I could get to work on the interesting stuff like keyphrase extraction, and ponder how I should sample the keyphrases so that they are in some sense “universally indicative” of political stance.
But, unfortunately, there is a lot of work in getting this text database together.
I won’t even talk about mining the newspapers, just outline what is involved in mining Hansard, the full record of all business in Australia’s House of Representatives.
And I will confine my attention to just extracting the text, none of the “metadata” (perhaps material that might relate to the nature of the debate, the type - maiden or otherwise - of the speech itself, the standing of the member etc) that could enrich the entire exercise.
Mining Hansard - the options
OK, when we look at Hansard we have a couple of options:
- Download them as PDF. The download page (well, for 2007) is here, and you will note that it is nicely structured by month and day, giving rise to the immediate thought that we could scrape that “index” page and develop a sampling scheme, something we are going to need to do given that humungous amount of data.
Note that the data in PDF format only spans 1996 to 2007, and the HTML files are more extensive dating from 1981.
Of course we have no idea if there have been format changes over the years.
- Download them as HTML. If we go to the index page for the HTML files then we can see that they are nicely organized by year, and we can drill down to a particular date and thence to a particular speech.
This is very nice. We have structure, we can distinguish between procedural and policy matters (i.e. speeches), and we can probably use the actual markup to infer various things that help us decode.
So, the HTML option looks very nice and I would probably go that way. I think the best approach would be to build a trainable custom browser that returns a set of “rich features”.
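Whichever format we pick, the month-and-day index suggests an obvious sampling scheme: take a fixed number of sitting days from each month. Here is a rough sketch, assuming the dates have already been scraped from the index page; the function name and the list of sitting days below are made up for illustration.

```python
import random
from collections import defaultdict
from datetime import date


def sample_sitting_days(days, per_month, seed=0):
    """Pick up to `per_month` days at random from every (year, month) group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_month = defaultdict(list)
    for day in days:
        by_month[(day.year, day.month)].append(day)

    chosen = []
    for month in sorted(by_month):
        pool = sorted(by_month[month])
        chosen.extend(rng.sample(pool, min(per_month, len(pool))))
    return sorted(chosen)


# Hypothetical sitting days, standing in for what a scrape of the index returns.
days = [date(2007, 2, d) for d in (6, 7, 8, 12, 13, 14, 15)]
days += [date(2007, 3, d) for d in (20, 21)]

sample = sample_sitting_days(days, per_month=2)
print(sample)
```

Stratifying by month keeps one busy sitting period from swamping the sample; whether month is the right stratum (rather than, say, parliament or session) is a design choice I have not tested.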
But what if we had to decode PDF?
Well, there are issues, and they are well known. PDF is a graphical format and it is not easy to extract the text from an arbitrary document. I believe that it is best to go with an IFilter approach or a commercial (usually expensive) library - here are some:
For non-commercial “batch oriented” extraction of plain text, about the best I have found are the following (although be warned, these tools are not perfect and can lose some information: reverse engineering from graphics to structured text is a real hard problem):
- XPDF
Xpdf is an open source viewer for Portable Document Format (PDF) files. The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities. Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and should run on pretty much any system with a decent C++ compiler.
- PDFToHTML
Pdftohtml is a tool based on the Xpdf package which translates PDF documents into HTML and XML format.
Here is some XML generated by a PDF converter
<text top="1040" left="457" width="293" height="14" font="0">was estimated back in 2001 that there were </text>
</page>
<page number="25" position="absolute" top="0" left="0" height="1263" width="892">
<text top="190" left="149" width="175" height="14" font="0">Monday, 12 February 2007 </text>
<text top="190" left="353" width="226" height="14" font="0">HOUSE OF REPRESENTATIVES </text>
<text top="190" left="735" width="12" height="14" font="0">7 </text>
<text top="1078" left="408" width="82" height="14" font="0">CHAMBER </text>
<text top="235" left="149" width="292" height="14" font="0">26,676 homeless people in NSW, and in my </text>
<text top="253" left="149" width="292" height="14" font="0">own region an estimated 1,530 people faced </text>
<text top="271" left="149" width="286" height="14" font="0">homelessness every night. Most of the home-</text>
<text top="289" left="149" width="286" height="14" font="0">lessness in the Illawarra region is in fact hid-</text>
<text top="307" left="149" width="292" height="14" font="0">den, with nearly 46 per cent of the recorded </text>
<text top="325" left="149" width="292" height="14" font="0">homeless population living temporarily with </text>
Note the line breaks and hyphenation, and the running header and footer mixed in with the body text. More post-processing (sigh).
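The sort of post-processing this XML needs can be sketched as follows. The sketch assumes a two-column page, drops anything above or below guessed header and footer positions, and glues a line ending in a single hyphen onto the next one. The cut-off values and the function name are my own assumptions, tuned only to the sample above, and the hyphen rule will wrongly join genuinely hyphenated words that happen to break at a line end.

```python
import xml.etree.ElementTree as ET

# A well-formed page built from a few of the <text> lines shown above.
SAMPLE_PAGE = """<page number="25" position="absolute" top="0" left="0" height="1263" width="892">
<text top="190" left="149" width="175" height="14" font="0">Monday, 12 February 2007 </text>
<text top="1078" left="408" width="82" height="14" font="0">CHAMBER </text>
<text top="271" left="149" width="286" height="14" font="0">homelessness every night. Most of the home-</text>
<text top="289" left="149" width="286" height="14" font="0">lessness in the Illawarra region is in fact hid-</text>
<text top="307" left="149" width="292" height="14" font="0">den, with nearly 46 per cent of the recorded </text>
</page>"""


def page_body_text(page_xml, header_bottom=200, footer_top=1070):
    """Return the body text of one page, columns in reading order.

    header_bottom and footer_top are guessed pixel cut-offs (assumptions).
    """
    page = ET.fromstring(page_xml)
    middle = int(page.get("width")) / 2
    rows = []
    for node in page.iter("text"):
        top, left = int(node.get("top")), int(node.get("left"))
        if top <= header_bottom or top >= footer_top:
            continue  # running header or footer
        column = 0 if left < middle else 1
        rows.append((column, top, left, "".join(node.itertext()).strip()))

    rows.sort(key=lambda row: row[:3])  # left column first, then top to bottom
    out = ""
    for _, _, _, text in rows:
        if out.endswith("-") and not out.endswith("--"):
            out = out[:-1] + text      # naive de-hyphenation
        elif out:
            out += " " + text
        else:
            out = text
    return out


print(page_body_text(SAMPLE_PAGE))
```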
And here is some HTML resulting from a PDF conversion
80 <br>
HOUSE OF REPRESENTATIVES <br>
Monday, 12 February 2007 <br>
personal actions and supporting the needy; ings of the Ballarat Botanical Gardens, the <br>the family--the most important band of POW memorial uses the basic idea of a jour-<br>threads in the social fabric of life; mar-<br>
ney. The creator, Peter Blizzard, started a <br>
riage--the core or centre of the family; chil-<br>
pathway, long and straight, interspersed with <br>
dren--our future--as it is important to shapes like railway sleepers--a reference to <br>praise, uplift and invest in them; education--<br>
the Burma railway. Running parallel to the <br>
investing in our children; community--<br>
Hmm, problems there. Somehow the lines of the two columns got out of sync.
And here is some text generated from the PDF
responsibility--personal freedom and choice; reward for effort; free enterprise and social equality--balanced with responsibility for
CHAMBER
HOUSE OF REPRESENTATIVES
Monday, 12 February 2007
personal actions and supporting the needy; the family--the most important band of threads in the social fabric of life; marriage--the core or centre of the family; children--our future--as it is important to praise, uplift and invest in them; education-- investing in our children; community-- living out the fruits of the spirit, caring for each other and those less fortunate, mateship and volunteerism; and, of course, faith-- remembering how small we are in the universe, contrasted with the innate value of every human on earth. I believe we live in a time when each day should be a time of thanksgiving--when we remember the tough times and what our forebears fought for; when we remember the good fortune and the bounty available to us; when we remember our responsibilities and our debt to others. I think it is vital that we always ensure that we never take our fortune for granted and that we all strive to maintain it. We must always recognise who helped achieve this great result: senior Australians. And we must remember that Australia must always be the most renowned of all the lands. Australian Ex-Prisoners of War Memorial Ms KING (Ballarat) (5.44 pm)--I rise to speak in this debate to give voice to the grievance of hundreds of Australian exprisoners of war and their families at the government's continued refusal to have the Ex-POW Memorial located in Ballarat declared a national memorial. The government's sheer obstructiveness on this
Well, it is readable and processable, but note how much structural information is lost: it is hard to detect the end of one speech and the beginning of another, content is broken across page headers, etc.
MORE post-processing needed. Sigh.
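Here is roughly what that post-processing might look like: drop the page-header lines, then split on the pattern that introduces a speaker. Both regular expressions are guesses based only on the extract above ("Ms KING (Ballarat) (5.44 pm)--"); other years, ministers' titles or interjections may well look different. The sample text is abridged from the extract.

```python
import re

# Lines that look like running page headers (guessed from the extract above).
HEADER_LINE = re.compile(
    r"(?:CHAMBER|HOUSE OF REPRESENTATIVES|\d{1,3}"
    r"|(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day, \d{1,2} [A-Z][a-z]+ \d{4})"
)

# "Ms KING (Ballarat) (5.44 pm)--" : title, name, electorate, time.
SPEAKER = re.compile(
    r"\b(?P<title>Mr|Mrs|Ms|Dr)\s+"
    r"(?P<name>[A-Z][A-Za-z'\-]+(?:\s+[A-Z][A-Za-z'\-]+)*)\s+"
    r"\((?P<electorate>[^)]+)\)\s+"
    r"\((?P<time>\d{1,2}\.\d{2}\s+[ap]m)\)--"
)


def strip_headers(text):
    """Remove header lines and rejoin the remaining lines with spaces."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip() and not HEADER_LINE.fullmatch(line.strip())]
    return " ".join(kept)


def split_speeches(text):
    """Return one dict per detected speaker turn; text before the first is dropped."""
    matches = list(SPEAKER.finditer(text))
    speeches = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        speech = match.groupdict()
        speech["body"] = text[match.end():end].strip()
        speeches.append(speech)
    return speeches


SAMPLE = """free enterprise and social equality--balanced with responsibility for
CHAMBER
HOUSE OF REPRESENTATIVES
Monday, 12 February 2007
personal actions and supporting the needy; Australian Ex-Prisoners of War Memorial Ms KING (Ballarat) (5.44 pm)--I rise to speak in this debate"""

clean = strip_headers(SAMPLE)
speeches = split_speeches(clean)
print(speeches)
```

Note that the debate title ("Australian Ex-Prisoners of War Memorial") still ends up glued to the tail of the previous speech; in plain text there is nothing left to tell it apart from body text.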
So, if you have the choice in Text Mining, is HTML better than PDF as a source?
Yup. A lot of the PDF documents you see on “official” websites are actually PDF generated from HTML, or at least both are sourced from a common structured text base (maybe fields in a database, maybe XML).
That means that the further back you can get towards the semantic content, the better off you are in terms of extracting the underlying informational structure. Because HTML is a markup language it is often used as semantic markup, and in a consistent manner across a set of documents.
That means that you can glean intent and importance from the markup.
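As a small illustration of gleaning structure from markup, here is a sketch using Python's standard html.parser. The markup is entirely hypothetical: I have not inspected the real Hansard HTML, so the tag and class names ("speech", "speaker", "procedural") are invented stand-ins for whatever consistent markup the real pages use.

```python
from html.parser import HTMLParser

# Hypothetical markup; the class names are assumptions, not Hansard's own.
SAMPLE_HTML = """
<div class="procedural"><p>Question agreed to.</p></div>
<div class="speech">
  <h3 class="speaker">Ms KING</h3>
  <p>I rise to speak in this debate.</p>
  <p>The memorial is located in Ballarat.</p>
</div>
"""


class SpeechExtractor(HTMLParser):
    """Collect speaker and text for each <div class="speech">, skipping the rest."""

    def __init__(self):
        super().__init__()
        self.speeches = []       # list of {"speaker": str, "text": str}
        self._in_speech = False  # inside a speech div? (nested divs not handled)
        self._field = None       # field that incoming text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "div":
            self._in_speech = css_class == "speech"
            if self._in_speech:
                self.speeches.append({"speaker": "", "text": ""})
        elif self._in_speech and css_class == "speaker":
            self._field = "speaker"
        elif self._in_speech and tag == "p":
            self._field = "text"

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_speech = False
        # Simplification: any closing tag ends the current field, so text that
        # follows inline markup inside a paragraph would be lost.
        self._field = None

    def handle_data(self, data):
        if self._field and data.strip():
            speech = self.speeches[-1]
            speech[self._field] = (speech[self._field] + " " + data.strip()).strip()


parser = SpeechExtractor()
parser.feed(SAMPLE_HTML)
print(parser.speeches)
```

With markup like this, separating procedural matter from speeches is a matter of reading a class attribute, rather than guessing from pixel positions or text patterns as the PDF route requires.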