“Horizontal Data Dancing” - Record Parsing – The First Stage
The first step in any data mining project is often to “clean up the data”.
Data Cleansing is a very large area, and subsumes fields like deduplication, record linkage, constructive induction, imputation for missing data and so on. A lot of these are statistical problems in nature and I won’t get into them in this blog, but concentrate on the very first step – parsing.
The statistician-analyst is often keen to get on with the work and “see what the data has to say”, so the process of getting the data into shape tends to be approached with more than a smattering of ad-hocery.
This does not need to be the case. It is possible to construct “intelligent objects” to help us with the general problem of getting certain classes of data into analyzable form.
I want to focus just on simple record oriented structures, the data you might find in CSV (Comma Separated Variables) files and spreadsheets, sometimes in databases. Accounting records can be of this form and it is not uncommon to have to cleanse separate files of invoices and payments prior to trying to match them.
And I want to talk about “horizontal” records .. those organized “by rows”, where one (occasionally more) line of data constitutes the data of interest. This is in contrast to “vertical” records where the desired information is embedded in more complex structures : examples might be web pages or email messages.
Now, these records may or may not have “fields” (i.e. nominal placeholders for data of a given type). Spreadsheets commonly have specific fields (the columns), CSV files may have headers or the fields may simply be implied by the order within the record.
Human Data Entry – somewhere along the line
The common factor is that the data has usually been entered by humans directly or indirectly (eg via optical character recognition of a scanned document), and not subject to extensive logical checks while being entered – hence fields can contain data other than what might have been intended when the data entry project was started, fields can overflow into others, fields that should have data do not, the contents of fields can be swapped around, a field can be used as a pointer to another field (eg “refer to comments”), upper and lower case can be mixed, abbreviations can be employed.
Complicating factors can include the fact that different people have worked in the data entry task over the years, and there may have been a merge of different files with different formats.
Problems with inconsistent date formats
There are OFTEN problems with dates. Consider
26 Jul 93 04/18/1996 (US style mmddyyyy) 980810 19990519 001210 00001210 02/07/97 Wed, 2 Jul 1997 5-May-1995
Any or all of such variants can be found in a given recordset. But we can build a smart date recognizer to sort this mess out – more details below.
Problems with Quotes and Separators - the need for robust, recoverable CSV decoding
Many CSV files, particularly if exported from other programs, have quotation marks around alpha fields.
Problems arise if this is not consistent, or if there are “ inside such a delimited field.
For example
"Dick Richards","1001 "The Glen"",CA1010
is hard to decode correctly.
One solution to this is to have a recovery program .. if the parser fails for a reason consistent with having embedded quotes, then a smart recovery routine can transform such a record into eg
"Dick Richards","1001 *The Glen*",CA1010
which we can then parse correctly.
We often have blanks between the tokens - these may or may not be significant. Other characters can cause an error e.g. “123 Smith St”,|”Fredville”, or maybe we just want to treat them as ignorable junk.
Multiple adjacent separators also can give us problems. Eg
"Fred Smith",,,"USA"
or equivalently
"Fred Smith" , "", "" , "USA"
Leading or trailing separators may occur and may be permissible
,,"Fred Smith",,
will be interpreted as
" "," ","Fred Smith"," "," "
Multiple adjacent delimiters e.g. “”" may also be the source of errors, and the character chosen as a delimiter should not be present in the token context e.g.
"123 "The Haven"", "Smithfield"
is not easy to parse correctly.
I have been using the terms “parse”, “tokenize” and “decode” rather loosely and interchangeably – at this point I have only been concerned with some of the issues involved in just splitting up the input data records into the constituent elements, not yet grappling with the problem of trying to recognize as being of particular types or as being correct or incorrect.
Human readable – but not so easy for machines
Such records are commonly understandable by humans (taken one record at a time) exercising their “common sense” : those same records can be a nightmare to analyze.
How can we make progress on this problem, in a sensible and practical way.
The short answer – marry some AI techniques with Object Oriented Programming (OOP).
Getting a bit smarter -the “LooksLike” and Pacman Paradigms
Suppose you had an agent – an intelligent piece of software – that could examine a token ( think of a token as just a word, one of the separate parts of the input text) and reports back to you “this looks like a xyz type” (where xyz is, say, a salutation, a street number, a postal code, a suburb name, a dollar figure …).
Suppose also that this agent can report back to you a degree of certainty (this looks like a zipcode, and I am 80% certain) or even absolute certainty (I am certain this is a zipcode BECAUSE YOU TOLD ME SO). So “the little agent that could” (TLATC- too clumsy, let’s call her THELMA) has the ability to ask questions (of you, the super intelligence) and store the answers – she has memory.
Now it is not too hard to write the program code for such an agent .. just a couple of dialog boxes and a load/save to file mechanism (for the memory), and an algorithm to do the recognition and probability calculations. It would be nice, too, if THELMA was not too insistent on asking you questions .. ideally she should know something (from past exercises) about her field of expertise.
And here we get into an interesting area. The field of expertise. Do we want THELMA to be omniscient, or do we want her to know only what she knows, to be an expert and a member of a team of specialists, or a magnificent polymath?
If she is a specialist then, given a token she can say “no, that is not in my area of expertise, I will pass on that one, you need to ask a different specialist”
There are strong theoretical arguments for the “committee of learners” approach. (You can Google on that term and maybe Dan Roth, one of the pioneers, but be warned – the machine learning literature is vast). There are also strong practical arguments .. we can build an ensemble of such learners, then when the next problem comes along just drag Thelma and Louise and HairyMacLary out of the closet and put them to work.
Are we getting too far afield here, too remote from the “simple” problem of parsing? No, I think not. The parsing problem is not simple, agents are easy to program and an agent swarm is an elegant solution.
OK, so far what we have is a bunch of somewhat smart agents arguing over each token. When I say “we” I mean the supervisory program, and that supervisory program is the one that will make the final decision and produce the output – the cleansed data.
This approach can be improved.
Consider that so far we have something like a Pacman game. Agents running around gobbling up tokens in just about any order they please – actually it is the fittest agent that gets to “eat” a particular token, and the order of processing (“eating”) is usually left to right – but nevertheless the mental picture of a Pacman game is apropos.
This is great, works well etc. But note that our agents know nothing of context or of constraints.
Context. Thelma does not know what the supervisor decided the previous token was, and that is probably OK because Thelma is a token classifier not a phrase or higher construct classifier – she does not recognize an “address” just maybe a zip code.. One way forward on this is to have agents supervising agents.
Constraints. We may KNOW, from our preliminary analyses, that there is at most one zipcode in any record (or if more than one is found, we have a notifiable problem). Again, supervisors provide a way forward.
Sequence is a problem somewhere between context and constraints. We may know that we should never have 2 adjacent tokens both classified as zipcodes, we may know that the third field should contain a higher value than the fourth. Finite state machines to manage agent recommendations are a promising avenue.
Management of context, constraints and sequence expectations is not too hard for a supervisor program to implement, certainly not on a case by case basis. And here lies an important issue. Generalization of the problem is all very well, and clearly we can get a lot of efficiencies across a lot of problems by implementing a suitable re-useable low level framework, in this case agents. Taking a little more time to build such an agent library is clearly in our interests, but problem and dataset specific issues are best addressed by custom logic.
One final word, because we have said just about enough on this subject. In some datasets there is a need to move field contents from one to another, swapping, because the data has been entered in the wrong places. We can generalize this issue to that of “optimum repair” .. given a set of expectations about the contents of a record ie which fields there should be, can we repair the original record to most closely match that expectation, with a simple swapping mechanism – perhaps under the control of a GA. A useful approach sometimes.