Agile and Literate Data Entry: Is YAML the Answer?
I have, more frequently than perhaps I would wish, a requirement to enter data, or to have it entered for me, or with me (a collaborative effort).
The data can be simple enough, but not so simple that I would use a spreadsheet for the data entry task; a horizontally scrolling grid is poorly suited to the task, even with locked column headers.
Fully featured data entry programs exist (indeed, I have written quite a few myself), but the paradigm there is that
- one knows exactly what the data looks like beforehand: what the fields are, what the validation procedures are or should be. That is OK for large repetitive data entry jobs (but anyone who has been involved in real life with a data preparation room will know how rarely one’s assumed knowledge translates into practice, and how frequently special cases arise). Importantly, this is not an agile approach. Agile, as in “agile programming”, can be applied to data entry too: it means starting now, getting on with the job, and either treating special cases at some later date or being in a position to quickly adjust the framework so that “special needs” become “system features”.
- a specialist data entry app has to be distributed, or deployed, with the concomitant issues of data definition setup, OS compatibility, version control, and synchronizing. This is a PITA.
Typically I might have a few hundred or a few thousand records to get entered. They don’t fit some standard model, I have a few people available to me to do it, and I don’t want to have to write a new app, tweak an existing app, provide specs or ini files .. or train the people.
Agility rules OK. I want the job done in elapsed time of a day or two or three, with minimal supervision from me.
Text Rules, OK? .. but TEXT IS NOT CSV
Well, if I don’t want a specialist data entry app and I don’t want a spreadsheet, and I do want zero-install .. what else do I want?
Well, I want the format to be “literate” .. that is, I want both the writer (data entry person) and the reader (the person using the data) to be able to read it. Easily. Like a book.
And I want the data to be enterable in ANY text editor. A specialist one if that is what people like, maybe a syntax highlighting editor (there are a couple of cool ones around based on SynEdit and, more recently, on Scintilla/SciTE) .. or maybe Word or Notepad. Anything.
Not all data needs to be further processed. Maybe it just needs to sit there as a reference, to be read/reviewed/searched. Maybe it needs just minimal further processing for readability or publishing purposes (some sort of Tidy or XSLT if one must); maybe it needs validity checking.
OK, what formats do we know that are just text?
- XML. Not an option .. the visual clutter of the markup tags makes it too hard for the data entry person.
- SOX, no (a minor simplification). http://en.wikipedia.org/wiki/Simple_Outline_XML
- JSON .. no thanks. A simplification, but again it is really designed for data exchange between machines.
And Data Entry is Not Data Exchange.
Maybe YAML?
YAML
Well, here it is.
Wikipedia http://en.wikipedia.org/wiki/YAML says that YAML is a “human-readable data serialization format that takes concepts from languages such as XML, C, Python, Perl, as well as the format for electronic mail as specified by RFC 2822”.
But YAML is better than that sounds. “Human readable”? Well, so is XML.
YAML is actually Human Writeable.
btw YAML is an acronym for “YAML Ain’t Markup Language” and pronounced “yamel”. Like camel..
The Wikipedia article is OK, but have a look at the discussion in Symfony, where YAML is embedded.
Symfony is much more than YAML: it is a web applications framework that embraces YAML, a framework that could conceivably be used to build a distributed “natural” data entry program .. but let’s just stick to YAML, its principles and how we write it.
Well, look at how we read it first.
This is a typical YAML data item (in plain text).
house:
  family:
    name: Doe
    parents:
      - John
      - Jane
    children:
      - Paul
      - Mark
      - Simone
  address:
    number: 34
    street: Main Street
    city: Nowheretown
    zipcode: "12345"
In YAML, structure is shown through indentation, sequence items (as in, items in a collection) are denoted by a dash, and key/value pairs within a map are separated by a colon.
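To make the mapping from text to structure concrete, here is the same record written out by hand as nested Python dictionaries and lists. It is only a sketch of what a YAML reader would be expected to hand back, not the output of any particular library; the helper name lookup is made up for illustration.

```python
# The "house" record as the nested mapping/list structure a YAML reader
# would be expected to produce (written by hand here, not parsed).
house = {
    "house": {
        "family": {
            "name": "Doe",
            "parents": ["John", "Jane"],                # sequence items (the dashes)
            "children": ["Paul", "Mark", "Simone"],
        },
        "address": {
            "number": 34,              # unquoted digits are read as an integer
            "street": "Main Street",
            "city": "Nowheretown",
            "zipcode": "12345",        # quoted in the YAML, so it stays a string
        },
    }
}


def lookup(record, path):
    """Follow a dotted path such as 'house.family.name' through nested dicts."""
    node = record
    for key in path.split("."):
        node = node[key]
    return node


children = lookup(house, "house.family.children")
```

Each level of indentation becomes one level of nesting, which is why the quoted zipcode survives as text while the house number becomes a number.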
YAML also has a shorthand syntax to describe the same structure with fewer lines, where arrays are explicitly shown with [] and hashes with {}.
house:
  family: { name: Doe, parents: [John, Jane], children: [Paul, Mark, Simone] }
  address: { number: 34, street: Main Street, city: Nowheretown, zipcode: "12345" }
This looks very WRITEABLE, particularly through line-oriented (as per the first example) blank “templates” that we copy and paste as many times as we like. It is also (presumably) extensible .. in the above example we could have as many children as we liked (including zero) .. but we could also add “fields” as we go along (in which case the final processing app, if there is one, will need to sort things out).
Compare this to a DTD or XSD (schema) driven approach, where the data description is assumed to be known and knowable up front.
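One way the final processing app might “sort things out” is simply to report which records carry fields that were not in the blank template, and which leave template fields out. The following is a minimal Python sketch, assuming the records have already been read into dictionaries; TEMPLATE_FIELDS, field_report and the sample records are all made up for illustration.

```python
# Hypothetical blank template: the fields we expected when the job started.
TEMPLATE_FIELDS = {"name", "parents", "children"}


def field_report(records, expected=TEMPLATE_FIELDS):
    """Return (extra, missing): each maps a field name to the record indexes affected."""
    extra, missing = {}, {}
    for index, record in enumerate(records):
        keys = set(record)
        for field in sorted(keys - expected):      # fields added on the fly
            extra.setdefault(field, []).append(index)
        for field in sorted(expected - keys):      # template fields left out
            missing.setdefault(field, []).append(index)
    return extra, missing


# Invented sample data: the second record adds a field, the third omits one.
records = [
    {"name": "Doe", "parents": ["John", "Jane"], "children": ["Paul", "Mark", "Simone"]},
    {"name": "Roe", "parents": ["Ann"], "children": [], "pets": ["Rex"]},
    {"name": "Poe", "parents": ["Edgar"]},
]

extra, missing = field_report(records)
```

Nothing is rejected at entry time; the report is produced afterwards, so a “special need” can be reviewed and, if it keeps turning up, promoted to a template field.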
There is much more to YAML .. multiple documents/data items in a stream, for instance (separated by --- )
# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey

# Team ranking
---
- Chicago Cubs
- St Louis Cardinals
and much more info about it at http://yaml.org/spec/current.html (more than you will likely want to know). There is some more information about bindings and the grammar at LibYAML http://pyyaml.org/wiki/LibYAML
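Splitting such a multi-document stream does not need a full YAML implementation. Here is a rough Python sketch using only the standard library. It assumes a separator is a line consisting of nothing but three hyphens, and that each document is a flat list of dash items; the full spec allows a good deal more. The function names are invented for this example.

```python
STREAM = """\
# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey

# Team ranking
---
- Chicago Cubs
- St Louis Cardinals
"""


def has_data(lines):
    """True if any line is neither blank nor a full-line comment."""
    return any(l.strip() and not l.lstrip().startswith("#") for l in lines)


def split_documents(text):
    """Split a stream into documents (lists of lines) on lines that are exactly '---'."""
    documents, current = [], []
    for line in text.splitlines():
        if line.strip() == "---":
            if has_data(current):      # ignore chunks holding only comments/blanks
                documents.append(current)
            current = []
        else:
            current.append(line)
    if has_data(current):
        documents.append(current)
    return documents


def sequence_items(lines):
    """Collect the text of each '- item' line in one document."""
    return [line[2:].strip() for line in lines if line.startswith("- ")]


rankings = [sequence_items(doc) for doc in split_documents(STREAM)]
```

Comments and blank lines are skipped rather than attached to either document, which is enough for reading the data back but would not preserve the file exactly.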
R and YAML
There is a YAML package for R http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/YamlR , but as far as I can work out it is only for Unix, and it has a dependency on Syck http://whytheluckystiff.net/syck/ (which may or may not be under development; Wikipedia seems to think not).
In Conclusion:
The YAML concept appears attractive, given that I want to use text based literate and flexible/agile data entry without a specialist data entry app.
But a full implementation of the YAML spec or the Syck version is not without its issues imho, and I would probably opt for allowing only a cut down version of the syntax in any micro-YAML data reader.
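To give a feel for how small such a cut-down reader could be, here is a sketch in Python (standard library only). It is an illustration under stated assumptions, not an implementation of the YAML spec: it accepts only space-indented nested maps, lists of plain scalars, full-line comments, integers and quoted strings. Flow style ([] and {}), lists of maps, multi-line text and everything else in the spec are deliberately left out. All names are made up for this example.

```python
import re


def parse_scalar(text):
    """Quoted text stays a string; bare digits become int; anything else is a string."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    if re.fullmatch(r"-?[0-9]+", text):
        return int(text)
    return text


def tokenize(text):
    """Return (indent, content) pairs, skipping blank lines and full-line comments."""
    rows = []
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            rows.append((len(raw) - len(raw.lstrip(" ")), stripped))
    return rows


def parse_block(rows, pos, indent):
    """Parse the rows at exactly `indent`, starting at pos; return (value, next_pos)."""
    if rows[pos][1].startswith("- "):              # a list of scalars
        items = []
        while pos < len(rows) and rows[pos][0] == indent and rows[pos][1].startswith("- "):
            items.append(parse_scalar(rows[pos][1][2:]))
            pos += 1
        return items, pos
    mapping = {}                                   # otherwise a map of key: value
    while pos < len(rows) and rows[pos][0] == indent:
        key, sep, rest = rows[pos][1].partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {rows[pos][1]!r}")
        pos += 1
        if rest.strip():
            mapping[key.strip()] = parse_scalar(rest)
        elif pos < len(rows) and rows[pos][0] > indent:
            # The value is a nested block that must be indented further than its key.
            mapping[key.strip()], pos = parse_block(rows, pos, rows[pos][0])
        else:
            mapping[key.strip()] = None            # key left blank in the template
    return mapping, pos


def load_micro_yaml(text):
    """Read one record written in the cut-down syntax described above."""
    rows = tokenize(text)
    if not rows:
        return None
    value, pos = parse_block(rows, 0, rows[0][0])
    if pos != len(rows):
        raise ValueError(f"unexpected indentation near {rows[pos][1]!r}")
    return value


RECORD = """\
house:
  family:
    name: Doe
    parents:
      - John
      - Jane
  address:
    number: 34
    zipcode: "12345"
"""

data = load_micro_yaml(RECORD)
```

Refusing anything outside the small subset (by raising an error) is a design choice: for data entry it seems better to flag an odd line for a human to look at than to guess what was meant.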
YAML does seem to be most active in the agile languages area (see, for example, the Yaml Cookbook http://yaml4r.sourceforge.net/cookbook/ at the YamlForRuby site).
John Aitchison said,
March 7, 2008 @ 2:55 pm
For a light hearted look at YAML and what it can do for you in the real world,
see “YAML, Because XML is for Wussies”: http://www.zefhemel.com/archives/2004/10/30/yaml-because-xml-is-for-wussies
If you are seriously interested in the relationship between analytics and documentation/reporting have a close look at the “Model Based Documentation” approach .. there is some YAML in there http://www.openresource.com/MBD/mbd_extraction.php
but it is a much larger topic than just YAML.. it is about how we bring order and retrievability to document collections with a “modelling” approach, how we integrate wisdom, information and data.
http://www.openresource.com/MBD/mbd_overview.php
John Aitchison said,
May 4, 2008 @ 2:33 pm
If you aren’t yet convinced about YAML, here is my favourite Delphi implementation of JSON: progdigy.com - JSON Toolkit
More here, at JSON, if you like other languages .. there is even one in R:
CRAN - Package rjson: Converts R object into JSON objects and vice-versa