Archive for musings

Agile and Literate Data Entry: Is YAML the Answer?

I have, more frequently than perhaps I would wish, a requirement to enter data, or to have it entered for me, or with me (a collaborative effort).

The data can be simple enough, but not so simple that I would use a spreadsheet for the data entry task; a horizontally scrolling grid is poorly suited to the task, even with locked column headers.

Fully featured data entry programs exist, indeed I have written quite a few myself, but the paradigm there is that

  • one knows exactly what the data looks like beforehand, what the fields are, what the validation procedures are/should be … that is OK for large repetitive data entry jobs (but anyone who has been involved in real life with a data preparation room will know how rarely one’s assumed knowledge translates into practice, and how frequently special cases arise).

    Importantly, this is not an agile approach. Agile, as in “agile programming”, can be applied to data entry too: it means starting now, getting on with the job, and treating special cases at some later date, or being in a position to quickly adjust the framework so that “special needs” become “system features”.

  • a specialist data entry app has to be distributed, or deployed, with the concomitant issues of data definition setup, OS compatibility, version control, and synchronizing. This is a PITA.

    Typically I might have a few hundred or a few thousand records to get entered, they don’t fit some standard model, I have a few people available to me to do it, and I don’t want to have to write a new app, tweak an existing app, provide specs or INI files, or train the people.

    Agility rules OK. I want the job done in an elapsed time of a day or two or three, with minimal supervision from me.

Text Rules, OK? .. but TEXT IS NOT CSV

Well, if I don’t want a specialist data entry app and I don’t want a spreadsheet, and I do want zero-install .. what else do I want?

Well, I want the format to be “literate” .. that is, I want both the writer (data entry person) and the reader (the person using the data) to be able to read it. Easily. Like a book.

And I want the data to be enterable in ANY text editor. A specialist one if that is what people like, maybe a syntax highlighting editor (there are a couple of cool ones around, based on SynEdit and more recently on Scintilla/SciTE), or maybe Word or Notepad. Anything.

Not all data needs to be further processed. Maybe it just needs to sit there as a reference, to be read/reviewed/searched. Maybe it needs just minimal further processing for readability or publishing purposes (some sort of Tidy or XSLT if one must); maybe it needs validity checking.

OK, what formats do we know that are just text?

  • XML. Not an option .. the visual clutter of the markup tags makes it too hard for the data entry person.
  • SOX. No (it is only a minor simplification of XML). http://en.wikipedia.org/wiki/Simple_Outline_XML
  • JSON .. no thanks. A simplification, but again really it is designed for data exchange between machines.

And Data Entry is Not Data Exchange.

Maybe YAML?

YAML

Well, here it is.

Wikipedia http://en.wikipedia.org/wiki/YAML says that YAML is a “human-readable data serialization format that takes concepts from languages such as XML, C, Python, Perl, as well as the format for electronic mail as specified by RFC 2822”.

But YAML is better than that sounds. “Human-readable”? Well, so is XML.

YAML is actually Human Writeable.

btw YAML is an acronym for “YAML Ain’t Markup Language” and pronounced “yamel”. Like camel..

The Wikipedia article is OK, but have a look at the discussion in Symfony where it is embedded.

Symfony is much more than YAML: it is a web applications framework that embraces YAML, and one that could conceivably be used to build a distributed “natural” data entry program .. but let’s just stick to YAML, its principles and how we write it.

Well, look at how we read it first.

This is a typical YAML data item (in plain text).

house:
  family:
    name: Doe
    parents:
      - John
      - Jane
    children:
      - Paul
      - Mark
      - Simone
  address:
    number: 34
    street: Main Street
    city: Nowheretown
    zipcode: "12345"

In YAML, structure is shown through indentation, sequence items (as in, items in a collection) are denoted by a dash, and key/value pairs within a map are separated by a colon.

YAML also has a shorthand syntax to describe the same structure with fewer lines, where arrays are explicitly shown with [] and hashes with {}

house:
  family: { name: Doe, parents: [John, Jane], children: [Paul, Mark, Simone] }
  address: { number: 34, street: Main Street, city: Nowheretown, zipcode: "12345" }
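For readers who think in code, both forms describe the same nested structure. A minimal sketch of that structure as a Python literal follows; the variable name `data` is just for illustration, and exactly how a given YAML library would hand the values over is an assumption on my part.

```python
# The structure described by either YAML form, written as a Python literal.
# Maps become dicts and sequences become lists; the quoted zipcode stays a string.
data = {
    "house": {
        "family": {
            "name": "Doe",
            "parents": ["John", "Jane"],
            "children": ["Paul", "Mark", "Simone"],
        },
        "address": {
            "number": 34,
            "street": "Main Street",
            "city": "Nowheretown",
            "zipcode": "12345",
        },
    }
}

# Navigating the structure follows the indentation of the YAML text.
first_child = data["house"]["family"]["children"][0]
print(first_child)
```

The point of the comparison is how little of the punctuation above the data entry person has to type in the indented YAML form.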

This looks very WRITEABLE, particularly through line-oriented (as per the first example) blank “templates” that we copy and paste as many times as we like. It is also (presumably) extensible .. in the above example we could have as many children as we liked (including zero) .. but we could also add “fields” as we go along (in which case the final processing app, if there is one, will need to sort things out).

Compare this to a DTD or XSD (schema) driven approach, where the data description is assumed to be known and knowable up front.
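What “sorting things out” might mean in the final processing app can be sketched quite briefly. This is only an illustration under my own assumptions: the records are made up, they are assumed to have already been read into Python dicts, and `harmonize` is a hypothetical helper name.

```python
# Hypothetical records, as a reader might hand them over after parsing.
records = [
    {"name": "Doe", "city": "Nowheretown"},
    {"name": "Roe", "city": "Elsewhere", "phone": "555-0100"},  # field added mid-job
]


def harmonize(records, missing=None):
    """Give every record the same set of fields, in first-seen order."""
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    # Records that predate a field get the `missing` placeholder for it.
    rows = [{f: rec.get(f, missing) for f in fields} for rec in records]
    return fields, rows


fields, rows = harmonize(records)
print(fields)
print(rows[0])
```

Fields that appear part-way through the job simply become extra columns, and the earlier records show up with a gap that can be reviewed later.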

There is much more to YAML .. multiple documents/data items in a stream, for instance (separated by ---)

# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey

# Team ranking
---
- Chicago Cubs
- St Louis Cardinals

and much more info about it at http://yaml.org/spec/current.html (more than you will likely want to know). There is some more information about bindings and the grammar at LibYAML http://pyyaml.org/wiki/LibYAML
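Going back to the multi-document stream for a moment: splitting one into its documents needs very little machinery if we only accept a line consisting of three dashes as the separator. The sketch below is my own simplification, not the behaviour of the full spec: `split_documents` is a hypothetical name, and whole-line comments and blank lines are simply dropped.

```python
def split_documents(text):
    """Split a stream into documents on lines that contain only '---'.

    Returns a list of documents, each a list of content lines.
    Blank lines and whole-line '#' comments are discarded.
    """
    docs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if current:
                docs.append(current)
            current = []
        elif stripped and not stripped.startswith("#"):
            current.append(line.rstrip())
    if current:
        docs.append(current)
    return docs


stream = """\
# Ranking of 1998 home runs
---
- Mark McGwire
- Sammy Sosa
- Ken Griffey

# Team ranking
---
- Chicago Cubs
- St Louis Cardinals
"""

docs = split_documents(stream)
print(len(docs))
```

For a data entry job this means many records can live in one file, with the separator line acting as the “next record” key.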

R and YAML

There is a YAML package for R http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/YamlR , but as far as I can work out it is only for Unix, and it does have a dependency on Syck http://whytheluckystiff.net/syck/ (which may or may not be under development; Wikipedia seems to think not).

In Conclusion:

The YAML concept appears attractive, given that I want to use text-based, literate and flexible/agile data entry without a specialist data entry app.

But a full implementation of the YAML spec or the Syck version is not without its issues imho, and I would probably opt for allowing only a cut down version of the syntax in any micro-YAML data reader.
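To give an idea of how small such a cut down reader could be, here is a sketch in Python. Everything about it is an assumption of mine rather than part of any YAML library: the name `parse_micro_yaml`, and the deliberately tiny subset it accepts (maps nested by space indentation, 'key: value' scalars, and '- item' lists of scalars indented under their key). The {} and [] shorthand, multi-line text and type conversion are all left out, so every value comes back as a string.

```python
def parse_micro_yaml(text):
    """Parse a small YAML-like subset into nested dicts and lists of strings."""
    # Keep (indent, content) for each non-blank, non-comment line.
    lines = []
    for raw in text.splitlines():
        content = raw.strip()
        if content and not content.startswith("#"):
            lines.append((len(raw) - len(raw.lstrip(" ")), content))

    def unquote(s):
        # Strip one pair of matching quotes, if present.
        if len(s) >= 2 and s[0] == s[-1] and s[0] in "\"'":
            return s[1:-1]
        return s

    def parse_block(pos, indent):
        """Parse the block at `indent` starting at `pos`; return (value, next_pos)."""
        if lines[pos][1].startswith("- "):
            items = []
            while (pos < len(lines) and lines[pos][0] == indent
                   and lines[pos][1].startswith("- ")):
                items.append(unquote(lines[pos][1][2:].strip()))
                pos += 1
            return items, pos
        mapping = {}
        while pos < len(lines) and lines[pos][0] == indent:
            key, sep, rest = lines[pos][1].partition(":")
            if not sep:
                raise ValueError("expected 'key: value', got %r" % lines[pos][1])
            rest = rest.strip()
            pos += 1
            if rest:
                mapping[key.strip()] = unquote(rest)
            elif pos < len(lines) and lines[pos][0] > indent:
                # A key with nothing after the colon owns the deeper block below it.
                mapping[key.strip()], pos = parse_block(pos, lines[pos][0])
            else:
                mapping[key.strip()] = None
        return mapping, pos

    if not lines:
        return None
    value, pos = parse_block(0, lines[0][0])
    if pos != len(lines):
        raise ValueError("unexpected indentation at: %r" % lines[pos][1])
    return value


sample = """\
house:
  family:
    name: Doe
    parents:
      - John
      - Jane
  address:
    number: 34
    zipcode: "12345"
"""

result = parse_micro_yaml(sample)
print(result["house"]["family"]["parents"])
```

Rejecting anything outside the subset with an error, rather than guessing, seems the right trade-off for data entry: the person typing finds out early, and the format stays teachable in five minutes.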

YAML does seem to be most active in the agile languages area (see for example the Yaml Cookbook http://yaml4r.sourceforge.net/cookbook/ at the YamlForRuby site).


before “SuperCrunchers” .. The Little Blue Book That Beats The Market ..?

I was going to write a blog post on “SuperCrunchers” .. that is, once I had decided whether to laugh, cry or shrug (and I will write that post when I decide) … having recently been alerted to the book’s existence courtesy of John Maindonald’s (ANU) posting on, I think, STAT-L or ANZSTAT, and then via Chris Lloyd’s Fishing-In-The-Bay.

So, I duly got the book from Amazon - who, btw for those who are interested in recommender systems, did indeed manage to on-sell me another couple of books, neither of which I would have bought had I perused them in a bookshop [and thereby no doubt hangs another tale] - and sat down to write this post.

But then I remembered another book somewhat in the genre of “get some data, analyze them, get rich/wise/popular” if one may be permitted to characterize “crunching” as such, that I had promised to discuss.

It’s about “stockmarket investing” (perhaps the more cynical might say that almost qualifies as an oxymoron).

And so I will start this blog with a bit of an apologia, a nod to the prevailing attitudes and collective wisdom of my readers.

Most statisticians that I know won’t go near the stock market.

Maybe they have themselves proved (or had proved) to their satisfaction that the amount of predictability in historical information about prices is too low to make it worth the transaction costs and the risks (a sensible position), maybe they distrust the historical data (again, sensible), maybe they just have better things to do with their time: they are by and large pragmatic people, not given to searching for the chimerical magic formula.

Actually there are a few statisticians with a somewhat less dismissive attitude: I quote from West and Harrison “Bayesian Forecasting and Dynamic Models” p 46

Often, when applied to series such as daily share, stock and commodity prices, the appropriate values of V in a first-order model may appear to be close to zero .. This feature has led many economic forecasters to conclude that such series are purely random ..Such conclusions are quite erroneous and reflect myopic modelling; clearly the model is restricted and inappropriate as far as forecasting is concerned. It is rather like looking at a cathedral through a microscope and concluding that it has no discernable form. In other words, other models and other ways of seeing are required.

Be that as it may, I have a passing data-analytical interest in the market.

There are huge data management and data quality issues, and lots of statistical and data visualization challenges.

So, a friend who knows of this foible asked me to take a look at “The Little Book That Beats the Market” (also at http://magicformulainvesting.com/)

Well, OK, it’s an easy read .. if you have nothing much better to do, you can read the first 100 or so pages, or you can skip to p138 for “the magic formula”, with a more or less English language explanation of it on p53.

It’s a stock screening approach using just 2 factors (that’s probably good, Occam’s Razor and 1R etc), with the objective of buying “good businesses at bargain prices”.

The two factors are ranked return on capital and ranked earnings yield. The ranks are summed (which is a marginally odd thing to do), and those companies that have a combined rank in the top 30 are included in the portfolio. After a year or so they are more or less unceremoniously dumped out of the portfolio (again, an odd thing to do, without further rationale), and a new lot selected. This strategy yields, supposedly, an annual return of 30% compared to an average market return of 12%.
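The mechanics of the screen, as I have just described them, fit in a few lines. This is a sketch of my description only, not of the book’s actual procedure: the five companies and their figures are invented, the function names are hypothetical, and the book’s precise definitions of the two factors and of the universe of stocks are not reproduced here.

```python
# Hypothetical companies with made-up figures, purely for illustration.
companies = {
    "A": {"return_on_capital": 0.30, "earnings_yield": 0.04},
    "B": {"return_on_capital": 0.40, "earnings_yield": 0.15},
    "C": {"return_on_capital": 0.10, "earnings_yield": 0.13},
    "D": {"return_on_capital": 0.35, "earnings_yield": 0.12},
    "E": {"return_on_capital": 0.05, "earnings_yield": 0.06},
}


def rank_by(companies, factor):
    """Rank companies on one factor; rank 1 is the highest value."""
    ordered = sorted(companies, key=lambda name: companies[name][factor], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def select_portfolio(companies, size=30):
    """Sum the two ranks and keep the `size` companies with the lowest totals."""
    roc = rank_by(companies, "return_on_capital")
    ey = rank_by(companies, "earnings_yield")
    combined = {name: roc[name] + ey[name] for name in companies}
    return sorted(companies, key=lambda name: combined[name])[:size]


portfolio = select_portfolio(companies, size=2)
print(portfolio)
```

Written out like this, the oddity of summing ranks is easier to see: a company that is first on one factor and fifth on the other scores the same as one that is third on both.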

The data on which this portfolio selection strategy was developed and assessed is Compustat’s “Point in Time” database which goes back 17 years and supposedly includes all companies (including those that later disappeared), so there is an attempt at least to avoid survivorship bias. That database is not cheap (I think about $80,000) so I guess few of us will trawl through it to confirm the accuracy of the results.

There is more information on page 146 about how other biases were avoided, and it is claimed that the two factors finally employed were the first two considered (as a defense against accusations of excessive data dredging). There is a comparison (favorable) of the performance of “the magic formula” against a 71 factor model.

OK, what do I think ?

a) the book would not have been published if the results were not good

b) the guy should not have published the book if the results were that good (i.e. he could have used it to make money himself)

c) no matter how well it is done, it is still an EMPIRICAL approach. I for one would not trust estimates of the longest run of negative returns based purely on the observed results: the downside could be much, much worse than that

d) it is a very long term approach. The world is not a deterministic place .. the ground rules could easily change in the next 17 years. It would take a stubborn individual indeed to stick to a strategy published in 2006 until 2023, no matter what: so, if such individuals don’t exist, what point is there in a “magic formula”?

However, it does seem plausible that portfolio construction and adjustment is sensible and that anything along those lines (even the magic formula) is better than nothing.

I refer you, for your interest, to “A Simple but Revolutionary Investment Idea” ( http://econlog.econlib.org/archives/2006/06/a_simple_but_re.html )

Fundamental indexation means that each stock in a portfolio is weighted not by its market capitalization, but by some fundamental metric, such as aggregate sales or aggregate dividends. Like capitalization-weighted indexes, fundamental indexes involve no security analysis but must be rebalanced periodically by purchasing more shares of firms whose price has gone down more than a fundamental metric, such as sales, and selling shares in those firms whose price has risen more than the fundamental metric.

…According to my research, dividend-weighted indexes outperform capitalization-weighted indexes and are particularly valuable at withstanding bear markets. For example, the Russell 3000 Index lost almost 50% of its value between the bull market peak of March 2000 and the October 2002 low. Over this same period, a comparable total market dividend-weighted index was virtually unchanged. A dividend weighted index did have a bear market, but it only corrected by 20%. Moreover, the dividend-weighted index bear market didn’t start until March 2002, and it lasted only six months (compared to 24 months for the cap-weighted index). The dividend-weighted index is now about 40% above its March 2000 close, whereas the S&P 500 and Russell 3000 are still not yet back to even. A similar performance occurred in other bear markets.

The historical data make an extremely persuasive case for fundamental indexing. From 1964 through 2005, a total market dividend-weighted index of all U.S. stocks outperformed a capitalization-weighted total market index by 123 basis points a year and did so with lower volatility.

Sounds OK I think, but maybe the comparison base is a bit of a straw man.
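For what it is worth, the weighting idea in the quoted passage is simple to state in code. The sketch below uses three invented stocks and a hypothetical `weights` helper; it shows only how the two sets of portfolio weights differ, and says nothing about returns.

```python
def weights(values):
    """Normalize a per-stock metric into portfolio weights that sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical figures, purely for illustration.
market_cap = {"X": 600.0, "Y": 300.0, "Z": 100.0}
dividends = {"X": 10.0, "Y": 20.0, "Z": 10.0}

cap_weighted = weights(market_cap)      # weight follows the price the market puts on the firm
dividend_weighted = weights(dividends)  # weight follows a fundamental metric instead

for name in market_cap:
    print(name, round(cap_weighted[name], 2), round(dividend_weighted[name], 2))
```

Periodic rebalancing then amounts to recomputing the weights from the latest dividends and trading back to them.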

The larger point, I think, and maybe one more directly relevant to the “SuperCrunchers” phenomenon, is that when the elevator boys start “crunching” gigabytes of historical data of dubious provenance, quality and relevance, and producing “plausible stories” (“supported by the data”) then perhaps the rest of us should retire to our caves and consider how we might promote common sense, good theory and sound procedures (which, to a degree, the author of “The Little Book..” did, thereby adding a layer of plausibility).
