Monday, 29 August 2011

modelling reality

Lord of the Rings in three volumes
I first came across the difficulty of modelling reality when I designed a database for my books. This was back in the early 1980s when I was a physicist, and had barely even heard the word "database" before I got ViewStore for my BBC Micro. What's a book? Easy, you might think. But what about Ace Doubles (two novels in one binding)? Or a novel in three volumes? Or the same book with a variant title? Or a new edition? Or a translation? Or an abridgement? Or a book published over several editions of a magazine? Or...

various alphabets
The next big realisation was alphabets. In the 1990s I was involved in the international standardisation of the Z formal specification language. We needed to standardise the character set, which included lots of mathematical symbols. And our Japanese colleagues wanted non-Latin alphabets. How hard could it be to allow different alphabets? So I went off to read the Unicode Standard. Oh.

I was reminded of these issues when I saw a post by Charlie Stross discussing the post "Falsehoods Programmers Believe About Names" by Patrick McKenzie. People's names are difficult to model.

So is time. Different calendars. Calendars with negative and positive years, but no year zero (which is why some people ended up celebrating the 2500th anniversary of the Battle of Marathon a year too early). Changing calendars from Julian to Gregorian -- give us back our 11 days (of tax payments, that is). Changing calendars at different times in different countries. Leap years. The algorithm for leap years (and the difference between the Julian and Gregorian algorithms). The argument about whether the year 2000 should be a leap year. Time zones. Changing time zones. Summer time (aka daylight saving time). Summer time coming and going on different days in different countries. Double summer time. Leap seconds. And so on.
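Two of those traps are small enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, years are plain integers with negative meaning BC, and the Julian rule is applied to AD years only.

```python
def is_leap_julian(year: int) -> bool:
    """Julian rule: every fourth year is a leap year (AD years only here)."""
    return year % 4 == 0


def is_leap_gregorian(year: int) -> bool:
    """Gregorian rule: century years are leap years only if divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def years_between(start: int, end: int) -> int:
    """Elapsed years between two years in BC/AD numbering.

    Negative years are BC, positive years are AD, and there is no year
    zero: 1 BC is followed directly by AD 1.
    """
    if start == 0 or end == 0:
        raise ValueError("there is no year zero in BC/AD numbering")
    elapsed = end - start
    if start < 0 < end:
        elapsed -= 1  # the naive subtraction counted a year zero
    elif end < 0 < start:
        elapsed += 1
    return elapsed


# 1900 is where the two leap-year rules disagree; 2000 is where they agree.
print(is_leap_julian(1900), is_leap_gregorian(1900))  # True False
print(is_leap_julian(2000), is_leap_gregorian(2000))  # True True

# Battle of Marathon, 490 BC: adding 2500 to -490 gives 2010, a year early.
print(years_between(-490, 2010))  # 2499
print(years_between(-490, 2011))  # 2500
```

Even this ignores which calendar a given country was using in a given year, which is a lookup table rather than a formula.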

One of the comments in Charlie's post references "Gay marriage: the database engineering perspective", a post with an interesting analysis of marriage database design (ignoring the name problem), and how some designs make changes harder than they need to be.

Notice the continual use of "ids" in that post. Names (even if we decide how to model them adequately) are not unique, and so are not suitable for an identifier. What properties might make a good unique identifier? I recall hearing of a new police database that used "Surname, initial, date of birth" as a unique id. The story goes that this database was installed a few days before the Kray twins were arrested...
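The collision in that story is easy to reproduce. Here is a minimal sketch, assuming a hypothetical `Person` record and a hypothetical `natural_key` function; I have no idea how the real police database was built, so this only illustrates the idea of "surname, initial, date of birth" as a key, next to an arbitrary generated id.

```python
import itertools
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    surname: str
    forenames: str
    date_of_birth: date


def natural_key(person: Person) -> tuple:
    """Hypothetical "unique" id: surname, first initial, date of birth."""
    return (person.surname, person.forenames[0], person.date_of_birth)


ronnie = Person("Kray", "Ronald", date(1933, 10, 24))
reggie = Person("Kray", "Reginald", date(1933, 10, 24))

# Keyed on the natural key, the second twin silently overwrites the first.
by_natural_key = {}
for person in (ronnie, reggie):
    by_natural_key[natural_key(person)] = person

# Keyed on a meaningless counter (a surrogate id), both records survive.
next_id = itertools.count(1)
by_surrogate_id = {next(next_id): person for person in (ronnie, reggie)}

print(len(by_natural_key))   # 1
print(len(by_surrogate_id))  # 2
```

The surrogate id says nothing about the person, which is exactly why it cannot clash with, or be contradicted by, anything true about them.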

In fact, no attribute is immutable, and so none should be used in such an identifier. The same person who told me the Kray twins story also told me of the problems a hospital had in assuming that the "sex" entry was immutable when they did their first sex-change operation.

Wait a minute. No attribute is immutable? What about "date of birth" (dob)? How can that change? Well, remember this database information is a model of reality, not reality itself. Models can have errors. The dob might have been mis-entered, or been lied about (when my maternal grandmother died, the family discovered from her birth certificate that she was 10 years older than she had let on to her children), or it might simply be unknown.
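One way to make room for this in a design is to let every descriptive field be correctable or unknown, and to hang identity on something that describes nothing. The sketch below is an assumption-laden illustration: `PersonRecord`, `record_id` and `correct_date_of_birth` are names I have made up, and the dates are invented.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PersonRecord:
    name: str
    # None means "unknown", which is different from any particular date.
    date_of_birth: Optional[date] = None
    # Hypothetical surrogate id: it carries no meaning, so no later
    # discovery about the person can make it wrong.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex, init=False)


def correct_date_of_birth(record: PersonRecord, new_dob: Optional[date]) -> None:
    """Fix the model when it turns out to disagree with reality."""
    record.date_of_birth = new_dob


# Invented dates, for illustration only.
record = PersonRecord("A. Grandmother", date(1910, 5, 1))
original_id = record.record_id

# The birth certificate turns up: the model was ten years out.
correct_date_of_birth(record, date(1900, 5, 1))

print(record.record_id == original_id)  # True
```

Had the date of birth been part of the identifier, that correction would have changed the identity of the record, along with every reference to it.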

How I killed Pluto
So ... if modelling designed (social) reality is this hard, why are we surprised that it is hard to model the natural world? If our own names, dates, whatever, refuse to fit some neat classification system, why should biology, geology, astronomy? Is a virus alive or not? Is something a separate species or not? Is Pluto a planet or not? When reality doesn't fit our classification, the fault lies not with reality, but with the classification system.
