Calibrate your irony meter: Everyone acknowledges that a high-fidelity data model is the result of hard work, yet most “best practices” of data modeling apply only to models that are already quite good. Shouldn’t best practices apply to models that suck?
More generally (and less dramatically), shouldn’t best practices acknowledge the early and intermediate stages of creating a data model? Isn’t that when a modeler really needs to be at his or her best—while working on a model that is poor and needs plenty of improvement? More than at any other time, I benefit from best practices when the wheels are falling off the bus: when I am confused, the users are irritated, and the in-progress model seems only to highlight the absence of a consensus about the data phenomenon the users purportedly agree upon.
In an earlier post, I alluded to this with these words: “Best practices in the kitchen focus on what happens at the stove, not merely how things are arranged on the plate. Why should modeling be any different?”
Alas, I see any number of best practices that ignore the nitty-gritty reality of producing a model with users. Today I’ll focus on one in particular: the oft-cited canard that a conceptual model does not require identifiers. And I do mean oft-cited; David Hay repeated this rule just last month in a discussion in another data-modeling forum:
“…adding an identifier composed of attributes and/or relationships can be done, but is not necessary in a conceptual model.”
I wish I lived in such a well-behaved universe, but I don’t and neither do you. Identifiers must be part of conceptual modeling for several reasons. First, identifiers are part of the user experience of information. Users employ identifiers to distinguish category members from each other. Many software professionals believe otherwise, holding that identifiers are exclusively artifacts of software. That couldn’t possibly be true, because identifiers predate computers. Think of license plates. Come to think of it, license plates not only predate computers, they predate automobiles; the first license plates were used for bicycles.
As if that weren’t enough—and it ought to be because during conceptual data modeling, honoring the user experience is more important than everything else—identifiers can uncover and remedy the homonym problem, which arises frequently during the early and intermediate stages of conceptual modeling. This is not merely a pleasant side effect of identifiers; it helps to establish consensus about the meaning of business terms, which is one of the primary goals of conceptual data modeling.
The homonym problem occurs when two different categories are given the same name. An example:
- “Which flights have profit margins above three percent?”
- “Which flights were cancelled because of Hurricane Galinda?”
With these two questions, users are employing one word (“flight”) to refer to two categories:
- The category whose members include these two:
  - Flight 877, daily from Tokyo to San Francisco
  - Flight 295, weekdays from Boston to Paris
- The category whose members include these three:
  - Flight 877 on Tuesday 06 March 2012
  - Flight 877 on Wednesday 07 March 2012
  - Flight 295 on Tuesday 06 March 2012
In this case, the word “flight” is a homonym: a single word with multiple meanings. This is a very common linguistic phenomenon, and it almost always sows confusion. Including identifiers as a fundamental, non-optional part of conceptual models goes a long way toward uncovering and remedying the problem. For example, the two meanings of the word “flight” would have separate identifiers that would make the differences between the categories manifestly obvious. If the two categories are already evident on the draft model as two entities, either of which could be named “flight,” the use of semantically meaningful identifiers would clarify which is which. If the draft data model shows only one entity named with the overloaded word “flight,” the discussion about candidate identifiers can reveal the homonym problem and the attendant need for two separate entities.
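To make the difference concrete, here is a minimal Python sketch of the two categories, using the sample members listed above. The entity names `ScheduledFlight` and `FlightDeparture` are my own hypothetical labels, as are the attribute names; a real modeling session would settle on whatever names the users agree to. The point is only that each category needs a different identifier.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScheduledFlight:
    """A recurring service (hypothetical name). Identified by flight number alone."""
    flight_number: int  # identifier
    schedule: str
    origin: str
    destination: str


@dataclass(frozen=True)
class FlightDeparture:
    """One dated occurrence of a service (hypothetical name).
    Identified by flight number plus date."""
    flight_number: int    # identifier, part 1
    departure_date: date  # identifier, part 2


scheduled = [
    ScheduledFlight(877, "daily", "Tokyo", "San Francisco"),
    ScheduledFlight(295, "weekdays", "Boston", "Paris"),
]

departures = [
    FlightDeparture(877, date(2012, 3, 6)),
    FlightDeparture(877, date(2012, 3, 7)),
    FlightDeparture(295, date(2012, 3, 6)),
]

# Collect the identifier values for each category.
scheduled_ids = {f.flight_number for f in scheduled}
departure_ids = {(d.flight_number, d.departure_date) for d in departures}

# Flight number alone cannot tell the three departures apart.
departure_numbers_only = {d.flight_number for d in departures}
```

In this sketch, the flight number distinguishes the two scheduled flights, but it collapses the three departures into two values. Only the combination of flight number and date keeps all three departures distinct, which is what signals that the two uses of “flight” name different categories.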
The typical rhetoric disputing the need for conceptual identifiers is easily refuted:
Claim: You don’t need conceptual identifiers because they are artifacts of software.
- Rebuttal: Identifiers are an information phenomenon, not a technology phenomenon.
Claim: You don’t need identifiers because users, when forced to be explicit, will avoid the homonym problem all by themselves.
- Rebuttal: There are many counterexamples of this, including the U.S. Supreme Court case Gutierrez v. Ada, in which a homonym problem involving multiple meanings of the word “election” caused ambiguity in the vote-counting process, even though that process was carefully designed and formally expressed in legislation that sought to eliminate ambiguity.
Claim: You don’t need identifiers because natural linguistic context will resolve ambiguities.
- Rebuttal: Context can resolve ambiguities between widely disparate concepts, such as flight (of stairs) and flight (of an airplane) and flight (of a fugitive). But on a data model, most pairs of homonym-candidate entities are not merely close, but adjacent, separated by a one-many relationship. This occurs especially often for planned vs. actual phenomena and type vs. instance phenomena.
One other claim I sometimes hear: You don’t need identifiers until it becomes obvious that you do need them. After the homonym problem arises, start insisting on identifiers. That is a risky bit of business. Identifiers don’t merely help to remedy the homonym problem; they help detect it. Without identifiers, many instances of the homonym problem will not be obvious until it is too late and you find yourself subject to the vagaries of post-deployment data-integration programs or even (shudder) the judicial branch.
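As a rough illustration of how identifiers help with detection, the sketch below tests a candidate identifier against sample members that users might offer for a single entity named “flight.” The function `distinguishes` and the attribute names are hypothetical, invented for this example; in practice this check happens in conversation with users, not in code.

```python
from datetime import date


def distinguishes(members, candidate):
    """Return True if the candidate identifier (a tuple of attribute names)
    gives every sample member a distinct value."""
    seen = set()
    for member in members:
        key = tuple(member[attr] for attr in candidate)
        if key in seen:
            return False  # two members share this value
        seen.add(key)
    return True


# Sample members users might offer for one entity called "flight".
flights = [
    {"flight_number": 877, "date": date(2012, 3, 6)},
    {"flight_number": 877, "date": date(2012, 3, 7)},
    {"flight_number": 295, "date": date(2012, 3, 6)},
]

number_only = distinguishes(flights, ("flight_number",))
number_and_date = distinguishes(flights, ("flight_number", "date"))
```

Here the flight number alone fails to distinguish the members, while flight number plus date succeeds. A failure like that is the cue to ask users whether “Flight 877” by itself is also something they track, which is how the second entity comes to light before deployment.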
-Joe Maguire
Co-author, Mastering Data Modeling: A User-Driven Approach
Some disclosure: Blog posts here will be written by me or my colleague Peter O’Kelly. Although Embarcadero will compensate us for these posts, we are solely responsible for their content. (Proof: We are unconstrained. The best practices offered here might or might not align with what you’ll find elsewhere on the ER/Studio site, in ER/Studio documentation, or in Embarcadero-sponsored whitepapers.)