
Datatypes R2 Issue 29


Data Types Issue 29: PPD<T> issues

Introduction

PPD<T>: Revamp PPD to address two interests in the PPD data type extension. One group of people (Paul Schluter, GE, from the LABPOCT SIG and the IEEE/MIB group) wants more distribution types and more parameters.

Another group, represented by many, many casual users, needs a simple confidence-interval-like form. These are the people who now prefer to use IVL<T> rather than PPD because PPD seems alien to them.

The proposal is to revamp PPD (which had been left at status "informative" for the ballot) and accomplish both of the goals above: review and extend the distributions and parameterization model, and at the same time add a confidence-interval form and potentially a percentile form. The goal is to make PPD/confidence interval and IVL compatible types with the same XML rendition. That way we can gently steer people toward using PPD for ranges with "pick one" semantics and away from IVL, which has "all of those" semantics.

Auxiliary: Is it possible, then, to construct a PPD<RTO>?

Tons of issues here. I have had extensive exchanges with Paul Schluter about shortcomings. I think we need to do two things: (1) correct the parameter specifications, (2) consider adding (or including) a confidence interval notation. This also relates to the disambiguation of IVL and URG. Gschadow 21:58, 11 January 2007 (CST)


Question: is this backward compatible?

Original Email Thread

Paul

Tuesday, June 11, 2002,

Gunther and colleagues,

Thank you for your thoughtful reply. I agree with you that describing PPDs brings up a number of complex issues.

Before considering the five options you proposed in your email, there is a significant observation about the HL7 V3 PPDs: many of the proposed HL7 PPDs require more than two parameters, especially regarding their 'location' and 'scale'. Except for the PPDs U, TRI, and N, the ability to specify the 'location' (a) and 'scale' (b) of a 'standard' probability distribution

f(x;a,b) = (1/b)f((x-a)/b;0,1)

is required for a completely general specification of a PPD suitable for modeling purposes.
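
(A hedged aside, not from the original email: the 'location'/'scale' convention described here is the same one scipy.stats exposes through its loc and scale keyword arguments, so the relation can be checked directly; scipy is used purely for illustration.)

    # Sketch: the location/scale relation f(x;a,b) = (1/b) f((x-a)/b; 0,1),
    # checked against scipy's loc/scale convention for the normal distribution.
    from scipy import stats

    a, b = 20.0, 2.0                              # 'location' and 'scale'
    x = 21.0

    lhs = stats.norm.pdf(x, loc=a, scale=b)       # general pdf f(x; a, b)
    rhs = stats.norm.pdf((x - a) / b) / b         # (1/b) * f((x-a)/b; 0, 1)
    print(lhs, rhs)                               # both ~0.1760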

With this in mind, I have listed the HL7 V3 PPDs with all the parameters that would be required to completely model the PPD. I have used the terminology (including 'location' and 'scale') based on the NIST on-line engineering handbook at http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm

The first group, {U, TRI, N}, provides a complete specification, including the ability to specify the 'location' and 'scale':

(null): mu, sigma
U: min, max
TRI: min, mode, max (proposed TRIangular PPD)
N: mu, sigma

The second PPD group, {LN, G, E}, requires the addition (#) of either the 'scale' or 'location' to provide complete generality:

LN: theta, sigma, scale#
G: shape, scale, location#
E: scale, location#

The third PPD group, {X2, T, F, B}, requires the addition (#) of both 'scale' and 'location':

X2: nfree, scale#, location#
T: nfree, scale#, location#
F: nfree1, nfree2, scale#, location#
B: alpha, beta, scale#, location#

I have also attached a MS-Word table (based on an earlier HL7 V3 PPD specification) that shows this information. This is not a final version of a proposal, but just something to get everyone talking about what we should do.


RECOMMENDATION

Based on the observations noted above, I recommend that we adopt your proposed Option #1, where we describe the PPDs in terms of their native parameters, rather than just 'mean' and 'stdev'.

The reasons for this recommendation are:

1. Many PPDs require more than two parameters to specify them,
   especially when 'location' and 'scale' are considered.

2. This allows data sources to represent their results in the
   most clear and accurate manner possible.  For example, a
   triangular distribution provides a very simple and useful
   representation of a reported value (presumably the mode)
   as well as the lower and upper limits.

3. Recipients of this data can easily convert it to a 'mean'
   and 'stdev' representation, which is what we have today.
   More sophisticated systems could calculate intermediate
   probabilities either symbolically or numerically using
   Monte-Carlo and other techniques.


NEXT STEPS

If this basic proposal is acceptable to everyone, the next topics to consider include ...

1. NOMENCLATURE
   - Do we use 'mean', 'stdev' and other English labels?
   - Do we use spelled-out Greek letters?
   - Consider the terminology used in the on-line NIST handbook?
   - Is there another terminology standard we can use?

2. OTHER FEATURES
   - Support representation using confidence intervals?
   - Zero PPD outside of [minclip; maxclip] interval?
   - Support alternative {mean, mode, median} representation?

3. HL7 V3 STANDARD CONTENT
   - Include equations of PPDs? [highly recommended]
   - Include conversions to simpler 'mean' and 'stdev'?

Thanks again for your time and thoughtful review!

Regards,

Paul Schluter

Gunther

Paul, thank you so much for your input. This is really a very useful discussion. In the things below I may appear defensive, but I am not. I just want to understand. There is one key point that I hold up high, that's really extremely crucial, and that is the gracefulness towards those receivers who find a simple approximation enough for their purpose and who do not want to deal with the specifics of the distribution types.


Schluter, Paul (MED, GEMS-IT) wrote:

> Before considering the five options you proposed in your email,
> there is a significant observation about the HL7 V3 PPDs:
> many of the proposed HL7 PPDs require more than two parameters,
> especially regarding their 'location' and 'scale'.  Except for
> the PPDs U, TRI, and N, the ability to specify the 'location' (a)
> and 'scale' (b) of a 'standard' probability distribution
>
>     f(x;a,b) = (1/b)f((x-a)/b;0,1)
>
> is required for a completely general specification of a PPD
> suitable for modeling purposes.

I thought we had dealt with that issue of transforming the standard distribution which includes:

- translation of the origin (mean) to the right magnitude and
  unit of measure.

- scaling of the unit to the right magnitude and unit of measure.

The mean and standard deviation parameters accomplish this. Here is the paragraph in the V3DT unabridged spec:

 "Probability distributions are defined over integer or real
  numbers and normalized to a certain reference point (typically
  zero) and reference unit (e.g., standard deviation = 1).  When
  other quantities defined in this specification are used as base
  types, the mean and the standard deviation are used to scale
  the probability distribution.  For example, if a PPD<PQ> for
  a length is given with mean 20 ft and a standard deviation of
  2 in, the normalized distribution function f(x) that maps a real
  number x to a probability density would be translated to f'(x')
  that maps a length x' to a probability density as
      f'(x') = f((x' - mu ) / sigma).


This is terribly similar to your formula above, and you're probably right that we have to divide f(x) by sigma again to get it correct. But I wonder whether you had considered, up to this point, that we are in fact already addressing this issue?
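
(A hedged aside, not part of the email: a quick numeric check of the missing 1/sigma factor; without it the rescaled density integrates to sigma instead of 1.)

    # Sketch: f'(x') = f((x'-mu)/sigma) alone does not integrate to 1;
    # the extra 1/sigma factor restores the unit area.
    import math

    def f(x):                                # standard normal density
        return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

    mu, sigma, dx = 20.0, 2.0, 0.01
    xs = [mu - 10.0 + i * dx for i in range(2000)]

    area_without = sum(f((x - mu) / sigma) * dx for x in xs)
    area_with    = sum(f((x - mu) / sigma) / sigma * dx for x in xs)
    print(round(area_without, 2), round(area_with, 2))   # ~2.0 (= sigma) vs ~1.0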


> With this in mind, I have listed the HL7 V3 PPDs with all the
> parameters that would be required to completely model the PPD.
> I have used the terminology (including 'location' and 'scale')
> based on the NIST on-line engineering handbook at
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
>
> The first group, {U, TRI, N}, provides a complete specification,
> including the ability to specify the 'location' and 'scale':
>
> (null): mu, sigma
> U: min, max
> TRI: min, mode, max (proposed TRIangular PPD)
> N: mu, sigma


If you take TRI out for a moment (which is what sparked this whole thing), you are basically agreeing that the two "standard" parameters do the job. O.K., so we are on the same page.


> The second PPD group, {LN, G, E}, require the addition (#) of
> either the 'scale' or 'location' to provide complete generality:
>
> LN: theta, sigma, scale#


Hmm, in your Word table it looks like you did buy the mu_log, sigma_log idea, which would take care of the translation and the scaling. It looks to me that theta is really doing the translation, and m and sigma (in the NIST reference) are doing the scaling work.

But there is something going on here with the split between sigma and m; I can't quite grasp what it is. This needs review to relate it to the mu_log and sigma_log stuff that is in the table now.

> G: shape, scale, location#


Here I have the same question. From my source [Mendenhall et al. Mathematical statistics with applications] I have the gamma distribution defined as:

        /  x^(alpha-1) * e^(-x/beta)
        |  -------------------------   ; alpha, beta > 0; x > 0
f(x) = <   beta^alpha * Gamma(alpha)
        |
        \  0                           ; otherwise


and it goes on to say that

  mu = alpha * beta

and

  sigma^2 = alpha * beta^2.

So, shouldn't I then be able to specify the gamma distribution from only knowing mu and sigma?

Why do I need a third parameter? It seems to me that it would be dependent.

But maybe the answer is that this only describes the standard Gamma distribution, and when I have used up mu and sigma to derive alpha and beta I don't have anything left to do the scaling and translation.... hmm, that's probably it.
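
(Aside, not from the email: the inversion being reasoned through here works out as follows for the standard gamma; the function name is illustrative only.)

    # Sketch: inverting mu = alpha*beta and sigma^2 = alpha*beta^2
    # for the standard gamma distribution.
    def gamma_params_from_moments(mu, sigma):
        beta = sigma * sigma / mu     # since sigma^2 = alpha*beta^2 = mu*beta
        alpha = mu / beta             # since mu = alpha*beta
        return alpha, beta

    print(gamma_params_from_moments(6.0, 3.0))   # (4.0, 1.5): 4*1.5 = 6, 4*1.5^2 = 9

So for the standard gamma the two moments do pin down alpha and beta, which is exactly why nothing is left over afterwards for a separate scaling and translation.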


> E: scale, location#


Here you have only two parameters, so mu and sigma are enough to derive those.


> The third PPD group, {X2, T, F, B}, requires the addition (#)
> of both 'scale' and 'location':
>
> X2: nfree, scale#, location#
> T: nfree, scale#, location#
> F: nfree1, nfree2, scale#, location#


O.K., here are the degrees of freedom. And I get the same issue as above with Gamma. But I can see that when I have used mu and sigma for the transformation I still might need nfree as an extra one.


> B: alpha, beta, scale#, location#

Beta should have the same issues as Gamma. Wonder why you need one more parameter?
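
(Aside, not from the email: the same moment argument applies to the standard Beta on [0, 1]; on that fixed support the mean and variance determine alpha and beta, so the extra parameters in the table are there to move and stretch the support rather than add shape freedom.)

    # Sketch: method-of-moments inversion for the standard Beta(alpha, beta) on [0, 1].
    def beta_params_from_moments(mu, var):
        common = mu * (1.0 - mu) / var - 1.0
        return mu * common, (1.0 - mu) * common   # (alpha, beta)

    print(beta_params_from_moments(0.25, 0.0375))   # (1.0, 3.0)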


> RECOMMENDATION
>
> Based on the observations noted above, I recommend that we adopt
> your proposed Option #1, where we describe the PPDs in terms of
> their native parameters, rather than just 'mean' and 'stdev'.
>
> The reasons for this recommendation are:
>
> 1. Many PPDs require more than two parameters to specify them,
>    especially when 'location' and 'scale' are considered.
>
> 2. This allows data sources to represent their results in the
>    most clear and accurate manner possible.  For example, a
>    triangular distribution provides a very simple and useful
>    representation of a reported value (presumably the mode)
>    as well as the lower and upper limits.
>
> 3. Recipients of this data can easily convert it to a 'mean'
>    and 'stdev' representation, which is what we have today.
>    More sophisticated systems could calculate intermediate
>    probabilities either symbolically or numerically using
>    Monte-Carlo and other techniques.


Paul, the one basic concern I have with this is that it doesn't allow the receiver to be "distribution ignorant." This will not work in practice. This forces all receivers to understand all the distributions.

I am certainly not against doing what is right, and it seems like the assumption that all parameters could be derived from mu and sigma was mistaken (I'm still not sure, but my intuition says that there is a point). So, we definitely must allow for more parameters.

However, I think your point 2 is less of a strong argument than it seems to be. The question, regarding the distribution ignorant receiver, is this:

  Who carries the burden of making sense of the parameters?

If you always carry mean and variance, then any receiver has at least some sense about what's going on without calculating anything. That to me is a greater good than saving those who do know their distribution type from doing some transformations, don't you think?

So, your point 3 is not true. It is not at all easy for a recipient that is distribution ignorant to make any sense of the special parameters.


So, this is still a problem.


I could see this solved in two ways.

1) always specify a standard measure of expected value and
   variance that can be used by distribution ignorant receivers
   IN ADDITION TO the necessary parameters.

2) reduce the dependent parameters such as to specify the
   expected value and variance and only those parameters that
   are needed in addition.

In order to come to a consensus, I even weaken my two options a little bit: I do not insist on "expected value" (= mean proper) and I do not insist on variance (proper) as the estimate of magnitude and variance for the distribution ignorant receiver.

Instead, we could play with using the mode for the triangular distributions, maybe even using the mode for the Gamma distribution, maybe use the asymptote for a log-normal or gamma distribution that has no finite mode. I am open to using whatever is a close enough estimate that would allow the receiver to make some estimated (more or less accurate) computations with the values, without understanding the distribution type.

I am quite positive that accommodating those ignorant receivers is a crucial criterion for making the whole PPD thing useful in today's world (and quite a few years down the road). We cannot expect all receivers to know all distribution types. It's much safer to expect a sender of a specific distribution type to know his distribution type and do the necessary transformation of the parameters that he may use internally.


> NEXT STEPS
>
> If this basic proposal is acceptable to everyone, the next
> topics to consider include ...


I think we need to wrestle with the above a little more.


> 1. NOMENCLATURE
>    - Do we use 'mean', 'stdev' and other English labels?


This is more about what our common parameters will be that everyone can use. If they are mean and standard deviation (mu and sigma) then why not use those names? If they aren't we adjust accordingly.


> - Do we use spelled-out Greek letters?


For what? The codes for Gamma and Chi-square can remain as they are, don't you think? Or do you mean for parameters mu, sigma, nu, alpha, beta, theta, etcetera? That would be O.K. for the specific extra parameters. I would recommend having more accessible names for the two standard parameters for magnitude and variance.


> - Consider the terminology used in the on-line NIST handbook?


certainly consider :-)


> - Is there another terminology standard we can use?
>
> 2. OTHER FEATURES
>    - Support representation using confidence intervals?


This goes into the triangular min-max and your minclip/maxclip things too, I guess.

Maybe what we'll end up doing is allowing a big set of parameters to be specified, including the common parameters for the ignorant, all the special parameters (even the redundant ones), plus confidence limits and confidence level. That way we provide the maximum degree of freedom (.... but who will put Humpty Dumpty back together again?)

> - Zero PPD outside of [minclip; maxclip] interval?


You may want to elaborate on that further.


> - Support alternative {mean, mode, median} representation?


...


> 3. HL7 V3 STANDARD CONTENT
>    - Include equations of PPDs? [highly recommended]


in the unabridged specification, absolutely.


> - Include conversions to simpler 'mean' and 'stdev'?


yes, but this discussion really comes first, up above.


Paul, I really appreciate your input. It's great to finally consider this thing earnestly. You make a great contribution!


thanks, -Gunther

Lloyd

Hi Thomas,

Sorry that things got a tad overheated. The great thing about HL7 is that there *is* debate (even fierce debate) about a wide variety of issues. The even greater thing is that HL7 tends to attract people who focus their debate on what will work and what will be good for the practice of *real* healthcare. I believe this is such a discussion.

I *have* read your response to the end, and will now go through and insert my comments :>


Lloyd McKenzie, P.Eng.
I/T Architect, IBM Global Services
Internet: LMCKENZI@CA.IBM.COM
PhoneMail: (780)421-5620
Internal Mail: 04*K0R*1004 *EDM


Thomas Beale <thomas@deepthought.com.au>@lists.hl7.org on 2002-06-11 10:32:07 PM

Please respond to Thomas Beale <thomas@deepthought.com.au>

Sent by: owner-cq@lists.hl7.org


To: Gunther Schadow <gunther@aurora.regenstrief.org>
cc: Mead Walker <mead@voicenet.com>, "Biron,Paul V" <Paul.V.Biron@kp.org>,
    Lloyd McKenzie/CanWest/IBM@IBMCA,
    "Schluter, Paul (MED, GEMS-IT)" <Paul.Schluter@med.ge.com>,
    "'HL7-CQ'" <cq@lists.hl7.org>, Sam Heard <sam.heard@flinders.edu.au>,
    Andrew Goodchild <andrewg@dstc.edu.au>

Subject: Re: Problem of data instances and uncertainty


Dear all,

I will try to maintain what I thought was a useful discussion on this topic, although I think it has become needlessly emotional. I will just provide a summary of what I think is important, but first I want to make a few background comments.

1. I understand these lists to be for the purpose of debating, communicating, educating, and ultimately progressing the ideas which make up standards and related artifacts. In my opinion, the best qualities of such fora are diversity of opinion and clarity of thinking; this means that not all people think the same, but as has been shown throughout history, diversity and dialectical discussion is a proven way to advance ideas. I don't believe that people with differing points of view should be characterised as spreading FUD, or in other negative personal ways; we have politicians and the media to do that. Also, my own points of view are generally not solely individual opinions, but synthesised from many other people's thinking (and I imagine the same is true of most participants here), so there is no point in being personal.

2. It is important that people in HL7 realise that there are and will be EHR and other systems everywhere which do not internally (or even at regional or national levels) use HL7 standards, or any one standard at all. This is just the way of the world, and there are too many reasons for it to go into here. Some will use CEN standards, some will use ASTM standards, some will use OMG standards, some will use none, and provide gateways which process HL7 messages. The reality is that there will be many systems which accept or export data which do not natively conform to the HL7 model of data, so it is important to realise that the HL7 data types model does need to take account of this situation. Hence my motivation for discussing some of the issues brought to light on this list.

3. For the record, I happen to think the HL7 data types work is an excellent piece of work. I do not agree with everything in it, nor with all of its design principles, but the disagreement is not one of legitimacy, but of fitness for purpose (and I think we can all agree that that is a very tricky question in standards development, since it requires a certain omniscience to correctly guess how well a model or solution will work in circumstances we don't know of or understand right now). Regardless, the quality of thinking is far above what one finds in many software libraries, computing texts etc. If I have not made this plain enough before, I hope I do so now.

4. A few times Gunther has accused me of not understanding "interface v implementation". If you knew my software engineering background, you would not say that, as I spent ten years on writing and educating software based on just these principles (but there is no reason you would know that ;-) . The first point is: I understand the HL7 RIM and DT specification _solely_ as logical models, or interfaces if you like; I never thought there was any ambiguity about this. When I mention "implementation" in these lists, I am not talking about me writing some code, I am talking about software engineering economics and data quality - the real world consequences of design decisions in standards models writ large. Every decision made in standards development is multiplied a hundred thousand times in implementation-land, and a billion-fold in data-land. So it behoves us to very carefully consider the consequences of what we develop and hand out to the rest of the world.

Now, to the matter at hand. I see the debate as one about design principle, in particular the principle of using structured data items to represent unreliable (i.e. partial) raw data. Here is my summary of the different design approaches, including concerns about consequences for software development and data. The main concern is to do with the quality and characteristics of data _in EHR systems_ - remembering that the internal design of such systems, and their communicable extracts may not use the same design of data, so clearly there is a concern with receiving messages with what appear to be structured data instances, but where there are attributes whose value is "UNK". Please feel free to disagree, find holes etc, but please read the whole thing before commenting.

  *     *     *     *     *

All HL7 data types inherit from the ANY class (equivalent to the DATA_VALUE class in openEHR) which contains the attributes:

    BL nonNull;
    CS nullFlavor;
    BL isNull;

The purpose of these attributes is to indicate whether a datum is Null, and for what reason. Since some data type classes also appear as the attributes of other data types, the Null markers also indicate whether any part of a datum is null. Thus, in the class Interval<T> shown below, all attributes have the possibility of containing a Null marker.

    type Interval<T> alias IVL<T> extends SET<T>
    {
        T low;
        BL lowClosed;
        T high;
        BL highClosed;
        T.diff width;
        T center;
        IVL<T> hull(IVL<T> x);
        literal ST;
        promotion IVL<T> (T x);
        demotion T;
    };

The consequence of this is that the entire model is essentially a model of potentially "partial" data types; any attribute and any function call may return a Null value as well as the true values of its type. This design decision was taken so that any datum, no matter how unknown, would be structurally representable in the same way as completely known data, enabling it to be processed in the same way as all other instances of the same type (and possibly fixed later on if more raw data became known). Possible consequences of the built-in Null marker design approach:

    software will be more complex (than without it) - both
    implementations of the data types, and of software which handle them.
    This is because the software always has to deal with the possibility
    of calls to routines and attributes returning Null values. Cf most
    EHR systems to date have taken the approach that a datum is either
    represented as an instance of a formal type if fully known, or else
    as narrative text if only partial;
 <Lloyd>I can't argue with you here.  Adding *any* new attribute, element
 or capability into a specification has the potential to make it more
 complex.  The question therefore is whether the cost of the added
 complexity is worth the benefits provided by the new
 attribute/element/capability.  The decision of course is heavily
 influenced by the intended use.  I suspect the crux lies there.  For
 your vision of the EHR, you perceive the datatype's approach to nulls to
 be of limited benefit and potentially of detriment.  I can't really
 argue whether they will be beneficial to your use-case, though I will
 (attempt :>) to show that it would be hard for them to be detrimental.
 HOWEVER: Before we go any further in the debate, let me offer an escape
 clause:
 If, for a particular application, you feel it is inappropriate for some
 (or any) of the values being considered to be null, it is perfectly
 possible to develop a restriction on a message type in which all
 attributes are marked as 'not null' and which have similar constraints
 against their properties.
 That was actually a nasty thing to say, as it changes the burden of
 proof somewhat.  It means that the issue now becomes "is it appropriate
 to have this treatment of nulls anywhere", as opposed to "is it
 appropriate in my usecase".  I would however like to keep the argument
 fairly open, and target it at "is the current treatment of nulls
 *potentially* useful in the majority of use-cases for HL7, and at the
 very worst, is not harmful to any of them."
 There is also a caveat to be aware of when forcing values to be not
 null.  When you make something not null, the ramification is that no-one
 can create a document/message/whatever unless they know that value.  If
 they really want to create the document/message/whatever, the temptation
 will be to fudge the data to get the @~#%%@* system to create the order.
 If on the other hand, you only require values for those things which are
 absolutely essential for the function to complete (and which will
 therefore hopefully make sense and seem reasonable to those entering the
 data), you're less likely to run into this problem.
 </Lloyd>
    data may not always be safely processable, since some software may
    not properly handle the null values associated with attributes of
    partially known data items. Essentially, all software which processes
    the data has to be "null-value aware".

<Lloyd>I can hear Gunther already :> "Potentially harmful how?" There is of course the general argument that adding a new attribute/element/anything to a specification might be a bad idea because it makes things more complex, and because implementors could screw it up. The argument applies to everything, and if we paid it much attention, we'd never get anything done. I'll presume you mean something different. I see two possibilities:

1. You are concerned that the specification (as written) is overly hard to understand, is incomplete, or lends itself to easy mis-implementation. If so, it would be helpful if you could identify the particular places where you find this to be the case. (I.e. what is hard to understand, what could be added to make it clearer, where are you most worried the implementors will mess up.)

2. You believe even if implemented correctly, the 'seeming' presence of information in computable form will lead people (or software) to make different (and more erroneous) decisions than if the information had not been present at all. If so *please* provide a scenario - any scenario - where an application could receive a piece of data containing a null property (or even several null properties) and when implemented according to the specification could result in erroneous decisions, statistics or other problems. (This may be completely obvious to you, but neither Gunther nor myself seem to be able to see it :>) </Lloyd>

The HL7 data type model is in contrast with simpler approaches such as those used in CEN, GEHR, openEHR, many EHR systems, and probably a lot of general purpose software data-type libraries (not saying all), where data types are formal models of types such as Coded_term, Quantity, and so on. Rather than build the possibility of null markers into every attribute and class in the data types, a single null marker is defined in relevant containing classes. This decision is based on the principle that data types should be defined independent of their context of use. Hence, where data types are used as data values, such as in the value attribute of the class ELEMENT from the openEHR EHR reference model, the parallel attributes is_null and null_flavour are also defined (taken from the HL7 analysis by the way). However, where data types appear as attributes elsewhere in the model and there is no possibility of them being null, no null markers are used. The consequences of this decision include:

    data types can be more easily formally specified, since the semantics
    of invariants, attributes and operations do not need to include the
    possibility of null values;
 <Lloyd>Less is always more easily defined.  However, the work of
 defining the semantics has been done, so we don't really have to worry
 about the effort of defining them (unless you feel they are
 incomplete?)</Lloyd>
    software implementations are simpler;
 <Lloyd>Discussed earlier</Lloyd>
    data are always guaranteed to be safely processable for decision
    support and general querying, since no instance of a formal type will
    be created in the first place if the datum is very unreliable;
 <Lloyd>Again, we *really* need an example.  Let me take a slightly
 different approach:
 When an application (or a person) wants to look at or manipulate a
 datatype, they tend not to need or use the *whole* datatype.  Rather,
 they use one or two properties of the datatype.  For example, if the
 question is "Does the IVL<a,b> contain x", you need to know two
 properties for the interval: lower bound & upper bound.  If you only
 know one or the other, you can answer the question sometimes.  If you
 know neither, you can't answer the question at all.  The fact that the
 properties you need are null means that, from the perspective of your
 question, the datatype *is* null.  (Exactly the same as if we just had
 one big null indicator at the top.)  However, the fact that you can't
 answer one question doesn't mean you can't answer any question.  Using
 Gunther's earlier example, if you have an interval with null values for
 3 of the 4 properties: lower bound, upper bound and center; but a
 non-null value for width, you *still* have information, and can still
 answer questions about the data.  E.g. How long did it take?
 The absence of the other 3 properties does NOT necessarily mean that the
 property that is present is any more or less 'error-prone' than if all 4
 had been present.  If the IVL represents time taken to perform a
 procedure, the recorder might put down that it took 5 minutes.  That
 they didn't say the procedure went from Jan 17, 2002, 15:43 to Jan 17,
 2002 15:48 does not mean that the 5 minutes is invalid.  If they wanted
 to be really precise, they could even have specified 5 minutes ± 2
 minutes.  The important point is we *still* have information, and the
 information is useful.  You might not be able to answer all questions
 with it, but at least you can answer some.
 An alternative would be to elevate the properties so you can specify
 each of them independently.  E.g. Procedure start time, Procedure end
 time and Procedure duration (unlikely anyone would ever record or ask
 Procedure half-way point :>).  The risk you run then is that the 3
 values specified would not agree (i.e. end - start != duration).  The
 beautiful thing about using IVL is that it prevents that kind of
 inconsistency.  If you only have 1 piece of information, that's ok, we
 can store it.  If you have 2 pieces of information, great, we can store
 those and automatically infer the rest.
 </Lloyd>
    null markers only appear in models where they are relevant, rather
    than everywhere data types are used;
 <Lloyd>I agree.  However, null is relevant any place where information
 might not be known or complete.  In the general case, this can be pretty
 much everything.  Within the context of a particular message (or
 structured document, or EHR component), this can be constrained such
 that the creating committee can say "If you don't know at least X, Y and
 Z, then you can't create the message/document/component"  As a general
 rule, MnM has said that committees should only do this if there is truly
 nothing useful that can be done with the message/document/component.
 What pieces of information are necessary will vary based on context &
 use-case.  Let me take pharmacy as an example (seeing as I play there a
 lot :>)
 Situation A. If somebody sends an order for a substance to be
 administered to a patient, but doesn't identify the drug to be
 administered, but in the message in the place of the drug sends a Null
 with a flavour of "protected due to patient confidentiality", you have a
 useless message.  No-one can act on the order, because they don't know
 what to administer.
 Situation B. A provider queries the database for a list of medications
 that the patient is taking.  The system returns a list of drugs, however
 for one of the drugs, the drug information has been replaced with a Null
 having the same flavour as above.  In this case, the message is still
 useful.  The provider knows all of the non-null drugs.  Furthermore,
 they know that the patient is on at least one other medication that was
 not listed, but that the system is aware of and making checks against.
 Perhaps there is a procedure so that that provider can gain information
 about the hidden drug if they have consent/justification.
 The situations outlined above are based in real life - we have a
 province-wide system starting to roll out that does just that.
 </Lloyd>
    however, the openEHR data types do not automatically deal with
    missing or unknown internal attribute values (such as missing high
    and low values for the interval).
    some raw data will not cause the creation of structured data
    instances, and will not be included in queries (except queries that
    explicitly search for them)
 <Lloyd>You always have the option to exclude information from queries.
 In general, you'll exclude it from queries if the properties you need
 access to to perform the query are null.  If the properties you don't
 need happen to be null, who cares :> </Lloyd>

In order to deal with the last two points, various approaches are used in openEHR:

    for most data which is not fully known, no data type instance is
    created, and a null marker is created. Depending on the design of the
    relevant archetypes, there will usually be the possibility of
    recording the datum in narrative form;
    ENTRYs in the openEHR EHR reference model include a certainty:BOOLEAN
    attribute, for recording a level of doubt;
 <Lloyd>HL7 allows marking of certainty in a variety of different ways.  You
 can use formal probability models (useful in some areas, unlikely to be
 used in others).  The closest corresponding concept would be UVN
 (uncertain value narrative) which allows a code to be associated with the
 value indicating how certain/uncertain the value is.  It allows a
 slightly broader range than 'yes/no', but the principle is the same.
 Please note that uncertainty about the reliability/accuracy of a value
 is a different concept than whether the value is known or not.  (I can
 be 100% certain that I don't know the value :>)  We also have UVP which
 allows the probability to be expressed as a percentage.</Lloyd>
    for particular data types which are often partial, special types are
    defined. The main types affected are DV_DATE and DV_TIME, hence the
    types DV_PARTIAL_DATE and DV_PARTIAL_TIME exist to define explicitly
    the semantics of dates with a missing day, times with missing seconds
    and so on;
 <Lloyd>I presume you do this so that you can enforce a particular level
 of precision in different attributes.  For example, you might allow a
 partial date for DateOfBirth, but not for EncounterAdmitDate.  HL7 has
 already defined some sub-types that act as restrictions on their parent
 types for common scenarios.  It may make sense to do similar things for
 timestamp.  (I'm sure PA would be horrified if they realized that
 restricting Encounter.effective_time from GTS to IVL<TS> still allows
 someone to say IVL<2002 .. 2002> :> )
 </Lloyd>
    for expressing uncertainty more precisely, probability distribution
    data types (based on the types defined in HL7) can be used.

Now, Gunther has made the following point:

    So, I am saying that your approach creates needless false negatives
    while our approach does NOT create false positives.

The example that Lloyd supplied shows correct processing, even if the data is unknown:

           Now let's assume we have an start date, but no end date:
           Contains(IVL<20010101..Unknown>, 20011026) = Unknown
           Contains(IVL<20010101..Unknown>, 20001231) = false
           Contains(IVL<20010101..Unknown>, 20020102) = Unknown

I am not suggesting this won't work. We have Partial_date and Partial_time types for this sort of thing, and if sufficient raw data is available to create one of these, there will be no false negatives on those particular items. I can imagine a Fuzzy_interval class as well, but it would not be a sensible thing to define unless you state invariants requiring some kind of values for the limits (presumably with fuzzy or probabilistic qualification).

My main objection computationally to the IVL example was if both limits were missing, because then the IVL is no longer anchored on the axis of its base type (say TS or PQ); then there is no way for the Contains() routine to do anything useful at all (nor the other Set<T> functions). I.e., past a certain level of unknown-ness, the item is no longer useful in computation. <Lloyd>Correct. If you're missing the upper and lower boundary, you can't do anything with Contains() at all. From the Contains() point of view, the whole IVL is Null. However, from the "How long did it take" point of view, you can still get back a perfectly reasonable (and correct) answer. The trick is in your use-case and in the constraints you apply at design time. If the ability to ask a Contains() question about a particular attribute is critical to the usecase for a message/document/EHR component, then you restrict those properties to be Not Null when you define the message type. If, on the other hand, Contains() is simply a "nice-to-have", but is not essential to business flow, you can allow people to send the information they do have, which might be useful somewhere else.</Lloyd>
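
(Aside, not part of the thread: a minimal sketch of the three-valued Contains() behaviour described above, with None standing in for a null bound and dates written as plain YYYYMMDD integers; the names are illustrative only.)

    # Sketch: null-aware containment returning True, False or None ("Unknown").
    def contains(low, high, x):
        if low is not None and x < low:
            return False
        if high is not None and x > high:
            return False
        if low is None or high is None:
            return None            # not enough known bounds to decide
        return True

    print(contains(20010101, None, 20011026))   # None  (Unknown)
    print(contains(20010101, None, 20001231))   # False
    print(contains(20010101, None, 20020102))   # None  (Unknown)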

But there is a more general point: data items where some elements are completely known and some are completely unknown are not very reliable as a whole (dates and times are an exception, because the unknown bits are at finer precision), and it is questionable whether the "known" parts are that reliable; it all depends on the real-world situation at hand. In general, sources of very partial raw data may not be trustworthy even for those elements they claim to know. At some point, some items of data are so useless as to always fall into the bucket of "false negatives"; this is always the case - all statistical analyses try to avoid such unreliable data (or more properly, corral it into its own category, e.g. as with informal votes in an election - at least you can count them).

<Lloyd>The assertion that "if some elements are completely known and others are completely unknown, the data is unreliable" needs a bit more backing up. Do you have some examples? In terms of corralling the data, you can definitely still do that. Whether you want to/need to depends on your purpose. Consider a survey of 10 questions with a range of possible answers, including 'unknown'. If in the response one of the answers is marked as unknown, the analyst has several choices. a) They can exclude the whole response from the survey, which may be necessary if the statistical analysis being performed involves computations that interrelate the answers to all 10 questions, and they can't be processed if they contain 'null' values. b) They can exclude the one answer, but process the rest, which will work just fine if the questions are independent, but 'null' values can't be handled. c) They can process the whole result, including the 'unknown' value, because the fact that it is unknown is of statistical or analytical interest.

Unless you are in situation a), you are still better off to have the 'unknown' option on the survey because you get at least some information. You might even be better off in situation (a), because if people have the option of saying "I don't know" they are less likely to pick their 'best guess' or just fill something in so that the survey is complete.

Allowing the 'null' values for properties seems to allow better quality analysis than taking the approach of "fill out the survey if you know everything, otherwise don't bother." </Lloyd>

What I am not convinced about is the generic provision to allow Null values everywhere inside all data types, since this complicates both the specification and its implementations. Apart from the very common cases of partial date/time data, I am not convinced that the extra complexity is necessary.

<Lloyd>Handling null for legacy systems that have never really dealt with the concept will indeed be hard. However, dealing with the concept from an object-oriented system point of view really isn't that challenging. So long as you know how to handle nulls when making your inferences (which the datatype document does a reasonable job of explaining), I think you're ok. Allowing values to be null allows us to more accurately reflect the real world, where not all things are known all the time. In those circumstances where we *need* something to be known, we can force that to be the case. However, as a design principle, you're safer with the assumption that anything can be unknown, and we'll force to be Not Null those things that we just can't handle otherwise. This assures that you can get and process the most information possible for your use-case. If you instead take the approach 'everything must be known, unless I explicitly declare otherwise', you are at risk of not allowing for the numerous edge cases where something really isn't known. Name, date of birth, even gender are fields that are commonly made mandatory in systems, however it is rare that absolutely no processing can be done without them. There are many cases in the real world where these values simply aren't known, and yet you still need to be able to order a drug, etc. If a system finds that it can't make a determination because the data is missing, it can ask (e.g. Drug not recommended for females, gender). The physician can then work up the courage to ask the androgynous-looking patient whether they are male or female :> </Lloyd>

It is a common idea in the EHR arena and in the computing field in general that structured data items should only be created:
- when they are required for computational use (i.e. not just display for humans)
- when the required raw data exists to do so

<Lloyd>The *only* reason we have nulls at all is for machine processing. Humans are particularly adept at handling missing information. It's machines that need to be explicitly told. The current datatype approach allows the *available* data to be recorded. In circumstances where certain data is required, we have the ability to enforce this.</Lloyd>

Because of this, many systems are designed and implemented on this principle. When some HL7 messages are received containing what appear to be structured data, but where important values are missing (making the data items invalid with respect to the system's own data item model), the system has to a) recognise the difference and b) do something with the invalid items. <Lloyd>If messages are designed appropriately, important (or at least essential) items will never be missing, because the committee has identified them as essential and indicated that they may never be null. It is only the 'unessential' data that can be null.</Lloyd>

One idea that has occurred to me in all this is that what we call a "free text" representation of partial or unreliable data could in fact contain an XML text item whose structure corresponds to the HL7 way of doing things - all the tags are there, just "UNK" in the value positions. If indeed the missing bits magically turn up, the XML could be more easily used to reconstruct the item than unstructured text.

<Lloyd>That's more or less what we do have, although I think we use an attribute to store 'UNK'.</Lloyd>

- thomas beale

Paul

Wednesday, June 19, 2002,

Gunther and colleagues,

Gunther, I believe there may be a solution to the concerns you raised about a "distribution ignorant" receiver not being able to process an arbitrary parametric probability distribution (PPD) that was sent to it.

In the current HL7 Version 3 draft standard, all PPDs are reported by the sending device as a PPD_type, mean and standard deviation. Although this representation will generally suffice for most applications, significant information can be lost, especially if the underlying PPD requires more than two parameters to model the underlying distribution.

Another alternative that doesn't impose too significant a burden on the receiver is to have the receiver calculate the mean and standard deviation based on the PPD_type and the parameters that are 'native' to that PPD. This would allow the sender to specify the PPD and its parameters in the most appropriate and accurate format. Receivers that were PPD-capable could fully exploit this information, whereas a less capable receiver would convert the parameters to the simpler mean and standard deviation.

In the attached note, I have listed the PPDs as well as their defining equations, and the equations to convert to the mean and standard deviation format (the latter based on many of the equations provided for the current HL7 V3 PPD definitions). I have retained the optional 'scale' and 'location' parameters where appropriate. I have also added two new PPDs, 'triangular' and 'trapezoidal', that provide a simple way of expressing asymmetric uncertainty and the equations for calculating the mean and standard deviation.
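
(Aside, not from Paul's attachment: the standard textbook conversion for a triangular distribution with minimum a, mode c and maximum b looks like this; the numbers below are made up for illustration.)

    # Sketch: mean and standard deviation of a triangular PPD (min a, mode c, max b).
    import math

    def triangular_mean_sd(a, c, b):
        mean = (a + b + c) / 3.0
        var = (a*a + b*b + c*c - a*b - a*c - b*c) / 18.0
        return mean, math.sqrt(var)

    print(triangular_mean_sd(1.0, 3.0, 5.0))   # (3.0, ~0.816)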

I hope this proposal provides a good starting point for further discussion!

Thanks and regards,

Paul Schluter

Gunther Schadow wrote on 6/11/02:

...

> > Paul, the one basic concern I have with this is that it doesn't > allow the receiver to be "distribution ignorant." This will not > work in practice. This forces all receivers to understand all > the distributions. > > I am certainly not against doing what is right, and it seems > like the assumption that all parameters could be derived from > mu and sigma was mislead (I'm still not sure but my intuition > says that there is a point.) So, we definitely must allow for > more parameters. > > However, I think your point 2 is less of a strong argument than > it seems to be. The question, regarding the distribution ignorant > receiver, is this: > > Who carries the burden of making sense of the parameters? > > If you always carry mean and variance, then any receiver has at least > some sense about what's going on without calculating anything. That > to me is a greater good than saving those who do know their > distribution type from doing some transformations, don't you think? > > So, your point 3 is not true. It is not at all easy for a > recipient that is distribution ignorant to make any sense of the > special parameters. > > > So, this is still a problem. > > > I could see this solved in two ways. > > 1) always specify a standard measure of expected value and > variance that can be used by distribution ignorant receivers > IN ADDITION TO the necessary parameters. > > 2) reduce the dependent parameters such as to specify the > expected value and variance and only those parameters that > are needed in addition. > > In order to come to a consensus, I even weaken my two options > a little bit: I do not insist on "expected value" (= mean proper) > and I do not insist on variance (proper) as the estimate of > magnitude and variance for the distribution ignorant receiver. > > Instead, we could play with using the mode for the triangular > distributions, may be even using the mode for the Gamma > distributioin, may be use the asymptote for a log-normal or > gamma distribution that has no finite mode. I am open to using > whatever is a close enough estimate that would allow the receiver > to make some estimated (more or less accurate) computations > with the values, without understanding the distribution type. > > I am quite positive that accomodating those ignorant receivers > is a crucial criterion to make the whole PPD thing be useful > in todays world (and quite a few years down the road.) We cannot > expect all receivers to know all distribution types. It's > much safer to expect a sender of a specific distribution type > to know his distribution type and do the necessary transformation > of the parameters that he internally may use.

...

Lloyd

Hi Gunther,

Paul's discussion of a trapezoidal distribution reminded me about a special distribution I'd requested, and I thought I should confirm that it will also be present.

The distribution has zero probability outside of a defined interval, and unspecified probability within the interval (with the natural constraint that the area under the probability 'curve' is equal to 1).

The purpose is to handle dosage uncertainties such as 1-3 tablets, TID. The dose could be 1, 2, 3, 1.5, 2.75, etc. However, these potential dosages do not all have equal probabilities (2.75 is quite unlikely :>). All that we know is that the dosage is some value within the interval.


Lloyd

Paul

Tuesday, June 25, 2002,

Gunther and colleagues,

Gunther, I have taken your concerns to heart regarding the PPD-ignorant receiver, and agree with you that the sender should always report the 'value' and the 'standardDeviation'. This seems to be about the only way to ensure a reasonable degree of interoperability with PPD-ignorant receivers.

This will also give us the freedom to specify new or more sophisticated PPDs in the future without breaking basic compatibility with receivers that may not understand the new PPD types or those that are simply just PPD-ignorant. For example, PPD 'mixtures' (i.e. a PPD that is a weighted combination of one or more basic PPDs) and confidence intervals can be easily described, based on the proposal below.

To accomplish this, we first would re-adopt the XML PPD encoding similar to the earlier Ballot #1 definition of a PPD:

' <somePPD_PQ T="PPD_PQ" value="4.5" stdDev="0.1" ppdType="N" unit="mmol/L"/>

where the 'value' and standard deviation 'stdDev' are required attributes that provide the information for the PPD-ignorant receiver [A], [B]. This is essentially equivalent to the PPD described in HL7 V3 to date.

The following sections illustrate how additional child elements can be used to enhance and extend this simple PPD representation.


1. PPD WITH 'NATIVE' PARAMETERS

A PPD_PQ with a triangular distribution is shown below:

' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.288" ppdType="TRI" unit="mmol/L">
'   <ppd_TRI a="4.0" b="5.0" c="4.5"/>
' </somePPD_PQ>

The attributes 'value' and 'stdDev' provide interoperability with PPD-ignorant recipients and the PPD is specified by a child element using 'native' parameters. The 'unit' attribute applies to both the 'value' and 'stdDev' as well as the child PPD.


2. PPD MIXTURE WITH NON-OVERLAPPING PPDs

Specifying the native PPD as a child element makes it very easy to define more sophisticated PPDs that include 'mixtures' of continuous as well as discrete PDFs. For example, a PPD with a central peak and two sidelobes could be described as a mixture of three Uniform PPDs:

' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" unit="mmol/L">
'   <ppd_U a="4.0" b="4.45" prob="0.045"/>
'   <ppd_U a="4.45" b="4.55" prob="0.91"/>
'   <ppd_U a="4.55" b="5.0" prob="0.045"/>
' </somePPD_PQ>

The absolute contribution of each PPD is specified by 'prob' (as in 'probability') and the sum of the probabilities should be unity.

The reported 'value' should be consistent with the PPD mixture; in the example above, the reported 'value' is equal to the mean of the three Uniform PPDs, and the reported 'stdDev' is also derived from the three PPDs. The sender could also report the mode or median as the 'value', independent of the underlying PPDs, and we should allow this.
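
(Aside, not from the email: how the overall 'value' and 'stdDev' of the mixture above can be derived from the component Uniform PPDs and their 'prob' weights; this reproduces the 4.5 and 0.0953 reported in the example.)

    # Sketch: mean and standard deviation of a mixture of Uniform PPDs.
    import math

    def uniform_moments(a, b):
        return (a + b) / 2.0, (b - a) ** 2 / 12.0      # mean, variance

    def mixture_mean_sd(components):
        # components: (a, b, prob) for each Uniform PPD in the mixture
        mean = sum(p * uniform_moments(a, b)[0] for a, b, p in components)
        second = sum(p * (uniform_moments(a, b)[1] + uniform_moments(a, b)[0] ** 2)
                     for a, b, p in components)
        return mean, math.sqrt(second - mean * mean)

    mix = [(4.0, 4.45, 0.045), (4.45, 4.55, 0.91), (4.55, 5.0, 0.045)]
    print(mixture_mean_sd(mix))   # (4.5, ~0.0953)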


3. PPD MIXTURE WITH OVERLAPPING PPDs

PPDs can also overlap in a 'mixture'. For example, the PPD in #2 could also be represented as:

' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" unit="mmol/L">
'   <ppd_U a="4.45" b="4.55" prob="0.9"/>
'   <ppd_U a="4.0" b="5.0" prob="0.1"/>
' </somePPD_PQ>

where the PPDs are stacked above each other (like a wedding cake).


4. PPD MIXTURE WITH CLIPPING AND CONFIDENCE INTERVALS

Two additional and optional elements, 'ppd_minClip' and 'ppd_maxClip', can be used to truncate the range of the random variable for a single or mixture PPD. In the example below, the Normal distribution is clipped at 2.58 times the stdDev, so that the area under each tail equals 0.005.

' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.1" ppdType="N" unit="mmol/L">
'   <ppd_N mu="4.5" sigma="0.1"/>
'   <ppd_minClip a="4.242" prob="0.005"/>
'   <ppd_maxClip b="4.758" prob="0.005"/>
' </somePPD_PQ>

Independent of whether this distribution is clipped or not, one could say that "we are 99% confident that the 'value' ('mu') lies in the interval [a,b] = [4.242,4.758]" for the distribution shown above. It would also be possible to specify component PPDs within a mixture to facilitate easy identification of the '95 percent' and other confidence intervals.
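
(Aside, not from the email: where the 2.58-sigma clip points and the 99% statement come from; Python's statistics.NormalDist is used purely for illustration.)

    # Sketch: clip points of N(mu=4.5, sigma=0.1) leaving 0.005 in each tail.
    from statistics import NormalDist

    dist = NormalDist(mu=4.5, sigma=0.1)
    a = dist.inv_cdf(0.005)                      # ~4.242
    b = dist.inv_cdf(0.995)                      # ~4.758
    print(round(a, 3), round(b, 3))
    print(round(dist.cdf(b) - dist.cdf(a), 2))   # 0.99 -> the "99% confident" interval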

In summary, this proposal illustrates how single and mixture PPDs can be defined in their 'native' format while providing a reasonable degree of interoperability with PPD-ignorant receivers. It also supports PPD clipping and the ability to specify confidence intervals. I believe this will provide an excellent foundation for sophisticated applications that use PPDs without unduly burdening recipients that are PPD-ignorant.

Regards,

Paul Schluter GE Medical Systems Information Technologies

office: (414) 362-3610 fax: (414) 362-2880 email: Paul.Schluter@med.ge.com


Remaining Issues:

1. Including the ppdType attribute in the parent PPD_PQ element appears to be somewhat redundant, since the ppdType can be determined from the <ppd_*> child elements. Should this attribute be retained in the parent element?

2. For the PPDs 'U', 'TRI' and 'TRP', I have consistently used "a" and "b" to represent the min and max, and then "c" and "d" are used to further refine the shape of the distribution. Is the non-consecutive order of a, b, c, d acceptable? (I have seen both formats used in the literature.)

3. Using the 'native' parameters for PPDs _and_ requiring the sender to provide the overall mean/mode (value) and standard deviation (stdDev) makes it a lot easier to add new PPDs in the future. Should we use more expressive ppdType codes so that PPD codes we define in the future don't conflict with existing codes?

4. Should we explicitly indicate that the reported 'value' is the mean, mode or median?


Footnotes:

[A] Lines are prefaced by a single quote to preserve indenting, since leading spaces are removed by the HL7 email server.

[B] For this email, I will use longer, more descriptive attribute names for V, SD and TY PPD attributes defined in Ballot #1, but not quite as long as those defined in Ballot #3.


> -----Original Message-----
> From: Gunther Schadow [1]
> Sent: Wednesday, June 19, 2002 7:54 PM
> To: Schluter, Paul (MED, GEMS-IT)
> Cc: Mead Walker; Biron,Paul V; Lloyd McKenzie/CanWest/IBM; 'HL7-CQ'; 'hl7-xml'
> Subject: Re: HL7 PPD representation
>
> Paul I thank you for the revisions you made to the table.
> This is definitely going to replace the table we have now.
>
> I also agree more and more to your desire to specify the
> natural parameters rather than just the generic parameters.
> I can see how the crowd who would really be using the
> distributions would like that much better. I am still
> concerned about the ignorant receiver. Those people get
> totally scared off by those formulas. You cannot ask some
> interface engine guy to do any calculations of that kind.
> On the other hand, you might say, those would have no
> business with PPDs anyway, and that's probably true also.
>
> I would like it more if we could either require mu and sigma
> in addition or replace the location and scale parameters with
> mu and sigma where possible. The nice thing is that mu and
> sigma specified as quantities with dimensions give you quite a
> bit of the location and scale stuff (mu -> location, sigma
> -> scale.)
>
> The clip interval could handle Lloyd's request. It would be
> a guess distribution with clip interval.
>
> Confidence intervals could also fit in somehow.
>
> However, at this point what was once seemingly a nice tight
> thing with only two parameters has now become a big thing
> with many choices and options and requires people to calculate
> even for a naive interpretation of the data sent to them.
>
> If that is what we must do, I will support it. I just
> plead with you to help reduce the burden on the ignorant
> receiver even further. Is there a compromise somewhere?
>
> thanks,
> -Gunther

Gunther

Schluter, Paul (MED, GEMS-IT) wrote:

> Gunther, I have taken your concerns to heart regarding the PPD-ignorant > receiver, and agree with you that the sender should always report the > 'value' and the 'standardDeviation'. This seems to be about the only > way to ensure a reasonable degree of interoperability with PPD-ignorant > receivers.


Thanks! This is overall a great proposal with good new ideas.


> This will also give us the freedom to specify new or more sophisticated > PPDs in the future without breaking basic compatibility with receivers > that may not understand the new PPD types or those that are simply just > PPD-ignorant. For example, PPD 'mixtures' (i.e. a PPD that is a > weighted combination of one or more basic PPDs) and confidence intervals > can be easily described, based on the proposal below. > > To accomplish this, we first would re-adopt the XML PPD encoding similar > to the earlier Ballot #1 definition of a PPD, > ' <somePPD_PQ T="PPD_PQ" value="4.5" stdDev="0.1" ppdType="N" > unit="mmol/L"/> > > where the 'value' and standard deviation 'stdDev' are required > attributes that provide the information for the PPD-ignorant receiver > [A], [B]. This is essentially equivalent to the PPD described in HL7 V3 > to date.


I understand that the "value" is now just loosely defined as "the representative value for this distribution." It could be the mean or the median or the mode. Would there be any rules as to what representative value to use in which distribution? Some guidance may be needed to avoid people going totally astray.


The unit for the representative value and the standard deviation may be different. Or one may not have a unit at all. Most notably with a PPD<TS> the value is a TS and has no unit and the standard deviation has some unit comparable to 1 s. Or take the old, true, definition of degree Celsius, where differences are measured in Kelvin.

So, I think that we can take the main points of your proposal and integrate them into the post-ballot 2 schema.


> The following sections illustrate how additional child elements can be > used to enhance and extend this simple PPD representation. > > > 1. PPD WITH 'NATIVE' PARAMETERS > > A PPD_PQ with a triangular distribution is shown below: > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.288" ppdType="TRI" > unit="mmol/L"> > ' <ppd_TRI a="4.0" b="5.0" c="4.5"/> > ' </somePPD_PQ>

> The attributes 'value' and 'stdDev' provide interoperability with > PPD-ignorant recipients and the PPD is specified by a child element > using 'native' parameters. The 'unit' attribute applies to both the > 'value' and 'stdDev' as well as the child PPD.
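(Editorial note: the following minimal Python sketch is not part of the original proposal; it simply illustrates the fallback behaviour the proposal relies on, namely that a PPD-ignorant receiver reads only the required value/stdDev/unit attributes of the parent element and ignores the ppdType code and any <ppd_*> child.)

    import xml.etree.ElementTree as ET

    # Example instance copied from the proposal above (triangular PPD).
    xml_text = """
    <somePPD_PQ T="PPD_PQ" value="4.5" stdDev="0.288" ppdType="TRI" unit="mmol/L">
      <ppd_TRI a="4.0" b="5.0" c="4.5"/>
    </somePPD_PQ>
    """

    def read_ppd_ignorant(element):
        """A PPD-ignorant receiver: use only the required attributes of the
        parent element; skip the ppdType code and all <ppd_*> children."""
        value = float(element.attrib["value"])
        std_dev = float(element.attrib["stdDev"])
        unit = element.attrib.get("unit", "1")
        return value, std_dev, unit

    value, std_dev, unit = read_ppd_ignorant(ET.fromstring(xml_text))
    print(value, "+/-", std_dev, unit)   # 4.5 +/- 0.288 mmol/L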


You assume that everything uses the same unit. Alas, in the case of a TRI PPD<TS> you would not have units for a, b, and c, but you would for the standard deviation.

I would prefer to do the straightforward thing on the abstract specification layer and define a specialization type for each of the distributions. There the parameters would be just normal properties and each would come together with its unit if applicable.

We can then map the distribution type attribute to xsi:type.

> 2. PPD MIXTURE WITH NON-OVERLAPPING PPDs > > Specifying the native PPD as a child element makes it very easy to > define more sophisticated PPDs that include 'mixtures' of continuous as > well as discrete PDFs. For example, a PPD with a central peak and two > sidelobes could be described as a mixture of three Uniform PPDs: > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" > unit="mmol/L"> > ' <ppd_U a="4.0" b="4.45" prob="0.045"/> > ' <ppd_U a="4.45" b="4.55" prob="0.91"/> > ' <ppd_U a="4.55" b="5.0" prob="0.045"/> > ' </somePPD_PQ> > > The absolute contribution of each PPD is specified by 'prob' (as in > 'probability') and the sum of the probabilities should be unity.


Is it common terminology to describe the weight in a probability distribution mix as "probability"? Wouldn't "weight" be a less confusing word?

What is the semantics? I assume that if f_1, f_2, f_3, ... are the density functions and w_1, w_2, w_3, ... are the weights we would have

    f(x) = SUM_i [ f_i(x) * w_i ]

Is that correct? Or is it the probability functions

    F(x) = INT[t=0..x] f(t) dt

that we add in this form? Probably the latter, right?


> The reported 'value' should be consistent with the PPD mixture; in the > example above, the reported 'value' is equal to the mean of the three > Uniform PPDs, and the reported 'stdDev' is also derived from the three > PPDs. The sender could also report the mode or median as the 'value', > independent of the underlying PPDs, and we should allow this.


O.K., I like that. But would we then not have to also specify exactly how they relate, and, most of all, wouldn't we have to make the same rules for simple, non-mixed, distributions?

On the abstract layer the mixed distribution will look like quite a complex thing, but we can tame that beast when it comes to the ITS.


> 3. PPD MIXTURE WITH OVERLAPPING PPDs > > PPDs can also overlap in a 'mixture'. For example, the PPD in #2 could > also be represented as:

> > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" > unit="mmol/L"> > ' <ppd_U a="4.45" b="4.55" prob="0.9"/> > ' <ppd_U a="4.0" b="5.0" prob="0.1"/> > ' </somePPD_PQ> > > where the PPDs are stacked above each other (like a wedding cake).


Stacked above? You mean added, right? The functions or the density functions?


> 4. PPD MIXTURE WITH CLIPPING AND CONFIDENCE INTERVALS > > Two additional and optional elements, 'ppd_minClip' and 'ppd_maxClip', > can be used to truncate the range of the random variable for a single or > mixture PPD. In the example below, the Normal distribution is clipped > at 2.58 times the stdDev, so that the area under each tail equals 0.005. > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.1" ppdType="N" > unit="mmol/L"> > ' <ppd_N mu="4.5" sigma="0.1"/> > ' <ppd_minClip a="4.242" prob="0.005"/> > ' <ppd_maxClip b="4.758" prob="0.005"/> > ' </somePPD_PQ>


Hmm, doesn't the prob attribute now have yet a different meaning here? I see, you want to specify the probability of the tail rather than the probability inside the confidence interval. That's like the p-values or alpha values, right? This allows you to clip on only one end for doing single-sided statistics, right?

Too sad that we don't actually *see* the confidence-*interval* as an interval. Anything we could do about that? How about this:

- there is a clip interval property with low ~ minClip and high ~ maxClip

- plus the confidence level of the inside of the interval, i.e., 99% in the above example.

- single-sidedness is indicated by having only one finite boundary and setting the other at infinity.

-> the advantage of this is that you can take this interval as an interval without having to "think." And you can still get the tail probabilities.

-> this does imply, however, that if you have a two-sided interval, the probabilities of the tails should be the same, or else one would have to use the probability function to determine the tails, which is kind of cumbersome. Is there ever a need to shift a confidence interval to one side such that the tails are of unequal sizes?
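(Editorial note: a minimal sketch, not part of the thread, of how this "clip interval plus confidence level" reading maps onto the clipped Normal example in proposal #4. It assumes a symmetric two-sided clip with equal tail probabilities; Python's standard-library NormalDist is used only for the quantile calculation, and the 2.58·stdDev bounds and 0.005 tails fall out directly.)

    from statistics import NormalDist

    # Numbers from the clipped Normal example in proposal #4; the symmetric
    # two-sided clip with equal tail probabilities is an assumption of this sketch.
    mu, sigma, confidence = 4.5, 0.1, 0.99
    tail = (1.0 - confidence) / 2.0          # 0.005 under each tail

    dist = NormalDist(mu, sigma)
    low = dist.inv_cdf(tail)                 # ~4.242, the minClip bound
    high = dist.inv_cdf(1.0 - tail)          # ~4.758, the maxClip bound

    print("clip interval [%.3f, %.3f] at %.0f%% confidence" % (low, high, confidence * 100))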


> Independent of whether this distribution is clipped or not, one could > say that "we are 99% confident that the 'value' ('mu') lies in the > interval [a,b] = [4.242,4.758]" for the distribution shown above. It > would also be possible to specify component PPDs within a mixture to > facilitate easy identification of the '95% percent' and other confidence > intervals.


How would we say that the distribution is clipped vs. isn't? I gather if we say that the confidence interval has two finite bounds and the confidence level is 100% then we have clipped the distribution, right?

That would then cover Lloyd's pharmacy case where he wants the guess-distribution with clipping.


> In summary, this proposal illustrates how single and mixture PPDs can be > defined in their 'native' format while providing a reasonable degree of > interoperability with PPD-ignorant receivers. It also supports PPD > clipping and the ability to specify confidence intervals. I believe > this will provide an excellent foundation for sophisticated applications > that use PPDs without unduly burdening recipients that are PPD-ignorant.


Indeed. Now, since this has grown quite sophisticated, as you say, what can we offer to critics about the current use cases? Is GE going to need all these features real soon? I would assume that 95% of the current customer base would not need most of this and a good deal of them would be scared off. So, it's good to have some real stories of the actual business need on hand that can justify why we are doing this now.


> Remaining Issues: > > 1. Including the ppdType attribute in the parent PPD_PQ element appears > to be somewhat redundant, since the ppdType can be determined from the > <ppd_* child elements. Should this attribute be retained in the parent > element?


Given that we start this from the abstract spec, not from the XML, we would probably have the parameters be the properties of specializations and the ppdType is the discriminator for the subtype (dare I say "choice"? No, I won't say "choice" :-) .

This could then easily fall onto xsi:type; it would now suck the distribution type domain into the data type domain, which doesn't frighten me that much.


> 2. For the PPDs 'U', 'TRI' and 'TRP', I have consistently used "a" and > "b" to represent the min and max, and then "c" and "d" are used to > further refine the shape of the distribution. Is the non-consecutive > order of a, b, c, d acceptable? (I have seen both formats used in the > literature.)


That by itself is no problem. But, why not choose names that are more descriptive? I know in math we use short one-letter variables but in computer stuff this has turned out to be hard to read. (While I find that the reverse is also true, i.e., I don't like it if people use multi-letter symbol names in typeset math formulas, it confuses the hell out of me never knowing if they mean to multiply those individual letters or if they take all together as one symbol.) And of course your a, b, c numbering is the standard for the triangle.

Bla, bla, aside, I think more descriptive names would be nice (aren't there greek names for the points of a triangle?)

For U I'd almost like to describe it using an interval, because that's most naturally what it is.



> 3. Using the 'native' parameters for PPDs _and_ requiring the sender to > provide the overall mean/mode (value) and standard deviation (stdDev) > makes it a lot easier to add new PPDs in the future. Should we use more > expressive ppdType codes so that PPD codes we define in the future don't > conflict with existing codes?


The type codes might end up being more expressive by merging them into the data type identifiers altogether.

Otherwise I wouldn't be too concerned. The Johnny-come-latelies can always get longer symbols (though I doubt there are many more left to be added.)


> 4. Should we explicitly indicate that the reported 'value' is the mean, > mode or median?


We should do something more constraining here. Either we would specify what the representative value is for each distribution type (in that case your TRI's "c" would become the same as "value", which would reduce the redundancy in the representation a bit.)


> Footnotes: > > [A] Lines are prefaced by a single quote to preserve indenting, since > leading spaces are removed by the HL7 email server.


I never noticed that. Never had problems with that. Are you sure it's not because you are reading email in a proportional font? Never ever do that. If you want to read my emails, at least, you need to have an 80-character-wide fixed-width page; I recommend VT100 terminals :-) .


> [B] For this email, I will use longer, more descriptive attribute names > for V, SD and TY PPD attributes defined in Ballot #1, but not quite as > long as those defined in Ballot #3.


I wouldn't be so worried about those. The current style is to use the same property names in the XML that we use in the abstract spec. The insight is that all abbreviations are bad (ever tried to work with the UMLS -- an abbreviative nightmare!) If we are concerned about space, we shouldn't use XML (or we can do some post-processing.)

cheers, -Gunther

Paul

Tuesday, June 25, 2002,

Gunther,

Thank you for your comments and your favorable reception of this proposal. I have tried to respond to most of the comments and questions you raised, but a few of them will require a little more thought ...


> -----Original Message----- > From: Gunther Schadow [2] > Sent: Tuesday, June 25, 2002 10:40 AM > To: Schluter, Paul (MED, GEMS-IT) > Cc: Mead Walker; Biron,Paul V; Lloyd McKenzie/CanWest/IBM; 'HL7-CQ'; > 'hl7-xml' > Subject: Re: HL7 PPD representation > > > Schluter, Paul (MED, GEMS-IT) wrote: > > > Gunther, I have taken your concerns to heart regarding the PPD-ignorant > > receiver, and agree with you that the sender should always report the > > 'value' and the 'standardDeviation'. This seems to be about the only > > way to ensure a reasonable degree of interoperability with PPD-ignorant > > receivers. > > > Thanks! This is overall a great proposal with good new ideas. > I'm glad you like it!

> > > This will also give us the freedom to specify new or more sophisticated > > PPDs in the future without breaking basic compatibility with receivers > > that may not understand the new PPD types or those that are simply just > > PPD-ignorant. For example, PPD 'mixtures' (i.e. a PPD that is a > > weighted combination of one or more basic PPDs) and confidence intervals > > can be easily described, based on the proposal below. > > > To accomplish this, we first would re-adopt the XML PPD encoding similar > > to the earlier Ballot #1 definition of a PPD, > > ' <somePPD_PQ T="PPD_PQ" value="4.5" stdDev="0.1" ppdType="N" unit="mmol/L"/> > > > where the 'value' and standard deviation 'stdDev' are required > > attributes that provide the information for the PPD-ignorant receiver > > [A], [B]. This is essentially equivalent to the PPD described in HL7 V3 > > to date. > > > I understand that the "value" is now just loosely defined as "the > representative value for this distribution." It could be the mean > or the median or the mode. Would there be any rules as to what > representative value to use in which distribution? Some guidance > may be needed to avoid people going totally astray. > The use of mean, median or mode can also depend on how the "value" will be used by the recipient. For example, an airline would be interested in the "average" baggage weight, since they are concerned about the total weight. The "mode" (most likely) value would be appropriate for a device or instrument that reports a single value to a clinician. We need to think about this a little more.

> > The unit for the representative value and the standard deviation > may be different. Or one may not have a unit at all. Most notably > with a PPD<TS> the value is a TS and has no unit and the standard > deviation has some unit comparable to 1 s. Or take the old, true, > definition of degree Celsius, where differences are measured in > Kelvin. > I am not convinced that different units are required for the "value" and "stdDev"; couldn't we insist that the same units be used for both? In the case of a PPD<TS>, there is no ambiguity regarding the location (or mu+location) of a PPD along the random variable (time, in seconds, relative to a known and agreed-to epoch) and the standard deviation (again, in seconds). In the case of degrees Celsius, which has an offset relative to degrees Kelvin, the mean/median/mode/PPD_location would be relative to the 'zero' defined for either Celsius, and the scale would still be in Celsius. I suspect I may be digging myself into a hole here, but one could say that the value "value" and the PPD are unitless entities, and are assigned the same units when "unit" is specified for the physical quantity.


> So, I think that we can take the main points of your proposal and > integrate them into the post-ballot 2 schema. > What are the next steps that I should take, once we resolve most of the issues that you and others have raised?


> > The following sections illustrate how additional child elements can be > > used to enhance and extend this simple PPD representation. > > > > 1. PPD WITH 'NATIVE' PARAMETERS > > > A PPD_PQ with a triangular distribution is shown below: > > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.288" ppdType="TRI" unit="mmol/L"> > > ' <ppd_TRI a="4.0" b="5.0" c="4.5"/> > > ' </somePPD_PQ> > > > The attributes 'value' and 'stdDev' provide interoperability with > > PPD-ignorant recipients and the PPD is specified by a child element > > using 'native' parameters. The 'unit' attribute applies to both the > > 'value' and 'stdDev' as well as the child PPD. > > > > You assume that everything uses the same unit. Alas, in case of a TRI > PPD<TS> you would not have units for a, b, and c, but your would for > standard deviation. > I believe a, b and c would have the same units as "value", since they all share the same random-variable axis, which has units of "mmol/L".


> I would prefer to do the straightforward thing on the abstract > specification layer and define a specialization type for each of > the distributions. There the parameters would be just normal > properties and each would come together with its unit if > applicable. > > We can then map the distribution type attribute to xsi:type. > Again, I believe that the child <ppd_* ... /> should be treated as unitless entities that inherit the units from the parent PPD_PQ. Just think of them as being a fuzzy number that could replace the numeric value stored in the "value" attribute.


> > 2. PPD MIXTURE WITH NON-OVERLAPPING PPDs > > > Specifying the native PPD as a child element makes it very easy to > > define more sophisticated PPDs that include 'mixtures' of continuous as > > well as discrete PDFs. For example, a PPD with a central peak and two > > sidelobes could be described as a mixture of three Uniform PPDs: > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" unit="mmol/L"> > > ' <ppd_U a="4.0" b="4.45" prob="0.045"/> > > ' <ppd_U a="4.45" b="4.55" prob="0.91"/> > > ' <ppd_U a="4.55" b="5.0" prob="0.045"/> > > ' </somePPD_PQ> > > > The absolute contribution of each PPD is specified by 'prob' (as in > > 'probability') and the sum of the probabilities should be unity. > > > Is it common terminology to describe the weight in a probability > distribution mix as "probability"? Wouldn't "weight" be a less > confusing word? > 'Weight' would be okay. I have also seen 'relative probability' used, but this could lead to inconsistent implementations.

> What is the semantics? I assume that if f_1, f_2, f_3, ... are the density functions and w_1, w_2, w_3, ... are the weights we would have f(x) = SUM_i [ f_i(x) * w_i ]. Is that correct? Or is it the probability functions F(x) = INT[t=0..x] f(t) dt that we add in this form? Probably the latter, right?

The aggregate PDF(x) would be

   PDF(x) = SUM_i [ pdf_i(x) * w_i ]   for i = 1...*

as you have indicated.


The mean for the aggregate PDF(x) would be

   mean[PDF(x)] = SUM_i [ mean(pdf_i(x)) * w_i ]   for i = 1...*


The standard deviation is a bit more complicated, since the contribution of each pdf_i(x) has to be evaluated relative to mean[PDF(x)] (or relative to x=0, and then subtract (mean[PDF(x)])^2). We would need to provide an additional equation for each PPD type to calculate the standard deviation of the aggregate PDF(x).
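(Editorial note: as a sanity check on these two equations, here is a small sketch, not from the original emails, that pushes the three-uniform mixture from example #2 through them. The variance about the aggregate mean is SUM_i w_i * (var_i + (mean_i - mean)^2), which is the "evaluate relative to mean[PDF(x)]" step described above; the result reproduces the value 4.5 and stdDev 0.0953 reported in that example.)

    from math import sqrt

    # The three non-overlapping Uniform components from example #2: (a, b, weight).
    components = [
        (4.00, 4.45, 0.045),
        (4.45, 4.55, 0.910),
        (4.55, 5.00, 0.045),
    ]

    def uniform_moments(a, b):
        """Mean and variance of a Uniform(a, b) distribution."""
        return (a + b) / 2.0, (b - a) ** 2 / 12.0

    # Aggregate mean: weighted sum of the component means.
    mean = sum(w * uniform_moments(a, b)[0] for a, b, w in components)

    # Aggregate variance: each component contributes its own variance plus the
    # squared offset of its own mean from the aggregate mean, weighted by w.
    var = 0.0
    for a, b, w in components:
        m, v = uniform_moments(a, b)
        var += w * (v + (m - mean) ** 2)

    print("value=%.4f stdDev=%.4f" % (mean, sqrt(var)))   # value=4.5000 stdDev=0.0953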


> > The reported 'value' should be consistent with the PPD mixture; in the > > example above, the reported 'value' is equal to the mean of the three > > Uniform PPDs, and the reported 'stdDev' is also derived from the three > > PPDs. The sender could also report the mode or median as the 'value', > > independent of the underlying PPDs, and we should allow this. > > > O.K., I like that. But would we then not have to also specify exactly > how they relate, and, most of all, wouldn't we have to make the same > rules for simple, non-mixed, distributions? > I believe it would be nice to specify whether the mean, mode or median is used, but then again, it also depends on how the representative value will be used. For example, an airline would be interested in the "average" baggage weight, since they are principally concerned about the total weight. On the other hand, the "mode" (most likely) value would be appropriate for a device or instrument that reports a single value to a clinician.


> On the abstract layer the mixed distribution will look like quite a > complex thing, but we can tame that beast when it comes to the ITS. > We could always start by specifying only a single child <ppd_* ... />, with a footnote that indicates that additional child <ppd_* ... /> elements would be allowed at a future date. I believe that users of PPDs will like the ability to specify an arbitrary PPD, and have a mechanism in place for adding more PPDs in the future.


> > 3. PPD MIXTURE WITH OVERLAPPING PPDs > > > PPDs can also overlap in a 'mixture'. For example, the PPD in #2 could > > also be represented as: > > > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.0953" ppdType="MIX" > > unit="mmol/L"> > > ' <ppd_U a="4.45" b="4.55" prob="0.9"/> > > ' <ppd_U a="4.0" b="5.0" prob="0.1"/> > > ' </somePPD_PQ> > > > where the PPDs are stacked above each other (like a wedding cake). > > > Stacked above? You mean added, right? The functions or the density > functions? > Yes, in both cases the probability density functions add. The point I was making here is that the component <ppd_* ... />s can overlap along the random variable axis.
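(Editorial note: a quick check, not part of the original emails, that the non-overlapping representation in example #2 and the overlapping "wedding cake" representation in example #3 really describe the same density; evaluating both weighted sums at a few points away from the segment boundaries gives identical values.)

    def uniform_pdf(x, a, b):
        """Density of a Uniform(a, b) distribution at x."""
        return 1.0 / (b - a) if a <= x <= b else 0.0

    def mixture_pdf(x, components):
        """Weighted sum of the component densities, i.e. the mixture PDF."""
        return sum(w * uniform_pdf(x, a, b) for a, b, w in components)

    # Non-overlapping representation (example #2) and the overlapping
    # 'wedding cake' representation (example #3) of the same distribution.
    flat = [(4.00, 4.45, 0.045), (4.45, 4.55, 0.91), (4.55, 5.00, 0.045)]
    stacked = [(4.45, 4.55, 0.90), (4.00, 5.00, 0.10)]

    for x in (4.2, 4.5, 4.8):
        print(x, round(mixture_pdf(x, flat), 6), round(mixture_pdf(x, stacked), 6))
    # Both give density 0.1 on the sidelobes and 9.1 on the central peak.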


> > 4. PPD MIXTURE WITH CLIPPING AND CONFIDENCE INTERVALS > > > Two additional and optional elements, 'ppd_minClip' and 'ppd_maxClip', > > can be used to truncate the range of the random variable for a single or > > mixture PPD. In the example below, the Normal distribution is clipped > > at 2.58 times the stdDev, so that the area under each tail equals 0.005. > > > ' <somePPD_PQ T='PPD_PQ' value="4.5" stdDev="0.1" ppdType="N" unit="mmol/L"> > > ' <ppd_N mu="4.5" sigma="0.1"/> > > ' <ppd_minClip a="4.242" prob="0.005"/> > > ' <ppd_maxClip b="4.758" prob="0.005"/> > > ' </somePPD_PQ> > > > Hmm, doesn't the prob attribute now have yet a different meaning here? > I see, you want to specify the probability of the tail rather than that > of the inside the confidence interval. That's like the p-values or > alpha values, right? This allows you to clip on only one end for > doing single sided statistics, right? > Yes, it's like alpha if the area under both tails is considered; that's why I was reluctant to use the term 'alpha' since these were single-sided. And, yes, the intent here was to support single-sided statistics.


> Too sad that we don't actually *see* the confidence-*interval* as an > interval. Anything we could do about that? How about this: > > - there is a clip interval property with > > low ~ minClip > high ~ maxClip > > - plus the confidence level of the inside the interval, i.e., 99% > in the above example. > > - single sided-ness is indicated by having only one finite boundary > and setting the other at infinity. > > -> advantage of this is that you can take this interval as an interval > without having to "think." And you can still get the tail > probabilities. > > -> this does imply, however, that if you have a two-sided interval, > the probabilities of the tails should be the same, or else one > would have to use the probability function to determine the tails > which is kind of cumbersome. Is there ever a need to shift a > confidence interval to one side such that the tails are of > un-equal sizes? > For single-sided distributions, supporting the two clipping elements 'ppd_maxClip' and 'ppd_minClip' makes sense. This would allow the areas under the tails to be different, if that was appropriate to the application. Maybe we can get some other biostatisticians to chime in on this one, as well as review the entire proposal?


> > Independent of whether this distribution is clipped or not, one could > > say that "we are 99% confident that the 'value' ('mu') lies in the > > interval [a,b] = [4.242,4.758]" for the distribution shown above. It > > would also be possible to specify component PPDs within a mixture to > > facilitate easy identification of the '95% percent' and other confidence > > intervals. > > > How would we say that the distribution is clipped vs. isn't? I gather > if we say that the confidence interval has two finite bounds and the > confidence level is 100% then we have clipped the distribution, right? > Yes.

> That would then cover Lloyd's pharmacy case where he wants the guess- > distribution with clipping. > I believe this would, but we would still insist on a well-defined shape for the probability density function, even if it is a 'guess' of some sort.

> > > In summary, this proposal illustrates how single and mixture PPDs can be > > defined in their 'native' format while providing a reasonable degree of > > interoperability with PPD-ignorant receivers. It also supports PPD > > clipping and the ability to specify confidence intervals. I believe > > this will provide an excellent foundation for sophisticated applications > > that use PPDs without unduly burdening recipients that are PPD-ignorant. > > > Indeed. Now, since this has grown quite sophisticated, as you say, > what can we offer to critics about the current use cases? I GE > going to need all these features real soon? I would assume that 95% > of the current customer base would not need most of this and a > good deal would be scared. So, it's good to have some real stories > of the actual business need on hand that can justify why we are > doing this now. > Although it is difficult to pin a date on when we will need this, providing a standard way of representing uncertainty will certainly facilitate research and development in this area. In the long run, clinicians, researchers, patients and our industry will all benefit.


> > > Remaining Issues: > > > 1. Including the ppdType attribute in the parent PPD_PQ element appears > > to be somewhat redundant, since the ppdType can be determined from the > > <ppd_* child elements. Should this attribute be retained in the parent > > element? > > > Given that we start this from the abstract spec, not from the XML, > we would probably have the parameters be the properties of > specializations and the ppdType is the discriminator for the subtype > (dare I say "choice"? No, I won't say "choice" :-) . > > This could then easily fall onto xsi:type, it would now suck the > distribution type domain into the data type domain. Which doesn't > frighten me that much. > > > > 2. For the PPDs 'U', 'TRI' and 'TRP', I have consistently used "a" and > > "b" to represent the min and max, and then "c" and "d" are used to > > further refine the shape of the distribution. Is the non-consecutive > > order of a, b, c, d acceptable? (I have seen both formats used in the > > literature.) > > > That by itself is no problem. But, why not choose names that are > more descriptive? I know in math we use short one-letter variables > but in computer stuff this has turned out to be hard to read. (While > I find that the reverse is also true, i.e., I don't like it if > people use multi-letter symbol names in typeset math formulas, it > confuses the hell out of me never knowing if they mean to multiply > those individual letters or if they take all together as one symbol.) > And of course your a, b, c numbering is the standard for the triangle. > > Bla, bla, aside, I think more descriptive names would be nice > (aren't there greek names for the points of a triangle?) > > For U I'd almost like to describe it using an interval, because that's > most naturally what it is. > That's where all of this started -- a desire to represent a singular value that is reported to a clinician as well as the upper and lower bounds. On the other hand, it really is a PPD, especially if one wants to model an instrument or system.

> > > 3. Using the 'native' parameters for PPDs _and_ requiring the sender to > > provide the overall mean/mode (value) and standard deviation (stdDev) > > makes it a lot easier to add new PPDs in the future. Should we use more > > expressive ppdType codes so that PPD codes we define in the future don't > > conflict with existing codes? > > > The type codes might end up being more expressive with merging them > into the data type identifiers alltogether. > > Otherwise I wouldn't be too concerned. The Jonny-come-lately's can > always get longer symbols (though I doubt there are too much more > left to be added.) > Okay.


> > 4. Should we explicitly indicate that the reported 'value' is the mean, > > mode or median? > > We should do something more constraining here. Either we would > specify what the representative value is for each distribution > type (in that case your TRI's "c" would become same as "value", > which would reduce the redundancy in the representation a bit.) > I'll have to think about this one.

> > > Footnotes: > > > [A] Lines are prefaced by a single quote to preserve indenting, since > > leading spaces are removed by the HL7 email server. > > > I never noticed that. Never had problems with that. Are you > sure it's not because you are reading email in proportional > font? Never ever do that. If you want to read my emails, > at least, you need to have an 80 character wide fixed width > page, I recommend VT100 terminals :-) . > Yes, it occurred with an email with a proportional font. So now I will only use a fixed-width font.

> > > [B] For this email, I will use longer, more descriptive attribute names > > for V, SD and TY PPD attributes defined in Ballot #1, but not quite as > > long as those defined in Ballot #3. > > > I wouldn't be so worried about those. The current style is to use > the same property names in the XML that we use in the abstract > spec. The insight is that all abbreviations are bad (ever tried to > work with the UMLS -- an abbreviative nightmare!) If we are concerned > about space, we shouldn't use XML (or we can do some post-processing.) > > cheers, > -Gunther >

Thanks and regards,

Paul Schluter


Links

Back to Data Types R2 issues