
Safe interpretation of subsets of data


Overview

If, as a receiving application, I were to process only a subset of all of the data received, what minimal subset should that be to ensure that there is no loss of semantics?

Details

More elaborate use-case description:

  1. If I decide to persist only a subset of the data I receive, what subset (in terms of acts, attributes and associations) should be persisted at a minimum in order to preserve the (key) semantics?
  2. If I receive a response message with a large object structure, which parts of that data (in terms of acts, attributes and associations) do I really have to process in order to preserve the (key) semantics?
  • Rene: the answer will likely be 'it depends'. However, one could perhaps define an absolute minimal set and a maximum set (= the entire data set). And yes, negationInd and nulls would definitely be part of the answer. For the second use case, contextConduction would be part of it as well. A rough sketch of such a minimal subset is shown below.
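
The sketch below (Python) illustrates the kind of per-act subset one might persist so that negation, null flavors and context conduction are not silently lost. The field names and the selection itself are assumptions made purely for illustration, not a normative minimal set.

```python
# Hypothetical illustration of a per-act subset that keeps the attributes which change
# the meaning of a statement: negationInd, null flavors, mood, and context conduction.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersistedAct:
    class_code: str                         # e.g. "OBS"
    mood_code: str                          # "EVN" vs. "RQO" changes the meaning entirely
    code: Optional[str] = None              # coded concept, if any
    code_null_flavor: Optional[str] = None  # null flavors carry meaning and must be kept
    negation_ind: bool = False              # dropping negationInd inverts the statement
    value: Optional[str] = None
    value_null_flavor: Optional[str] = None
    status_code: Optional[str] = None
    subject: Optional[str] = None           # overriding subject, if one was stated
    outbound: List["PersistedRelationship"] = field(default_factory=list)


@dataclass
class PersistedRelationship:
    type_code: str                          # e.g. "COMP", "SUBJ", "RSON"
    context_conduction_ind: bool            # governs whether parent context conducts down
    target: PersistedAct = None
```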

Discussion

  • (Peter Hendler, 20100516) Bob Dolin explained a bit more about safe and unsafe querying of CDA object nets.
    • Many vendors use template IDs to evaluate incoming CDA documents. In effect they have a variable for each data slot. Of course some types of data can be repeated: for example, if there is an active problems section, there can be zero to many problems.
    • By knowing the data slot they assume they know the context.
    • For example, suppose we search our database for patients who have Diabetes, presumably for decision support or reporting.
    • If we process incoming documents, we can start at the top (the root) and navigate down; but if we are searching a patient database with maybe a million or more patients, we can't afford to start at the root and navigate to the problem sections just to see whether a patient has Diabetes. Much more likely we would search for Observations that have codes for Diabetes. Assuming we then get a large list of Observations coded for Diabetes, we are faced with a few problems.
      1. Who is the Observation on? Can we assume it’s the subject of record?
      2. Is the Observation saying there is an active problem of Diabetes, or might it be saying something else, such as: family history of Diabetes, no active problem of Diabetes, no family history of Diabetes, or Diabetes has been ruled out? Any of these statements would logically contain a RIM Observation with a coded term for Diabetes, yet because of context would not state that the person of record has active Diabetes.
    • Where would you have to search in order to distinguish an Observation that means active Diabetes from one that means one of the above things instead?
    • If you are just searching by template ID, you are not taking into consideration any of the possible modifying factors. That would be safe if the templates were “closed”, but all the current ones are “open”, which means there may be context that is not part of the template referred to by the template ID at all.
    • The Observation itself might have an actionNegationInd, which indicates that the Observation was not made at all, or it may have a valueNegationInd, which indicates that the Observation did occur but that there was no Diabetes.
    • The context of the Observation may have been altered somewhere between itself and the root of the document, and even if it wasn’t, we still don’t know from the Observation itself who the subject of the Observation is.
    • Do we have to navigate back through all the ActRelationships to the root in order to search for places where the context from the root may have been overridden? Probably we do. Another problem then occurs: how do we interpret the “sections” of the structured body? For example, what if the Observation occurs in a section called “Family History”, and, to make it more difficult, what if the statement only says that there is a family history of Diabetes without stating in whom? In that case the “subject of record” will never have been overridden with, for example, “father of subject”. The only indication that this Observation with a value of Diabetes does not refer to the “subject of record” is the fact that it is in a section called “Family History”.
    • Since CDAs can be in any language, and since there is no law that you have to name the section “Family History”, we have an intractable problem. If there is a LOINC code or some other way to find out that it is a Family History section then we are OK, but if not, the only way of knowing that distal Observations do not refer to the subject of record is to be a human and interpret the language in the title of the section.
    • So there must be “best practices”, because in the situation described above there is no way for the machine to know that the Observation of Diabetes does not refer to the subject of record, since that context has never been formally overridden. The point is that in some pathological cases you cannot do any safe query on whether the subject of record had Diabetes, and a best practice might insist that all sections carry a machine-interpretable code so that a Family History section can always be recognized.
    • In summary, to safely query the object net for Observations of Diabetes that are both an active problem and refer to the subject of record, you need to check the immediate context of the Observation (actionNegationInd, subject participations, etc.), you must know what section the Observation occurs in so that you can, for example, throw out Family History, and you must navigate all the way to the root either to find the subject of record (assuming the database didn’t pre-conduct this information and store it nearer to the Observation) or to find out whether the subject of record was overridden. A sketch of these checks appears after this list.
    • Bob suggested a test in which one party would create a set of test CDAs and the other party would write the queries. After the queries are run, the false positives and false negatives can be tallied (a sketch of such a tally also appears below).
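
As an illustration of the checks summarized above, the following is a minimal sketch (Python with lxml) of filtering Observations coded for Diabetes in a CDA R2 instance. The file name, the SNOMED CT code for Diabetes mellitus (73211009) and the LOINC code used to recognize a Family History section (10157-6) are assumptions made for the example; the sketch ignores templates, nullFlavors, nested organizers and many other cases, so it is not a complete or safe implementation.

```python
# Minimal sketch of the checks described above, against a CDA R2 instance.
# Assumed values: SNOMED CT 73211009 for Diabetes mellitus, LOINC 10157-6 for a
# family history section, and the file name example_cda.xml.
from lxml import etree

NS = {"h": "urn:hl7-org:v3"}
DIABETES_CODE = "73211009"          # assumed SNOMED CT code for Diabetes mellitus
FAMILY_HISTORY_SECTION = "10157-6"  # assumed LOINC code for a family history section


def candidate_observations(doc):
    """The cheap index-style lookup: Observations whose value is coded for Diabetes."""
    return doc.xpath('//h:observation[h:value/@code=$c]', c=DIABETES_CODE, namespaces=NS)


def is_negated(obs):
    """negationInd on the Observation negates the statement."""
    return obs.get("negationInd") == "true"


def subject_overridden(obs):
    """A subject participation on the Observation or any enclosing element means the
    document-level subject of record no longer applies."""
    for el in [obs, *obs.iterancestors()]:
        if el.find("h:subject", namespaces=NS) is not None:
            return True
    return False


def enclosing_section_code(obs):
    """Code of the nearest enclosing section, if that section carries one."""
    for section in obs.iterancestors("{urn:hl7-org:v3}section"):
        code_el = section.find("h:code", namespaces=NS)
        return code_el.get("code") if code_el is not None else None
    return None


def looks_like_active_diabetes_of_record(obs):
    """Keep only Observations that survive the context checks discussed above."""
    return (not is_negated(obs)
            and not subject_overridden(obs)
            and enclosing_section_code(obs) != FAMILY_HISTORY_SECTION)


doc = etree.parse("example_cda.xml")
hits = [o for o in candidate_observations(doc) if looks_like_active_diabetes_of_record(o)]
```

The point of the sketch is only that every filter beyond the initial code match requires looking outside the Observation itself: at its attributes, its participations, and its ancestors up to the document root.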
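
Bob's proposed exercise is straightforward to score. The fragment below is a hypothetical sketch of tallying false positives and false negatives, given the set of document identifiers a query returned and the set the test author intended it to match.

```python
# Hypothetical scoring for the proposed exercise: one party authors test CDAs and records
# which of them should match the question; the other party's query returns a result set.
def tally(expected_ids: set, returned_ids: set):
    false_positives = returned_ids - expected_ids   # matched, but should not have
    false_negatives = expected_ids - returned_ids   # should have matched, but did not
    return len(false_positives), len(false_negatives)


fp, fn = tally({"cda-001", "cda-007"}, {"cda-001", "cda-004"})
print(f"false positives: {fp}, false negatives: {fn}")   # prints: 1, 1
```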