Schema based code generation
Contents
Summary
Code generation is a process whereby source code (in a particular programming language) is automatically generated from a model. It is an example of Model Driven Software Development. In principle the best (most complete) code would be generated from the MIF, since the MIF contains the full model details. Code generation is the mechanism used to create most RIMBAA applications.
An alternative code generation method is MIF based code generation. The (dis-)advantages of Schema-based code generation versus MIF based code generation include:
- An advantage of schema based code generation is the wide availability of tools. MIF based code generators exist for Java and .net - but the choice is much more limited.
- A serious disadvantage of schema based code generation is the fact that the XML schema language isn't powerful enough to express all of the constraints as contained in the MIF. The MIF contains the full details of the HL7 v3 model. The XML schema of the HL7 v3 model is derived from the MIF - with a loss of a significant amount of detail.
- Note: (November 2010) XML Schema 1.1, a yet-to-be-finalized W3C specification, supports many of the desired features, which makes it more suitable than XML Schema 1.0 for expressing the v3 model requirements. It has yet to be determined whether most XML tools support version 1.1 - that would be a prerequisite for HL7 to start generating XML Schema 1.1 schema. The current HL7 v3 schema are based on XML Schema 1.0.
- If it is a design goal to validate templates in software (as opposed to XML-based template validation, e.g. using Schematron): templates are expressed in MIF, and not in the form of Schema. As such one can't use schema based code generation for templates.
Choice of ITS
When it comes to XML schema one has a choice between two ITSs:
- The XML ITS (v1.1), in use since the inception of HL7 v3, where clone-name based schema are generated for each and every R-MIM/CIM.
- The RIM ITS, defined in 2010, where one single RIM-based schema (with about 50 classes) is used for all RIM-based object instances.
From the viewpoint of code generation the XML ITS schema are much more specific than the RIM ITS schema. Currently (2010) all known code generation projects are based on the XML ITS. Neither set of schema is able to perform a complete validation.
XML ITS 1.1
This section assumes that one uses the XML ITS 1.1 schema (and not the new RIM ITS schema).
- Note: schema based code generation is discussed in detail in the following tutorial: Implementation Mechanics (PPT). The tutorial has a Creative Commons license.
Optimize the schema
The XML schema (as published by HL7) aren't optimized for code generation. Schema serve more than one purpose - design, validation, contract, and code generation - and these purposes often require different schema. Prior to performing the code generation process one should therefore transform the schema to optimize them for code generation and code re-use.
The following are the main optimization methods used prior to the code generation process:
- Flatten the schema. Remove all includes from the schema and create one single schema file.
- Simplify the data types.
- Simplify the datatypes.xsd schema, by removing all unnecessary [for code generation] hierarchies from the definition, and removing all features from the data type definition that won't be used in the context of a particular [code generation] project.
- Replace all HL7 v3 data types that have a direct (functional) equivalent in the XML schema language with their equivalents. The generated code will be smaller, and won't reference the hierarchical data type definition as defined by HL7.
- Examples: replace ST and CS with xsd:string, and TS with xsd:date.
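The data type replacement step above can be sketched as a simple text transform. The type mapping below follows the examples in this section (ST and CS to xsd:string, TS to xsd:date); the schema fragment and function are illustrative, not part of any published HL7 tooling.

```python
# Sketch: replace selected HL7 v3 data types with their functional
# XSD built-in equivalents in a schema fragment. The mapping and the
# fragment are illustrative only.
import re

# Mapping per the examples in this section (an assumption for TS,
# which could equally map to xsd:dateTime depending on the project)
TYPE_MAP = {"ST": "xs:string", "CS": "xs:string", "TS": "xs:date"}

def simplify_types(schema_text: str) -> str:
    """Rewrite type="ST" etc. to the mapped XSD built-in type."""
    def repl(match):
        hl7_type = match.group(1)
        return 'type="%s"' % TYPE_MAP.get(hl7_type, hl7_type)
    return re.sub(r'type="([A-Z]+)"', repl, schema_text)

fragment = '<xs:element name="id" type="ST"/>'
print(simplify_types(fragment))
# <xs:element name="id" type="xs:string"/>
```

A real implementation would operate on the parsed schema (or use XSLT) rather than on raw text, but the principle is the same: the generated code then references xs:string directly instead of the hierarchical HL7 data type definitions.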
- Replace element names and attribute names by more 'readable' names.
For readability: make the schema resemble the instance - readable schema generate readable code. The schema are full of type names that are automatically generated by HL7 tooling. A disadvantage of this step is that upon serialization of an object tree one has to transform the element/attribute names back to their original names as present in the published HL7 v3 schema.
- Replace all CMET flavours by a generic CMET flavour.
- If one has to support multiple v3 interactions it is likely that multiple flavors (variations with different levels of richness when it comes to attributes and classes) of one and the same CMET are being used. Prior to code generation one could replace the various flavors with the most generic flavor (known as the universal flavor) of the CMET. This process is possible because all CMET flavors are constrained versions of the universal flavor.
- This approach has the advantage of increased re-use of code: all CMET flavors can be processed by code generated from the universal flavor. A disadvantage is that when one uses the generated code to create/encode a v3 instance, one will have to create additional code to ensure that the serialization complies with the constraints of the CMET flavor used in the original interaction. This approach is therefore best used for the interactions one receives. Given that a typical application receives far more messages (different message types) than it sends (e.g. lots of systems receive ADT messages and send next to nothing) this may be a worthwhile strategy for an implementer to pursue. Additional coding will be needed upon (or: prior to) serialization to ensure that only those parts allowed by the original CMET flavor end up in the XML instance.
- Example: replace all usage of the schema for the R_Patient[identified] and R_Patient[identified/confirmable] CMET by the R_Patient[universal] schema.
Many of the above optimization steps can be dealt with by an automated process, i.e. by means of a XSLT transform of the XML schema as published by HL7.
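As an illustration of such an automated step, the sketch below flattens xs:include references into a single schema document using the Python standard library. In practice HL7 implementers would typically use XSLT for this, and this sketch only handles the simple case of top-level includes; the file names and schema content are invented for the example.

```python
# Sketch: flatten xs:include references into one schema document.
# Handles the simple single-level case; real schemas need recursive
# resolution and duplicate handling.
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def flatten(schema: ET.Element, resolve) -> ET.Element:
    """Replace each xs:include with the children of the included
    schema. `resolve` maps a schemaLocation to a parsed xs:schema."""
    for i, child in enumerate(list(schema)):
        if child.tag == "{%s}include" % XS:
            included = resolve(child.get("schemaLocation"))
            schema.remove(child)
            for j, decl in enumerate(list(included)):
                schema.insert(i + j, decl)
    return schema

main = ET.fromstring(
    '<xs:schema xmlns:xs="%s">'
    '<xs:include schemaLocation="datatypes.xsd"/>'
    '<xs:element name="patient" type="xs:string"/>'
    '</xs:schema>' % XS)
datatypes = ET.fromstring(
    '<xs:schema xmlns:xs="%s">'
    '<xs:simpleType name="cs"><xs:restriction base="xs:token"/></xs:simpleType>'
    '</xs:schema>' % XS)

flat = flatten(main, lambda loc: datatypes)
print([e.tag.split("}")[1] for e in flat])
# ['simpleType', 'element']
```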
Improve the level of code re-use
Suppose one has to generate code for 10 different (but related) HL7 v3 interactions. Each of those interactions consists of two wrappers (Transmission Wrapper and ControlAct Wrapper) and may reference a number of CMETs. If one doesn't optimize for code re-use each and every interaction schema will produce code for the wrappers. In order to improve the level of code re-use the following approaches could be taken:
- Generate individual schema for generic model elements prior to code generation
- Instead of generating code based on the interaction schema: create a new set of XML-schema for the constituent parts of the interaction schema, i.e. a Transmission Wrapper schema with an xs:any payload (and xs:any wherever a CMET is being used); a ControlAct wrapper schema with an xs:any payload (and xs:any wherever a CMET is being used); a Payload model schema (and xs:any wherever a CMET is being used); and CMET schema. Generate code for these schema. When using the generated code one has to link/use the appropriate blocks of code whenever one encounters the equivalent of an xs:any in the generated code.
- Given the example above: the 10 related interactions may be based on two different Message Wrappers, 3 different ControlActs, 7 different Payload models, and 10 CMETs. That's a considerable level of code re-use.
- Tooling hint: in order to detect similarities/differences between different (versions of) schema, see this description of a Schema Diff tool.
- Detect overlapping bits of code after generating code
- After having generated the source code one could try and detect (using features of the programming platform, or using a string comparison tool) duplicate (or: similar) code. Some programming platforms (e.g. Java) allow the programmer to create 'virtual frontend classes' which are linked/associated with the entire set of duplicate class structures.
- Using the above example: the generated code for the 10 interactions would have overlapping class definitions for the wrappers, for the CMETs, and for parts of the Payload model.
The method most often used seems to be the first (generating individual schema for generic model elements), with optional use of the second (detecting duplicate code) for Payload models, to optimize the code generated from the Payload schema.
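The shape of the code produced by the first approach can be sketched as follows: each xs:any slot in the split schemas becomes a generic payload field in the generated class, and the blocks are linked at those seams. All class and field names here are hypothetical, not the output of any real generator.

```python
# Sketch: generated-code shape when wrappers, ControlActs, payload
# models and CMETs are generated separately, with xs:any slots as
# generic fields. Names are illustrative only.
from dataclasses import dataclass
from typing import Any

@dataclass
class TransmissionWrapper:
    interaction_id: str
    payload: Any          # xs:any slot: holds a ControlActWrapper

@dataclass
class ControlActWrapper:
    trigger_event: str
    payload: Any          # xs:any slot: holds a domain payload model

@dataclass
class PatientRegistration:  # one of the (hypothetical) payload models
    patient_id: str

# Link the independently generated blocks at the xs:any seams:
msg = TransmissionWrapper(
    interaction_id="PRPA_IN201301",
    payload=ControlActWrapper(
        trigger_event="PRPA_TE201301",
        payload=PatientRegistration(patient_id="12345")))
print(msg.payload.payload.patient_id)
# 12345
```

The re-use follows directly: the same TransmissionWrapper and ControlActWrapper classes serve all ten interactions, and only the payload models differ.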
RIM ITS
This section assumes that one uses the RIM ITS schema (and not the older XML ITS schema).
The schema associated with the RIM ITS is very generic (there is essentially one core schema with 50 class definitions), which means one has to rely on the presence of "template names" in the instance in order to match the classes to the definitions in implementations such as CDA or v3 messaging specifications. Note that template names include both classic R-MIM clone names (CIM names, to be precise) and template IDs.
In the case of the RIM ITS, the schema does not perform validation based on the fixed values defined in the applicable R-MIMs; instead some other technique is required, either Schematron or software (note: the Eclipse OHF code has all the pieces and will be adapted to do this). This isn't necessarily a problem for generated code - it depends on how much validation is appropriate.
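The software alternative to Schematron can be sketched as a constraint table checked against each instance. The constraint values below are invented for illustration; in a real system they would be derived from the MIF or from template definitions.

```python
# Sketch: software validation of fixed values that the schema no
# longer enforces under the RIM ITS. The constraint table is a
# hypothetical stand-in for MIF/template-derived constraints.
FIXED_VALUES = {
    # (RIM class, attribute): value fixed by the (hypothetical) R-MIM
    ("Observation", "classCode"): "OBS",
    ("Observation", "moodCode"): "EVN",
}

def check_fixed_values(rim_class: str, attrs: dict) -> list:
    """Return a list of violated fixed-value constraints."""
    errors = []
    for (cls, attr), required in FIXED_VALUES.items():
        if cls == rim_class and attrs.get(attr) != required:
            errors.append("%s.%s must be %r" % (cls, attr, required))
    return errors

print(check_fixed_values("Observation",
                         {"classCode": "OBS", "moodCode": "RQO"}))
# ["Observation.moodCode must be 'EVN'"]
```

How much of this checking is appropriate depends on the application; a receiver that only extracts a few attributes may skip it entirely.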
A key point of the RIM ITS is that schema based code generation yields one set of generated classes: those of the RIM.
- Any RIM ITS instance loads into the same set of classes using a standard XML engine
- There are multiple XML ITS implementations that do this (e.g. Java SIG, Mohawk, Eclipse) - these aren't needed with the RIM ITS.
- As the RIM ITS is new, for now there will be a need to interconvert between the RIM ITS and the XML ITS.
- Eclipse will provide a jar that performs these conversions; an XSLT could be envisaged too.
- MIF based code generation could still be useful with the RIM ITS, because it offers a layer that links the RIM classes of the RIM ITS with the clone definitions from the MIF.
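The idea that any RIM ITS instance loads into the same small set of classes, with the clone name carrying the model-specific semantics, can be sketched as below. The Act class, element name, and clone table are illustrative; a real clone table would be derived from the MIF.

```python
# Sketch: loading a RIM-ITS-style instance into one fixed set of RIM
# classes, using the clone/template name from the instance to look up
# its role. Names and the clone table are illustrative only.
import xml.etree.ElementTree as ET

class Act:
    """Stand-in for one of the ~50 generic RIM classes."""
    def __init__(self, class_code, clone_name):
        self.class_code = class_code
        self.clone_name = clone_name

# Hypothetical MIF-derived table linking clone names to R-MIM roles
CLONE_TABLE = {"observationEvent": "payload observation"}

def load(xml_text: str) -> Act:
    el = ET.fromstring(xml_text)
    return Act(el.get("classCode"), el.tag)

act = load('<observationEvent classCode="OBS"/>')
print(act.class_code, CLONE_TABLE[act.clone_name])
# OBS payload observation
```

This is exactly the layer the MIF can supply: the generic class (Act) comes from the RIM ITS, while the clone table linking clone names to their R-MIM definitions comes from the MIF.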