The goal of the Canonical Form is to define criteria and an algorithm for establishing equivalence
between XML documents that are "logically" the same and differ only
in physical form (entities, attribute order, character encoding).
Canonicalization "wipes out" differences that are not significant for applications.
Canonicalization is indispensable in some important applications such as information security, e.g. electronic
signatures of XML data (when calculating the digest).
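The digest use case can be sketched in a few lines of Python. This is only an illustration: it uses `xml.etree.ElementTree.canonicalize` from the standard library (Python 3.8+), which implements C14N 2.0 and may differ in details from the canonical form described on these slides. The helper name `canonical_digest` and the sample `order` documents are assumptions made up for the example.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_digest(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the canonical form of xml_text."""
    canonical = canonicalize(xml_text)  # returns the canonical form as a str
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical documents: same logical content, different physical form
# (XML declaration, attribute order, quote style).
doc_a = '<?xml version="1.0"?>\n<order id="7" currency=\'EUR\'><item>pen</item></order>'
doc_b = '<order currency="EUR" id="7"><item>pen</item></order>'

# Digests of the raw text differ because the byte sequences differ.
raw_a = hashlib.sha256(doc_a.encode("utf-8")).hexdigest()
raw_b = hashlib.sha256(doc_b.encode("utf-8")).hexdigest()
print("raw digests equal:      ", raw_a == raw_b)

# Digests of the canonical forms are computed over identical text.
print("canonical digests equal:", canonical_digest(doc_a) == canonical_digest(doc_b))
```

Computing the digest over the canonical form rather than the raw text is what lets a signature survive harmless re-serialization of the document.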
Canonical Form - principles
Main principles for constructing the canonical form of an XML document:
encoding in UTF-8
line breaks (CR, LF) normalized according to the algorithm given in the XML 1.0 specification
attribute values normalized
references to characters and parsed entities replaced by their content
CDATA sections also replaced by their content
XML declaration and DTD reference removed
Canonical Form - principles (contd)
whitespace outside of the root element normalized
otherwise (except for line breaks), whitespace is preserved
attribute values always in double quotes "
special characters in attribute values replaced by character or entity references
superfluous namespace declarations removed
default attribute values added to all elements where relevant
attributes and namespace declarations ordered lexicographically
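Several of the principles above can be observed with a small Python sketch. It again relies on the standard library's `xml.etree.ElementTree.canonicalize` (a C14N 2.0 implementation, so not every rule listed here is guaranteed to be applied in exactly the same way, for example DTD default attributes). The `note` documents are hypothetical sample data.

```python
from xml.etree.ElementTree import canonicalize

# Variant A uses an XML declaration, single-quoted and unordered attributes,
# a CDATA section, a character reference, a CRLF line break and an
# empty-element tag.
variant_a = (
    '<?xml version="1.0"?>\n'
    "<note lang='en' id='n1'>"
    "<![CDATA[1 < 2]]> caf&#xE9;\r\n"
    "<empty/>"
    "</note>"
)

# Variant B expresses the same logical content in a plainer physical form.
variant_b = '<note id="n1" lang="en">1 &lt; 2 café\n<empty></empty></note>'

canonical_a = canonicalize(variant_a)
canonical_b = canonicalize(variant_b)

print(canonical_a)
print("same canonical form:", canonical_a == canonical_b)
```

The two inputs differ in almost every physical detail, yet a comparison of their canonical forms treats them as the same document.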
Issues with Canonical Form
Some information is lost (mostly information from the DTD):
unparsed entities (e.g. binary ones) are no longer accessible after canonicalization
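The loss of DTD information can be illustrated with the following Python sketch, under the assumption that the standard library's `xml.etree.ElementTree.canonicalize` (C14N 2.0) behaves like the canonical form described here in this respect. The entity name `logo`, the notation `gif` and the attribute `img` are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# A hypothetical document whose internal DTD subset declares an unparsed
# (binary) entity "logo" and an attribute of type ENTITY that refers to it.
source = (
    "<!DOCTYPE doc [\n"
    '  <!NOTATION gif SYSTEM "image/gif">\n'
    '  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>\n'
    "  <!ATTLIST doc img ENTITY #IMPLIED>\n"
    "]>\n"
    '<doc img="logo"/>'
)

canonical = canonicalize(source)
print(canonical)

# The attribute value survives, but the declarations that mapped "logo" to
# an external resource are no longer part of the output.
print("DTD kept:", "DOCTYPE" in canonical)
```

A consumer of the canonical form sees only the bare name in the attribute and has no way to find out which external resource it referred to.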