Understanding The SGML Declaration



2. Character Sets in the SGML Declaration

This chapter explains what character sets are, how they are identified and defined in an SGML Declaration, and how they are used in an SGML document.

2.1. Character Sets

A character set is a set of numbers, each of which corresponds to a character. Most commonly, there are either 128 or 256 different character numbers in a character set, and each character number has a role assigned to it. For example, in the ASCII character set, character number 65 is the upper-case letter "A". In the EBCDIC character set, on the other hand, the role of the upper-case letter "A" is assigned to character number 193 (number 129 is the lower-case letter "a"). These distinctions do not usually affect users of computers, because computer software ensures that when the "A" key is pressed at a keyboard (with the shift key held down), an upper-case A appears in the right spot on the screen. The distinctions do become important in a few cases, such as:

  1. when a file is received that was created on a computer that assigned different roles to character numbers;

  2. when a character is required that is not normally supported by a particular computer, such as, in many cases, accented and non-Latin letters; and

  3. when one needs to specify that a control character, such as backspace, that does not normally appear in stored documents, is to be allowed.
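
The difference between two character sets can be seen in a short Python sketch. It assumes that Python's built-in "cp037" codec is an acceptable stand-in for EBCDIC; EBCDIC exists in many variants, and this one is chosen only for illustration. The function name is likewise invented for the example.

```python
# Look up the character number that a character set assigns to a character.
# Assumption: "cp037" is used as a representative EBCDIC code page; other
# EBCDIC variants may assign some characters differently.

def character_number(char, charset):
    """Return the number that `charset` assigns to the single character `char`."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError("not a single-byte character in this character set")
    return encoded[0]

ascii_a = character_number("A", "ascii")
ebcdic_a = character_number("A", "cp037")

# The same letter is stored as two different numbers.
print("ASCII:", ascii_a)
print("EBCDIC (cp037):", ebcdic_a)
```

A file written on one kind of system and read on the other without conversion would therefore show the wrong characters, which is the first of the cases listed above.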

2.2. Using Character Sets In The SGML Declaration

One of the main jobs of the SGML Declaration is to describe the character set used in the prolog and the document instance. Most SGML Declarations simply select a well-known character set, such as ISO 646, the international version of ASCII, and specify that characters are to have the meanings defined by that character set.

SGML systems usually provide a default SGML Declaration, which in most cases uses ISO 646. If this character set and the other specifications provided by the system (including the default or "reference" values for the delimiters, name characters and quantities) are appropriate for a document, no SGML Declaration need be provided. If some variation from the default is required, an SGML Declaration must be specified that includes a definition of the character set used in the document.

There are, in fact, two character sets defined in an SGML Declaration:

  1. the "document" character set, which is what has been described above, and

  2. the "syntax-reference character set", which is used in defining the "concrete syntax" (the delimiters, characters allowed in names, etc.).

If a default concrete syntax is used, such as the "Reference Concrete Syntax" or the "Core Concrete Syntax", the syntax-reference character set is supplied by the default concrete syntax. Otherwise it has to be defined in the SGML Declaration, just like the document character set. The syntax-reference character set is kept separate from the document character set so that the concrete syntax can be defined independently of the document character set. For example, the start-tag open delimiter can be defined once and then used with many character sets: if the start-tag open delimiter is defined as "<" in the concrete syntax, then it will be whatever character in the document character set is assigned the role of the character "<".

Both the document character set and the syntax-reference character set are defined in terms of one or more "base character sets". The main way a character in the document or syntax-reference character set is assigned a role is to associate its character number with the character number of a character in a base character set. Other ways of assigning roles are:

  1. to associate a string with a character number, which simply marks the character as an allowed character with no other special meaning to the SGML parser, and

  2. (for the syntax-reference character set) to associate a character number with a function role, such as record-end or tab.

If a string is associated with a character number, its text has no meaning to the SGML parser, other than to indicate that the character number is that of an allowed data character. The text is, however, intended to be meaningful to human readers of the SGML Declaration.
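
The ways of assigning roles described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions, not the behavior of any particular SGML parser: each entry is taken to be a triple of the first character number being described, the number of consecutive characters, and either a base-set character number, a descriptive string, or a marker for numbers that are not used. The function name, the data layout and the sample entries are all hypothetical.

```python
UNUSED = None  # marker for character numbers that are not allowed at all

def describe_character_set(entries):
    """Expand (first_number, count, target) triples into a dict of roles.

    target is an int (the first corresponding base-set character number),
    a str (a description meant for human readers only), or UNUSED.
    """
    roles = {}
    for first, count, target in entries:
        for offset in range(count):
            number = first + offset
            if isinstance(target, int):
                # Role inherited from the base character set.
                roles[number] = ("base", target + offset)
            elif isinstance(target, str):
                # Allowed data character with no special meaning to the parser.
                roles[number] = ("described", target)
            else:
                roles[number] = ("unused", None)
    return roles

# Hypothetical description of a 128-character set, for illustration only.
entries = [
    (0, 9, UNUSED),
    (9, 2, 9),
    (11, 2, UNUSED),
    (13, 1, 13),
    (14, 18, UNUSED),
    (32, 95, 32),
    (127, 1, "hypothetical private-use character"),
]
roles = describe_character_set(entries)
print(roles[65])   # role taken from the base character set
print(roles[127])  # described by a string only
print(roles[0])    # not used
```

Keeping the three kinds of target distinct mirrors the text: only the first kind gives a character a meaning the parser can use, while a string merely marks the number as allowed.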

Characters in the syntax-reference character set acquire meanings either by being given a meaning explicitly, or by inheriting a meaning from the base character set. For example, the Latin letters and the Arabic digits are roles assigned to character numbers only by a base character set.

The meanings assigned to characters in the syntax-reference character set are transferred to the corresponding characters in the base character set used. In turn, meanings are transferred from a base character set to a document character set.

In hindsight it is clear that this system of assigning roles to characters is overly complex. It would have been simpler if ISO 8879, the International Standard that defines SGML, allowed the concrete syntax to be defined in terms of the document character set. This would necessitate changes whenever a new document character set was defined, but this does not happen often, and the number of changes would be small in most cases.

2.3. Public Identifiers

Public identifiers are used in a number of places in an SGML Declaration. Clause 13 of ISO 8879 describes where public identifiers can appear and, to some extent, what they must look like and what they mean. A public identifier is used in an SGML Declaration for one of three purposes:

  1. to select a predefined concrete syntax (delimiters, characters allowed in names, etc.),

  2. to select a predefined set of capacity settings (maximum number of elements allowed etc.), or

  3. to select a base character set.

When the FORMAL feature of SGML is enabled, public identifiers are required to conform to the syntax of a "formal public identifier". A formal public identifier is composed of four major parts:

  1. the identity of the owner or supplier of the text in the file or system described by the public identifier, or the identity of the body (usually an international or national standards body) that has registered the text;

  2. the nature of the text (document type definition, text, character set, etc.);

  3. the name of the text; and

  4. information regarding how the text is coded, or how it is to be used.

For example, the formal public identifier "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0" identifies text:

  1. defined in the 1983 version of international standard ISO 646,

  2. which describes a character set,

  3. called "International Reference Version (IRV)", and

  4. identified by the designating sequence "ESC 2/5 4/0".
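
The four-part breakdown above can be reproduced with a small Python sketch. It is deliberately simplified: it assumes the last two "//" separators divide the identifier into owner, text, and coding information, and that the first word of the middle part is the nature of the text. It does not validate an identifier against ISO 8879, and the class and function names are invented for the example.

```python
from typing import NamedTuple

class FormalPublicIdentifier(NamedTuple):
    owner: str        # owner, supplier or registering body
    text_class: str   # nature of the text, e.g. CHARSET
    description: str  # name of the text
    coding: str       # language or designating sequence

def split_formal_public_identifier(identifier):
    """Split a formal public identifier into its four major parts.

    Simplified sketch: splitting from the right keeps a leading "+//" or
    "-//" together with the owner part.
    """
    owner, middle, coding = identifier.rsplit("//", 2)
    text_class, _, description = middle.partition(" ")
    return FormalPublicIdentifier(owner, text_class, description, coding)

fpi = split_formal_public_identifier(
    "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
)
for field, value in zip(fpi._fields, fpi):
    print(f"{field}: {value}")
```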

2.4. Identifying Character Sets

Clause 10.2 of ISO 8879 describes the format of formal public identifiers, including the exact form that each part can take. Clause 13.1.1.1 specifies that a formal public identifier that identifies a base character set must have a type of "CHARSET". Clause 10.2.1.1 provides that "CHARSET" formal public identifiers can be designated as being registered with an ISO registration number, as an alternative to the number of an ISO standard or the name of another body (preceded by "+//" if the owner identifier is registered, or by "-//" if it is unregistered).

Clause 10.2.2 specifies that the fourth part of a formal public identifier be either the identification of a language (for example, "EN" for English) or a designating sequence (if the type of the formal public identifier is "CHARSET"). A designating sequence is further described in Clause 10.2.2.4 as one prescribed by ISO 2022, a standard that describes how to identify 7- and 8-bit character sets (7-bit character sets have 128 characters, and 8-bit character sets have 256 characters).

The designating sequences prescribed in ISO 2022 are designed to be embedded in files that use the character sets they designate. ISO 2022 also allows applications and other ISO standards to agree on a designation in some other manner. The designating sequence part of "CHARSET" formal public identifiers is where this agreement is recorded for character sets used by documents processed by an SGML parser.

2.5. Escape Sequences and Designating Sequences

A large part of ISO 2022 is taken up with defining the meaning of sequences of characters that start with the "escape" control character ("ESC"). ISO 2022 supports extended character sets by allowing up to two sets of 32 control characters and up to four sets of 94 or 96 graphic characters (not counting space and the "delete" character in the case of 94-character graphic sets). The control character sets are called C0 and C1. The graphic character sets are called G0, G1, G2 and G3. Larger character sets are made by combining smaller character sets.

Individual characters are specified in ISO 2022 as a pair of numbers separated by a slash or by a name, such as ESC. The first of a pair of numbers gives the value of the upper three or four bits of a 7- or 8-bit character. The second gives the value of the lower four bits. So 2/5 indicates character number 37 ("%") and 4/0 number 64 ("@").
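
The arithmetic behind this notation is simple enough to sketch in Python: the first number is multiplied by 16 and the second is added. The function names below are invented for the example.

```python
def to_character_number(position):
    """Convert ISO 2022 "column/row" notation such as "2/5" to a number."""
    column, row = (int(part) for part in position.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0 to 15")
    # The column supplies the upper bits, the row the lower four bits.
    return column * 16 + row

def to_position(number):
    """Convert a character number (0 to 255) back to "column/row" notation."""
    if not 0 <= number <= 255:
        raise ValueError("character number must be in the range 0 to 255")
    return f"{number // 16}/{number % 16}"

for position in ("2/5", "4/0"):
    number = to_character_number(position)
    print(position, number, chr(number))
```

Applying to_position to 65 gives "4/1", the position of the upper-case letter "A" in ASCII.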

Some of the escape sequences defined in ISO 2022 allow a document to switch between the C0 and the C1, and between the G0, G1, G2 and G3 character sets. This switching is required because there are only so many different character numbers, and to get a larger number of characters, it has to be possible to switch the roles assigned to character numbers. 7-bit character sets can only select one control character set and one graphic character set at a time. 8-bit character sets can have two control sets and two graphic sets at a time.

Other sequences are used to associate a predefined character set with a control character set or a graphic character set. These escape sequences are called "designating sequences". In "ESC 2/8 4/2", for example, the second character (2/8) indicates that the escape sequence designates the G0 set, and the third character (4/2) indicates the character set in question (here, ASCII). The sequence "ESC 2/5 4/0", used in the ISO 646 identifier above, is a special case: in ISO 2022, 2/5 introduces the selection of a complete coding system rather than a single graphic set, and 4/0 selects the standard ISO 2022 system. The C0 and G0 sets are the default sets for a 7-bit character set, and are, by default, assigned to the first 128 characters of an 8-bit character set.

Similarly, in the formal public identifier "ISO Registration Number 109//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 3//ESC 2/13 4/3", 2/13 indicates that a G1 set is being designated (which by default is assigned to the upper part of an 8-bit character set), and 4/3 identifies the particular character set.

The last part of the designating sequence, the character that identifies the character set to be selected, is not defined in ISO 2022 (although it does prescribe the range of values the character can have). Another standard, ISO 2375, describes how a designating sequence is registered, but does not say who does the registration. A designating sequence is registered by a "registration authority", usually an international standards body (such as ISO) or a national one (ANSI, CSA, BSI, DIN).

Some ISO standards define character sets. For example, ISO 8859-1 defines Latin Alphabet Number 1. It is an 8-bit character set and defines both G0 and G1 graphic character sets. It specifies designating sequences for assigning these two sets to the lower part (left part) and the upper part (right part) of the 256 character numbers. The sequences are "ESC 2/8 4/2" and "ESC 2/13 4/1".
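
Because a designating sequence is written as a list of character positions, it can be turned into the bytes that would appear in a file. The Python sketch below does this for the two sequences just mentioned; the function name is invented for the example, and the only notation it handles is the name ESC and "column/row" pairs.

```python
def designating_sequence_bytes(sequence):
    """Turn text such as "ESC 2/8 4/2" into the bytes it stands for."""
    result = bytearray()
    for token in sequence.split():
        if token == "ESC":
            result.append(0x1B)  # the escape control character, position 1/11
        else:
            column, row = token.split("/")
            result.append(int(column) * 16 + int(row))
    return bytes(result)

for sequence in ("ESC 2/8 4/2", "ESC 2/13 4/1"):
    print(sequence, "->", designating_sequence_bytes(sequence))
```

In an SGML Declaration these bytes are never embedded in the document; the sequence appears only as text inside the public identifier.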

Other ISO standards that define character sets do not indicate what the appropriate designating sequences are, usually because no registration authority had registered the character set when the standard was published.

2.6. ISO 646 and ASCII

ISO 646 describes a set of ASCII-like 7-bit character sets. They differ in the assignment of characters to the character numbers used by ASCII for "#", "$", "@", "[", "\", "]", "^", "`", "{", "|", "}" and "~". ISO 646 defines an "International Reference Version" (IRV) of the character set, which assigns the same characters to these numbers except that "¤" (the international currency sign) is used in place of "$". "$" is not assigned any special meaning during SGML parsing, so its assignment does not affect how it is processed. As a consequence, ASCII (officially called ANSI X3.4-1986) and the IRV of ISO 646 can be used interchangeably for SGML purposes.

The graphic characters in the lower part of Latin Alphabet Number 1 are the same as those in ASCII. The upper set of graphic characters in Latin Alphabet Number 1 adds common accented characters, and a few mathematical and publishing symbols.
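
This relationship can be checked with Python's built-in codecs, assuming that the "ascii" and "latin-1" codecs faithfully represent ASCII and Latin Alphabet Number 1. The sketch decodes the same byte values under both character sets and then lists the upper set.

```python
# Graphic characters occupy character numbers 33 to 126 in ASCII.
lower_numbers = bytes(range(33, 127))
same_lower_part = lower_numbers.decode("ascii") == lower_numbers.decode("latin-1")

# The upper graphic set of Latin Alphabet Number 1 occupies 160 to 255.
upper_part = bytes(range(160, 256)).decode("latin-1")

print("lower parts identical:", same_lower_part)
print("upper part:", upper_part)
```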

2.7. What Do Character Set Identifiers Mean?

Public identifiers which identify base character sets in the SGML Declaration must (if the FORMAL feature is in effect) conform to the syntax of formal public identifiers that describe character sets. A base character set known to the SGML parser being used (and known by a particular public identifier) has to be used in defining both the document character set and the syntax-reference character set. This is so that the SGML parser knows which characters in the base character set are letters, digits and "special characters" (those allowed in minimum literals other than letters, digits and white space, such as "/" and "*").

Beyond these requirements, the SGML parser is not concerned with the identifiers used for base character sets. The designating sequence, in particular, is simply part of the identifier: it helps human readers of the SGML Declaration recognize the base character set in question, and the parser uses the identifier as a whole to find the base character set containing the letters, digits and special characters. The parser does not act on the designating sequence in any other way.


© Copyright Exoterica Corporation. All rights reserved.