This chapter describes the syntax of an SGML Declaration, and what each of its parts means.
The simplest form of an SGML Declaration is:
<!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference
Version (IRV)//ESC 2/5 4/0"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY PUBLIC "ISO 8879-1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
FEATURES
MINIMIZE DATATAG NO OMITTAG NO RANK NO
SHORTTAG NO
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC NO FORMAL NO
APPINFO NONE
>
The SGML Declaration is made up of seven parts:
the descriptive header, which must be a minimum literal with the value "ISO 8879:1986";
the "document character set" which is introduced by the keyword CHARSET;
the "capacity set", introduced by CAPACITY;
the "concrete syntax scope", introduced by SCOPE;
the "concrete syntax", introduced by SYNTAX;
the "feature use", introduced by FEATURES; and
the "application-specific information", introduced by APPINFO.
The descriptive header identifies the version of the SGML standard to which the document containing the SGML Declaration conforms. There is at present only one version of the standard, known as ISO 8879:1986 (i.e. International Organization for Standardization standard number 8879, version published in 1986).
Each of the other parts of the SGML Declaration is described in detail in the following sections.
As in other declarations in an SGML document, spaces, tabs, line breaks and comments are used to separate the keywords, numbers and literals in an SGML Declaration. Comments start and end with the "COM" delimiter ("--"). The SGML Declaration itself is always coded using the Reference Concrete Syntax. Any change to delimiters specified in the SGML Declaration does not take effect until after the end of the declaration.
As described in the chapter on Character Sets in the SGML Declaration, the meanings of characters in an SGML document are defined using three types of character sets:
The "syntax-reference character set" is used to assign meanings to character numbers.
A correspondence is defined between characters in the syntax-reference character set and characters in a "base character set", which transfers the meanings of syntax-reference characters to corresponding base characters.
A correspondence is then defined between characters in one or more base character sets and the "document character set". One of the base character sets must have been defined by being associated with a syntax-reference character set in a previous step.
The document character set is what defines what characters can be used in the markup and text of a document and defines what those characters mean.
On most computer systems, there is a single character set used by all documents. The upper-case letter "A", for example, is always stored in the computer, and on external devices such as disks, in the same way. For most systems the character set used is the one commonly known as ASCII, as defined by ANSI standard X3.4 and ISO 646. For these systems it is usual to use ISO 646 as the base character set when defining both the document character set and the syntax-reference character set. In this case the only thing that needs to be done in the document character set part of the SGML Declaration is to indicate those characters that are to be allowed in the document, and those which are not (such as control characters that have no meaning in the document).
If a document is coded in some character set other than the base character set, for example, if an EBCDIC document is to be processed by an SGML parser that supports ISO 646 as a base character set, then the document character set has to describe a more complex mapping. In such documents the letter "A" may be represented by a different code than is usual. (If the document is converted from the "other" character set to the base character set before it is processed by the SGML system, then this complex mapping is not needed. For example, it is often easiest to convert EBCDIC documents to ASCII before running them on an ASCII-based system.) A later section describes how to code an EBCDIC-supporting document character set on an SGML system that supports ISO 646.
The document character set part of the SGML Declaration selects one or more base character sets and defines the correspondences between characters in the base character sets and the document character set. The selection of base character sets for defining the syntax-reference character set is done in the "SYNTAX" part of the SGML Declaration.
The document character set and the syntax-reference character set are coded in the same manner, but there are different constraints on each of the two character sets. The coding of the document character set is described here. The coding of the syntax-reference character set is described in a later section.
The following is a simple example of a document character set:
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference
Version (IRV)//ESC 2/5 4/0"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
A character set is described in the SGML Declaration by providing a base character set (BASESET), identified by a public identifier, followed by a "described character set portion" (DESCSET).
The character numbers in the first column in the described character set portion are in the document character set. The numbers in the third column are in the base character set. Each document character is assigned the meaning of the corresponding base character. The numbers in the second column indicate the number of adjacent document characters that are to be assigned meanings from adjacent base characters. For example, the line in the above example that reads "0 9 UNUSED" makes the nine document characters with numeric codes 0 to 8 non-SGML characters, and the line that reads "9 2 9" assigns the meanings of base characters 9 and 10 to document characters 9 and 10 (which are TAB and RS in the Reference Concrete Syntax).
Character numbers in the SGML Declaration are written as decimal numbers. The ISO method of writing character numbers (high-order four bits, slash, low-order four bits, as in "2/5" for character number 37) is not used in the SGML Declaration except within designating sequences in formal public identifiers for character sets.
UNUSED in the third column indicates that a character is assigned no meaning and is a "non-SGML" character. Character numbers not defined are also non-SGML characters. In the above example, characters 128 to 255 are assigned no meaning. Only characters that are assigned meanings in the document character set can be used in the document that follows the SGML Declaration, so UNUSED has the effect of excluding a character from a document.
A minimum literal can also appear in the third column. The meaning assigned in this case is that of a non-significant SGML data character (i.e. a character that is allowed in text in a marked-up document, but which does not have any assigned role: it is not a name character or part of a delimiter). For example, if the following line were added to the above declaration, then it would support the high-order ASCII characters as data characters:
128 128 "High-order characters"
If character 255 is to be defined as a non-SGML character, the following two lines would be added instead:
128 127 "High-order characters"
255 1 UNUSED
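The way a described character set portion is read can be sketched in a few lines of Python. This is only an illustration of the three-column rule described above, not part of any SGML parser: the function name expand_descset and the tuple representation of each row are assumptions made for the example.

```python
from typing import Dict, List, Tuple, Union

# Each DESCSET row is (first document number, count, target). The target is
# a base character number, the keyword "UNUSED", or a literal description
# (any other string). This tuple form is an assumption of the sketch.
Row = Tuple[int, int, Union[int, str]]


def expand_descset(rows: List[Row]) -> Dict[int, Union[int, str, None]]:
    """Map each described document character number to its meaning.

    An int is the base character whose meaning is assigned, None marks a
    non-SGML character (UNUSED), and a str is the literal that describes
    a plain data character.
    """
    mapping: Dict[int, Union[int, str, None]] = {}
    for first, count, target in rows:
        for offset in range(count):
            doc_char = first + offset
            if doc_char in mapping:
                raise ValueError(f"document character {doc_char} described twice")
            if target == "UNUSED":
                mapping[doc_char] = None
            elif isinstance(target, int):
                # Adjacent document characters take adjacent base characters.
                mapping[doc_char] = target + offset
            else:
                mapping[doc_char] = target
    return mapping


# The simple document character set, extended with the two high-order lines.
simple = expand_descset([
    (0, 9, "UNUSED"), (9, 2, 9), (11, 2, "UNUSED"), (13, 1, 13),
    (14, 18, "UNUSED"), (32, 95, 32), (127, 1, "UNUSED"),
    (128, 127, "High-order characters"), (255, 1, "UNUSED"),
])
print(simple[8], simple[9], simple[10], simple[65], simple[255])
```

Character numbers that never appear in any row are simply absent from the result, which corresponds to their being non-SGML characters.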
The following document character set exchanges the upper- and lower-case letters in the document character set. The lower-case letters will be treated as upper-case and vice-versa. If NAMECASE GENERAL is YES, for example, all general names will be converted to lower-case in the document character set.
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference
Version (IRV)//ESC 2/5 4/0"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 33 32
65 26 97 -- Upper-case letters --
91 6 91
97 26 65 -- Lower-case letters --
123 4 123
127 1 UNUSED
The document character set defines what characters in the document have what meanings from the syntax-reference character set (via a base character set), but does not change characters in the document. For example, the document character set above exchanges the meanings of upper- and lower-case letters; it does not actually replace characters in a document being parsed.
Character roles which are not assigned in the SGML Declaration, such as "UC Letter" (i.e. "A" through "Z"), are assigned to the characters in the base character set that the base character set itself considers to have those roles. For this reason, the base character set must be known to the SGML system processing the SGML Declaration. This does not in any way restrict the character set used in a marked-up document, because that is defined by the document character set.
The base character set used to define these character roles not defined in the SGML Declaration must be the same as the one used to define the syntax-reference character set so that its roles can be assigned to document characters. In other words, the same base character set must be used to define both the document and syntax-reference character sets so that they both agree which character is the upper-case letter "A", for example.
Additional base character sets can be used in defining the document character set, so long as the character meanings assigned from them are only data characters. Additional base character sets are specified by entering another "BASESET" followed by another "DESCSET" that describes the correspondence between document characters and the characters in the base character set.
All "significant" SGML characters in a base character set must be assigned meanings in the document character set. Significant characters are those that have intrinsic significance (such as letters and digits), and those that are assigned meanings from the syntax-reference character set defined in the concrete syntax (i.e. the "SYNTAX" part of the SGML Declaration). This includes all upper- and lower-case name characters, all digits, all function characters (such as RE, RS and SPACE), and all characters that appear in General and Short Reference Delimiters.
No "shunned" character in the base character set (as defined by the concrete syntax), that is not a significant character, can be assigned to a document character. This is why the only control characters assigned to document characters in the above example are RE, RS and TAB. All control characters are shunned in the Reference Concrete Syntax, and these three control characters are the only ones made significant by the Reference Concrete Syntax. Note that RE, RS and TAB are significant even though they are shunned.
The meaning of each significant character in a base character set used to define a document character set must be assigned to one, and only one, character in the document character set. The meaning of a non-significant base character can be assigned to no, one or more than one character in the document character set. The characters can be defined in any order: the numbers in the left-hand column do not have to be in ascending order.
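The "one, and only one" rule can be checked mechanically. The following Python sketch counts how often each base character is assigned and reports any significant character that is not assigned exactly once. The function names, the tuple form of the rows, and the deliberately simplified set of significant characters (letters, digits, RE, RS, TAB, SPACE, "-" and ".") are all assumptions made for illustration; a real check would also include every delimiter character of the concrete syntax.

```python
from collections import Counter


def base_usage(rows):
    """Count how many document characters each base character is assigned to."""
    usage = Counter()
    for first, count, target in rows:
        if isinstance(target, int):          # skip UNUSED and literal rows
            for offset in range(count):
                usage[target + offset] += 1
    return usage


def check_significant(rows, significant):
    """Return the significant base characters not assigned exactly once."""
    usage = base_usage(rows)
    return sorted(c for c in significant if usage[c] != 1)


# The case-exchanging document character set shown earlier.
swapped = [
    (0, 9, "UNUSED"), (9, 2, 9), (11, 2, "UNUSED"), (13, 1, 13),
    (14, 18, "UNUSED"), (32, 33, 32), (65, 26, 97), (91, 6, 91),
    (97, 26, 65), (123, 4, 123), (127, 1, "UNUSED"),
]

# Simplified, assumed set of significant characters (ISO 646 numbers).
significant = (set(range(65, 91)) | set(range(97, 123)) |
               set(range(48, 58)) | {9, 10, 13, 32, 45, 46})

print(check_significant(swapped, significant))

# Dropping the row for character 13 leaves RE unassigned.
broken = [row for row in swapped if row[0] != 13]
print(check_significant(broken, significant))
```

Because the rows are processed independently, the check does not care about the order in which they are given.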
The public identifier after BASESET indicates the base character set being used to assign meanings. The public identifier must be one known to the system processing the SGML Declaration. It is used to identify the base character set to the SGML parser so that it knows where to find such characters as the letters and digits.
If the FORMAL optional feature is YES in the "feature use" in the SGML Declaration, all public identifiers in the SGML Declaration have to be formal public identifiers.
It is an error for any literal in the SGML Declaration to be longer than 240 characters, the LITLEN quantity of the Reference Concrete Syntax.
The SGML Declaration itself is always parsed in the Reference Concrete Syntax. Among other things this means that keywords in the SGML Declaration can be entered in upper- or lower-case or any mixture of the two. Upper-case is used in most of the following examples, but only to distinguish them from the surrounding text, not because they have to be that way. The SGML Declaration itself only takes effect after the end of the SGML Declaration.
A capacity is a number that sets a limit on the total number of objects of one kind that can appear in an SGML document. For example, the capacity called "ELEMCAP" sets a limit on the total number of different elements that can be declared. The number of elements declared in an SGML prolog is multiplied by the NAMELEN quantity (the maximum length of a name) and the product is compared to ELEMCAP. It is an error for the product to exceed ELEMCAP. Similar calculations are done on other counts, such as the number of entities and the number of ID attribute values specified.
The capacity "TOTALCAP" sets a limit on the sum of the products compared to all the other capacities.
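The arithmetic described above can be sketched as follows. The sketch follows only the simplified description given in this chapter (count multiplied by NAMELEN, compared to the limit); the function name capacity_report and the dictionary layout are assumptions, and the object counts are invented for the example.

```python
def capacity_report(counts, limits, namelen=8):
    """Compare capacity points to their limits.

    counts maps a capacity name to the number of objects counted against it;
    limits maps capacity names (including TOTALCAP) to their limits.
    Returns the total number of points and the names of exceeded capacities.
    """
    points = {name: count * namelen for name, count in counts.items()}
    exceeded = sorted(name for name, used in points.items() if used > limits[name])
    total = sum(points.values())
    if total > limits["TOTALCAP"]:
        exceeded.append("TOTALCAP")
    return total, exceeded


limits = {"TOTALCAP": 35000, "ELEMCAP": 35000, "ENTCAP": 35000}

# 3000 elements and 2000 entities: each product is within its own limit,
# but their sum is larger than TOTALCAP.
total, exceeded = capacity_report({"ELEMCAP": 3000, "ENTCAP": 2000}, limits)
print(total, exceeded)
```

The example shows why TOTALCAP matters separately: no single capacity is exceeded, yet the document as a whole is too large for the capacity set.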
The "Reference Capacity Set" can be selected by using its public identifier:
CAPACITY PUBLIC "ISO 8879-1986//CAPACITY Reference//EN"
The Reference Capacity Set assigns the value 35,000 to all the capacity limits.
Values can be assigned to capacity limits by replacing the public identifier with the keyword "SGMLREF" followed by one or more pairs of capacity names and limits:
CAPACITY SGMLREF
TOTALCAP 50000
ELEMCAP 40000
ENTCAP 40000
Allowed capacity names are: TOTALCAP, ENTCAP, ENTCHCAP, ELEMCAP, GRPCAP, EXGRPCAP, EXNMCAP, ATTCAP, ATTCHCAP, AVGRPCAP, NOTCAP, NOTCHCAP, IDCAP, IDREFCAP, MAPCAP, LKSETCAP, and LKNMCAP. Capacities can appear in any order.
The value assigned to TOTALCAP has to be at least as large as any other capacity value.
All unspecified capacity values are set to the SGML reference value, which is 35,000.
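The rules for a capacity set that starts with SGMLREF can be summarized in a short Python sketch: unspecified capacities take the reference value, only the listed names are allowed, and TOTALCAP must be at least as large as every other value. The function name resolve_capacities is an assumption made for the example.

```python
REFERENCE_LIMIT = 35000

CAPACITY_NAMES = [
    "TOTALCAP", "ENTCAP", "ENTCHCAP", "ELEMCAP", "GRPCAP", "EXGRPCAP",
    "EXNMCAP", "ATTCAP", "ATTCHCAP", "AVGRPCAP", "NOTCAP", "NOTCHCAP",
    "IDCAP", "IDREFCAP", "MAPCAP", "LKSETCAP", "LKNMCAP",
]


def resolve_capacities(overrides):
    """Return the full capacity set for 'CAPACITY SGMLREF' plus overrides."""
    unknown = sorted(set(overrides) - set(CAPACITY_NAMES))
    if unknown:
        raise ValueError(f"unknown capacity names: {unknown}")
    # Capacities not mentioned keep the reference value.
    limits = {name: overrides.get(name, REFERENCE_LIMIT) for name in CAPACITY_NAMES}
    too_large = sorted(n for n, v in limits.items() if v > limits["TOTALCAP"])
    if too_large:
        raise ValueError(f"larger than TOTALCAP: {too_large}")
    return limits


# The capacity set from the example above.
limits = resolve_capacities({"TOTALCAP": 50000, "ELEMCAP": 40000, "ENTCAP": 40000})
print(limits["TOTALCAP"], limits["ELEMCAP"], limits["IDCAP"])
```

Raising ELEMCAP to 40000 without also raising TOTALCAP would be rejected by this sketch, since TOTALCAP would still have the reference value of 35000.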
The Concrete Syntax Scope determines whether the concrete syntax defined in the following part of the SGML Declaration applies to the whole of the document following the SGML Declaration or just to the document instance following the prolog (i.e. not to the document type declaration).
The Concrete Syntax Scope can be either
SCOPE DOCUMENT
or
SCOPE INSTANCE
If DOCUMENT, the concrete syntax applies to the whole of the SGML document following the SGML Declaration. If INSTANCE, the concrete syntax applies only to the document instance and not to the prolog. In this latter case, the Reference Concrete Syntax applies to the prolog.
The document character set is used for the whole document in either case. If the SCOPE is INSTANCE, the document character set must include all significant characters for both concrete syntaxes being used.
Note that the Capacity Set and Feature Use defined in the SGML Declaration are not part of the concrete syntax, and so apply to both the prolog and instance, no matter what the scope is.
The simpler form of the Concrete Syntax is to give the public identifier of the Public Concrete Syntax required:
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
Four Public Concrete Syntaxes are defined in ISO 8879, the SGML Standard. Their public identifiers are:
"ISO 8879-1986//SYNTAX Reference//EN"
which identifies the Reference Concrete Syntax.
"ISO 8879-1986//SYNTAX Core//EN"
which identifies the Core Concrete Syntax. It differs from the Reference Concrete Syntax only in not defining any Short Reference Delimiters.
"ISO 8879-1986//SYNTAX Multicode Basic//EN"
which identifies the Multicode Basic Concrete Syntax. It differs from the Reference Concrete Syntax only in that it defines a number of markup suppression characters.
"ISO 8879-1986//SYNTAX Multicode Core//EN"
which identifies the Multicode Core Concrete Syntax. It differs from the Reference Concrete Syntax in that it defines a number of markup suppression characters and does not define any Short Reference Delimiters.
The Reference Concrete Syntax and markup suppression are described in more detail later.
The public identifier is optionally followed by a set of "switches". For example:
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Core//EN"
SWITCHES 47 92
In this example the Core Concrete Syntax is to be used (no Short Reference Delimiters) except that in each place that character number 47 (solidus: "/") is used in the Core Concrete Syntax, character number 92 is to be used instead (back-slash: "\"). The result is a concrete syntax that is exactly like the Core Concrete Syntax except that the ETAGO delimiter is "<\" and the NET delimiter is "\".
Note that slashes and back-slashes still represent themselves in literals and in the text of the parsed document, and that slashes are still used in formal public identifiers. Only the delimiters change. Also note that if back-slash were used in any delimiter in the concrete syntax it would not be replaced by solidus: "switch" does not mean "exchange".
To exchange pairs of characters, matched pairs of switches are required. For example:
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Multicode Basic//EN" SWITCHES 60 123 123 60 62 125 125 62
In this example the Multicode Basic Concrete Syntax is to be used (Markup Suppression Characters and reference Short Reference Delimiters) except that in the delimiters, the less-than and greater-than symbols ("<" and ">") are to be exchanged with the open and closing braces ({ and }). The result of this is to make ETAGO be "{/", MDO be "{!", PIO be "{?", STAGO be "{" and each of MDC, PIC and TAGC be "}". "<" and ">" become Short Reference Delimiters, the roles that "{" and "}" have in the Multicode Basic Concrete Syntax.
ISO 8879 prohibits switching to or from a letter or a digit. That is, a letter or a digit cannot be switched to any other character, and no character can be switched to a letter or a digit.
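The effect of switches on delimiters can be illustrated with a small Python sketch. It treats SWITCHES as a flat list of character-number pairs and applies all of them in one pass, so a single pair replaces and a matched pair exchanges. The function name apply_switches and the dictionary of delimiters are assumptions made for the example; only a handful of delimiters are shown.

```python
def apply_switches(delimiters, switches):
    """Apply SWITCHES pairs (old1, new1, old2, new2, ...) to delimiter strings."""
    if len(switches) % 2:
        raise ValueError("switches must be given in pairs")
    table = {}
    for old, new in zip(switches[0::2], switches[1::2]):
        for number in (old, new):
            ch = chr(number)
            # Switching to or from a letter or a digit is prohibited.
            if ch.isascii() and ch.isalnum():
                raise ValueError(f"cannot switch to or from {ch!r}")
        table[old] = new
    # str.translate replaces every character in a single pass, so the pair
    # 60 -> 123 and 123 -> 60 exchanges the two characters.
    return {name: value.translate(table) for name, value in delimiters.items()}


reference = {"ETAGO": "</", "NET": "/", "STAGO": "<", "TAGC": ">", "MDO": "<!"}

replaced = apply_switches(reference, [47, 92])
print(replaced["ETAGO"], replaced["NET"])

exchanged = apply_switches(reference, [60, 123, 123, 60, 62, 125, 125, 62])
print(exchanged["ETAGO"], exchanged["MDO"], exchanged["TAGC"])
```

With the single pair 47 92, a back-slash that already appeared in some delimiter would be left alone, which is the sense in which "switch" does not mean "exchange".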
The other form of the Concrete Syntax is to "do it yourself". The result of doing this is called a "variant concrete syntax". The Reference Concrete Syntax is defined as follows:
SYNTAX
SHUNCHAR CONTROLS
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
BASESET "ISO 646-1983//CHARSET International Reference
Version (IRV)//ESC 2/5 4/0"
DESCSET
0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING
LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR "-."
UCNMCHAR "-."
NAMECASE GENERAL YES
ENTITY NO
DELIM
GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
All eight components of the concrete syntax are required.
SHUNCHAR CONTROLS
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
A "shunned" character is one which, in the words of ISO 8879, "should be avoided in documents employing the syntax because some systems might erroneously treat it as a control character" (Clause 4.297). Any of the characters whose numbers are listed and which are not assigned significance by the concrete syntax may not be assigned as the meaning of a document character in the document character set.
The keyword CONTROLS following the keyword SHUNCHAR indicates that "any character number that the document character set considers to be the coded representation of a control character, and not a graphic character, is a shunned character" (Clause 13.4.2). This is in addition to the characters whose numbers are listed. In the above example concrete syntax, the list of character numbers is, with the exception of 255, exactly a list of the control characters from ISO 646. So the above shunned character number identification could have been written as:
SHUNCHAR 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
It could also have been written as:
SHUNCHAR CONTROLS 255
The absence of any shunned characters is indicated by:
SHUNCHAR NONE
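The equivalence of the different ways of writing the shunned character number identification can be checked with a short Python sketch. It assumes that the character set in use is ISO 646, so that CONTROLS stands for character numbers 0 to 31 and 127; the function name shunned is also an assumption made for the example.

```python
# Control characters of ISO 646 (assumed to be the character set in use).
ISO646_CONTROLS = set(range(32)) | {127}


def shunned(spec):
    """Return the set of shunned character numbers for a SHUNCHAR clause."""
    tokens = spec.split()
    if not tokens or tokens[0].upper() != "SHUNCHAR":
        raise ValueError("clause must start with SHUNCHAR")
    tokens = tokens[1:]
    if [t.upper() for t in tokens] == ["NONE"]:
        return set()
    result = set()
    if tokens and tokens[0].upper() == "CONTROLS":
        result |= ISO646_CONTROLS
        tokens = tokens[1:]
    # Any remaining tokens are explicit character numbers.
    result |= {int(t) for t in tokens}
    return result


numbers = " ".join(str(n) for n in list(range(32)) + [127, 255])
explicit = shunned("SHUNCHAR " + numbers)
with_keyword = shunned("SHUNCHAR CONTROLS " + numbers)
short_form = shunned("SHUNCHAR CONTROLS 255")

print(explicit == with_keyword == short_form)
```

All three forms yield the same 34 character numbers under the ISO 646 assumption. With a different document character set, CONTROLS could stand for a different set of numbers.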
BASESET "ISO 646-1983//CHARSET International Reference
Version (IRV)//ESC 2/5 4/0"
DESCSET
0 128 0
The Syntax-Reference Character Set is defined similarly to the document character set. It is, however, used exclusively to define the concrete syntax. The character assignments within it are used only in the following concrete syntax definitions.
In a similar manner to the document character set, the characters in the syntax-reference character set are associated with those in the base character set. The above example simply says that the syntax-reference character set is the same as the base character set.
Each character in the syntax-reference character set can only be assigned once, as with the document character set, and a base character can only be assigned to one syntax-reference character (a base character can only be assigned one meaning). Other constraints are described later in this section.
The main function of the concrete syntax is to assign meanings to characters: to make them name characters or function characters, such as RE. The meaning given in the concrete syntax to each syntax-reference character is assigned to the corresponding base character. The document character set then assigns these meanings to the characters in the parsed document itself.
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
Function characters are characters that have names, and optionally, some special significance in parsing an SGML document. The definitions of RE, RS and SPACE must be entered as above, in the order given: only the numeric values of the characters assigned to these functions may be changed. Other "added functions" may be defined after them, in the way that TAB is in the above example.
Function characters can be entered directly in a document, as the spaces between the words in this sentence are, or can be entered using a named character reference such as &#SPACE;.
Each added function is defined by giving its name, function class and the numeric value of the function character. The allowed function classes are: SEPCHAR, MSSCHAR, MSOCHAR, MSICHAR and FUNCHAR.
Only one function may be defined for each character. No two function characters can have the same name. The names "RE", "RS" and "SPACE" are reserved for the three required functions and cannot be assigned to any added function.
A SEPCHAR is a separator character that is allowed in "white space" and in blank sequences in short reference delimiters. The most common example is the TAB character, as defined above.
An MSSCHAR is a markup suppression character. When used in a document it suppresses the recognition as markup of the immediately following character. For example, with the following definition, an ampersand could be entered in a document as "\&": the "\" would suppress the usual recognition of "&" and so it would be parsed as data. Note that "\" is character number 92 in the syntax-reference character set being used in this example.
BACKSL MSSCHAR 92
Markup suppression characters are themselves text. The "\" in this example is passed through to any application that uses the results of parsing an SGML document. If necessary, that application is expected to discard the MSSCHAR.
MSOCHAR and MSICHAR are also markup suppression characters. They suppress markup recognition similarly to MSSCHAR characters except that an MSOCHAR suppresses all markup recognition until an MSICHAR is found or until the end of the entity or document in which the MSOCHAR was found. For example:
MSO MSOCHAR 96
MSI MSICHAR 36
In this example, character number 96 ("`" in the example syntax-reference character set) is defined as an MSOCHAR with the name MSO, and character number 36 ("$") is defined as an MSICHAR with the name MSI. So if the string "`<foo></foo>$" were entered in text it would all be recognized as data.
Like the MSSCHAR, MSOCHAR and MSICHAR characters are data characters and are not discarded by an SGML parser. An MSICHAR is allowed in text without a corresponding preceding MSOCHAR: it is simply treated as a data character.
If any character is defined to be an MSOCHAR, then at least one other character must be defined to be an MSICHAR.
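The behaviour of the three kinds of markup suppression character can be sketched in Python as a scan that records, for each character of a string, whether markup recognition is active at that point. This is a simplified illustration and not a parser: the function name recognition_mask is an assumption, the string stands for a single entity, and the suppression characters themselves are simply marked as never being recognized as markup.

```python
def recognition_mask(text, mss=frozenset(), mso=frozenset(), msi=frozenset()):
    """Return one bool per character: True where markup could be recognized.

    mss, mso and msi are sets of character numbers defined as MSSCHAR,
    MSOCHAR and MSICHAR respectively.
    """
    mask = []
    suppressed = False   # between an MSOCHAR and the next MSICHAR
    skip_next = False    # the previous character was an MSSCHAR
    for ch in text:
        code = ord(ch)
        if skip_next:
            mask.append(False)
            skip_next = False
        elif code in msi:
            suppressed = False
            mask.append(False)
        elif code in mso:
            suppressed = True
            mask.append(False)
        elif code in mss:
            skip_next = True
            mask.append(False)
        else:
            mask.append(not suppressed)
    return mask


def show(mask):
    """Render a mask as '^' (recognition active) and '.' (suppressed)."""
    return "".join("^" if active else "." for active in mask)


# MSO is character 96 and MSI is character 36, as in the example above.
print(show(recognition_mask("a`<b>$<c>", mso={96}, msi={36})))

# BACKSL is an MSSCHAR (character 92): only the "&" after it is suppressed.
print(show(recognition_mask("\\&amp", mss={92})))
```

An MSICHAR with no preceding MSOCHAR changes nothing in this sketch, which matches its being treated simply as a data character.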
A FUNCHAR is an "inert" function character. It is simply a character that has a name but no particular function. If character number 127 is to be allowed in a document, for example, it can be defined as an inert function character by:
DEL FUNCHAR 127
The document character set of any concrete syntax that defines this particular function character must assign it to some particular document character (because it is a "significant SGML character" as described in the section on the Document Character Set). It can then be referred to in a document as &#DEL;.
A concrete syntax that defines markup suppression characters is called a "multicode concrete syntax". Control characters are very often used as markup suppression characters. The two multicode public concrete syntaxes defined in ISO 8879 do so. The Multicode Basic Concrete Syntax differs from the Reference Concrete Syntax in defining the following set of function characters, which is that of the Reference Concrete Syntax with the addition of five markup suppression characters:
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
ESC MSOCHAR 27
LS0 MSICHAR 15
LS1 MSOCHAR 14
SS2 MSSCHAR 142
SS3 MSSCHAR 143
The Multicode Core Concrete Syntax differs from the Core Concrete Syntax in defining the same set of additional markup suppression characters.
NAMING
LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR "-."
UCNMCHAR "-."
NAMECASE GENERAL YES
ENTITY NO
Naming rules determine what characters can appear in names and whether or not case is significant in names. The above naming rules specify that names can start with letters, and that names can contain letters, digits, dashes ("-") and periods ("."). They also specify that case is significant in entity names (e.g. the entities named "FOO", "foo" and "Foo" are all different entities), but not in other names (e.g. the elements named "BAR", "bar" and "Bar" are all the same element).
The first character of a name using a particular concrete syntax must be a letter or one of the characters in the literals that follow the LCNMSTRT and UCNMSTRT keywords in the naming rules. The other characters in a name must be letters, digits or among the characters in the literals following the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR keywords in the naming rules.
The separation of additional name characters into "UC" and "LC" parts makes it possible to define which of the characters are the "upper-case" of others. Each character in the UCNMSTRT literal is the upper-case equivalent of the correspondingly positioned character in the LCNMSTRT literal. If a character has no lower-case form the same character is put in both literals. The literals must, of course, be the same length. The literals given for LCNMCHAR and UCNMCHAR work in the same way.
These literals are all parameter literals, so they can contain character references.
The NAMECASE part of the naming rules specifies whether case is significant for names. Either YES or NO can be specified for GENERAL names and for ENTITY names. Entity names are the names of general and parameter entities. General names are all other names, including function character names and reserved names. If NAMECASE GENERAL is NO, for example, an element declaration must start with "<!ELEMENT" (when using the reference delimiters and reserved names). If NAMECASE GENERAL is YES, it can have lower-case letters in it, and can be entered as "<!element" or "<!Element", among others.
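The naming rules can be illustrated with a short Python sketch that builds the sets of name start and name characters from the four literals and pairs each lower-case character with its upper-case equivalent. The function names are assumptions made for the example, the letters and digits are taken to be those of ISO 646, and the NAMELEN limit on name length is not checked.

```python
import string


def build_naming(lcnmstrt, ucnmstrt, lcnmchar, ucnmchar):
    """Return (start characters, name characters, upper-case table)."""
    if len(lcnmstrt) != len(ucnmstrt) or len(lcnmchar) != len(ucnmchar):
        raise ValueError("LC and UC literals must be the same length")
    start = set(string.ascii_letters) | set(lcnmstrt) | set(ucnmstrt)
    other = start | set(string.digits) | set(lcnmchar) | set(ucnmchar)
    # Each UC character is the upper-case form of the LC character at the
    # same position in the corresponding literal.
    upper = {ord(lc): ord(uc)
             for lc, uc in zip(lcnmstrt + lcnmchar, ucnmstrt + ucnmchar)}
    upper.update({ord(lc): ord(uc)
                  for lc, uc in zip(string.ascii_lowercase, string.ascii_uppercase)})
    return start, other, upper


def is_name(text, start, other):
    """True if text starts with a name start character and continues with name characters."""
    return bool(text) and text[0] in start and all(c in other for c in text[1:])


def fold(name, upper):
    """Upper-case substitution, as applied when NAMECASE is YES."""
    return name.translate(upper)


# Reference naming rules.
start, other, upper = build_naming("", "", "-.", "-.")
print(is_name("sec-1.a", start, other), is_name("-sec", start, other))
print(fold("Bar-1", upper))

# A hypothetical variant that also allows "_" at the start of a name.
start2, other2, upper2 = build_naming("_", "_", "-.", "-.")
print(is_name("_tmp", start2, other2), fold("_tmp", upper2))
```

In the variant, "_" has no separate upper-case form, so the same character appears in both the LCNMSTRT and UCNMSTRT literals.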
The SGML Declaration itself is parsed using the Reference Concrete Syntax. This means that any keyword in the SGML Declaration can be entered with upper- or lower-case letters.
DELIM
GENERAL SGMLREF
The above form indicates that the reference SGML values are to be used for each of the General Delimiters. Any general delimiter can be redefined as follows:
DELIM
GENERAL SGMLREF
GRPO "{"
GRPC "}"
This example specifies that the GRPO general delimiter is to be the open brace ("{"), and that the GRPC general delimiter is to be the closing brace ("}"). In other words, wherever parentheses would normally appear in markup, braces must be used instead. Note that this does not affect the use of parentheses and braces in text (PCDATA) or in attribute values.
A general delimiter is redefined by giving its name followed by its redefinition. They do not have to be given in any particular order. It is an error for the same delimiter string to be assigned to two general delimiters that can appear in the same parsing context in the document. The different parsing contexts are listed in Figure 3 in ISO 8879 and are described in the accompanying text.
The redefinition of a general delimiter is a parameter literal, so it can include character references. It is limited to 240 characters, the LITLEN value of the Reference Concrete Syntax.
ISO 8879 dictates that general delimiters cannot consist solely of function characters.
All the general delimiters can be redefined. Any that are not redefined take their reference values. The following definition lists all the general delimiters by name along with their reference values:
GENERAL SGMLREF
AND "&"
COM "--"
CRO "&#"
DSC "]"
DSO "["
DTGC "]"
DTGO "["
ERO "&"
ETAGO "</"
GRPC ")"
GRPO "("
LIT '"'
LITA "'"
MDC ">"
MDO "<!"
MINUS "-"
MSC "]]"
NET "/"
OPT "?"
OR "|"
PERO "%"
PIC ">"
PIO "<?"
PLUS "+"
REFC ";"
REP "*"
RNI "#"
SEQ ","
STAGO "<"
TAGC ">"
VI "="
If NAMECASE GENERAL is YES in the naming rules, all letters in general and short reference delimiters are interpreted as if they were entered in their upper-case form.
SHORTREF SGMLREF
The above form indicates that the reference SGML values are to be used for the Short Reference Delimiters. Short reference delimiters can be added to the set of reference values as follows:
SHORTREF SGMLREF
"\"
"---"
This example specifies that all the reference short reference delimiters plus the character sequences "\" and "---" are to be used as short reference delimiters. If the reference short reference delimiters are not wanted, the word SGMLREF can be replaced by the word NONE, as in the following example, which specifies that "\" and "---" are to be the only short reference delimiters.
SHORTREF NONE
"\"
"---"
SHORTREF NONE by itself specifies that there are no short reference delimiters.
Short reference delimiters are defined by listing them following the words "SHORTREF SGMLREF" or "SHORTREF NONE". They do not have to be given in any particular order. It is an error for the same delimiter string to be assigned to two short reference delimiters or to a short reference delimiter and a general delimiter that can appear in the same parsing context as a short reference delimiter. The different parsing contexts are listed in Figure 3 in ISO 8879 and are described in the accompanying text.
The redefinition of a short reference delimiter is a parameter literal, so it can include character references. It is limited to 240 characters, the LITLEN value of the Reference Concrete Syntax.
The following definition lists all the short reference delimiters defined by the Reference Concrete Syntax:
SHORTREF NONE
"&#SPACE;" "&#TAB;" "&#RE;" "&#RS;"
"&#RS;&#RE;" "&#RS;B" "&#RS;B&#RE;" "B&#RE;" "BB"
'"' "#" "%" "'" "(" ")" "*" "," "-"
":" ";" "=" "@" "+" "[" "]"
"^" "_" "{" "|" "}" "~" "--"
A short reference delimiter can contain a single sequence of upper-case letter Bs, which match with strings of "white-space" of at least the same length as the number of Bs. There cannot be more than one such sequence in a short-reference delimiter, and it cannot be adjacent to a white-space character: a SPACE or a SEPCHAR.
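The matching of a B sequence can be illustrated by translating a short reference delimiter into a regular expression. The following Python sketch is only an approximation made for illustration: the function name shortref_pattern is an assumption, character references are taken to be already replaced by the characters they name, and "white-space" is taken to be SPACE and TAB only.

```python
import re


def shortref_pattern(delimiter):
    """Compile a regular expression for one short reference delimiter."""
    if len(re.findall(r"B+", delimiter)) > 1:
        raise ValueError("only one B sequence is allowed")
    if re.search(r"[ \t]B|B[ \t]", delimiter):
        raise ValueError("a B sequence cannot be adjacent to white-space")
    parts = []
    # Splitting on a captured group keeps the B run as its own piece.
    for piece in re.split(r"(B+)", delimiter):
        if not piece:
            continue
        if piece[0] == "B":
            # n Bs match n or more white-space characters.
            parts.append(r"[ \t]" + "{%d,}" % len(piece))
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


two_blanks = shortref_pattern("BB")
print(bool(two_blanks.fullmatch("  ")), bool(two_blanks.fullmatch(" ")))

# "B&#RE;" with RE as character 13: trailing blanks before a record end.
blank_re = shortref_pattern("B\r")
print(bool(blank_re.fullmatch("   \r")), bool(blank_re.fullmatch("\r")))
```

A delimiter without any B, such as "--", is simply matched literally by the same function.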
Note that the DELIM keyword that precedes the GENERAL keyword in the concrete syntax applies to both the GENERAL and SHORTREF delimiter definitions.
NAMES SGMLREF
The above form indicates that the reference SGML values are to be used for each of the reserved names. Any "reserved name" can be redefined as follows:
NAMES SGMLREF
DOCTYPE DTD
ELEMENT EL
PCDATA TEXT
This example specifies that a document type declaration starts with "<!DTD" rather than "<!DOCTYPE", that an element declaration starts with "<!EL" rather than "<!ELEMENT", and that parsed character data in a content model is allowed by the indicator "#TEXT" rather than "#PCDATA".
A reserved name is redefined by giving its reference value followed by its redefinition. They do not have to be given in any particular order.
Reserved names are not especially "reserved". They are more like keywords. Their context always determines that they are a "reserved name" rather than some other sort of name. For example, there is no restriction on using so-called reserved names as element or entity names.
The reserved names that can be redefined are all those that can be used in the prolog (other than in the SGML Declaration itself) and the document instance. They are:
ANY ATTLIST CDATA CONREF CURRENT DEFAULT DOCTYPE ELEMENT EMPTY ENDTAG ENTITIES ENTITY FIXED ID IDLINK IDREF IDREFS IGNORE IMPLIED INCLUDE INITIAL LINK LINKTYPE MS NAME NAMES NDATA NMTOKEN NMTOKENS MD NOTATION NUMBER NUMBERS NUTOKEN NUTOKENS O PCDATA PI POSTLINK PUBLIC RCDATA REQUIRED RESTORE SDATA SHORTREF SIMPLE STARTTAG SUBDOC SYSTEM TEMP USELINK USEMAP
Because of the way in which the SGML Declaration is defined in ISO 8879, the new names must be names in the reference concrete syntax. That is, the characters used in names are limited to the letters, digits, "." and "-".
The new names can be entered in any combination of upper- and lower-case letters. If NAMECASE GENERAL is YES in the naming rules in the SGML Declaration all such combinations are equivalent: case is not significant. If NAMECASE GENERAL is NO, the case in which the new names are entered is the one in which they must appear in the document.
Each reserved name must be different from every other reserved name.
QUANTITY SGMLREF
The above form indicates that the reference SGML values are to be used for each of the quantities. Any quantity value can be redefined as follows:
QUANTITY SGMLREF
NAMELEN 32
LITLEN 2048
This example uses all the reference quantity values except that NAMELEN is 32 and LITLEN is 2,048.
Unlike the capacity set, where at least one capacity value must be specified if "SGMLREF" is used, the quantity set need not have any redefined values.
If NAMELEN is made smaller than the reference value (8), all the reserved names longer than that must have been redefined to shorter names in the preceding reserved names set.
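The interaction between NAMELEN and the reserved names can be checked with a short Python sketch that lists the reserved names still too long for a given NAMELEN. The function name names_too_long and the dictionary of redefinitions are assumptions made for the example.

```python
RESERVED_NAMES = """
ANY ATTLIST CDATA CONREF CURRENT DEFAULT DOCTYPE ELEMENT EMPTY ENDTAG
ENTITIES ENTITY FIXED ID IDLINK IDREF IDREFS IGNORE IMPLIED INCLUDE
INITIAL LINK LINKTYPE MD MS NAME NAMES NDATA NMTOKEN NMTOKENS NOTATION
NUMBER NUMBERS NUTOKEN NUTOKENS O PCDATA PI POSTLINK PUBLIC RCDATA
REQUIRED RESTORE SDATA SHORTREF SIMPLE STARTTAG SUBDOC SYSTEM TEMP
USELINK USEMAP
""".split()


def names_too_long(namelen, redefined):
    """Return reserved names whose spelling in use is longer than NAMELEN.

    redefined maps a reference reserved name to its new spelling, as in
    the NAMES part of the concrete syntax.
    """
    in_use = {name: redefined.get(name, name) for name in RESERVED_NAMES}
    return sorted(name for name, spelling in in_use.items()
                  if len(spelling) > namelen)


# With the reference NAMELEN of 8 nothing needs to be redefined.
print(names_too_long(8, {}))

# With NAMELEN 6, ELEMENT is too long unless it is redefined.
print("ELEMENT" in names_too_long(6, {}))
print("ELEMENT" in names_too_long(6, {"ELEMENT": "EL"}))
```

A fuller check would also confirm that each new spelling is itself a valid name and that no two reserved names end up with the same spelling.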
FEATURES
MINIMIZE DATATAG NO OMITTAG NO RANK NO
SHORTTAG NO
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC NO FORMAL NO
All optional SGML features must be listed in the Feature Use part of the SGML Declaration, and they must be listed in the order given. The features are divided into three groups: minimization features, link features and "other" features. This grouping has no effect on which features can be enabled together.
The example above specifies that no optional feature is enabled for parsing the SGML document. A feature is enabled by replacing the word "NO" that follows its name with "YES" (for DATATAG, OMITTAG, RANK, SHORTTAG, IMPLICIT or FORMAL) or with "YES" followed by a number (for SIMPLE, EXPLICIT, CONCUR or SUBDOC). The number, when present, indicates the count of active Simple Links, active Explicit Links, active Concurrent Documents or nested Subdocuments allowed simultaneously.
The application-specific information is not used in parsing, but is available to the application that receives the output of the SGML parser. It can be specified either as:
APPINFO NONE
or by the keyword "APPINFO" followed by a minimum literal string.
NONE indicates that there is no such information. If a minimum literal is given, it contains information that can be provided to any application using the document. In the following example, the application would be given the text "xyz".
APPINFO "xyz"
©Copyright Exoterica Corporation, All rights reserved.