Understanding The SGML Declaration



4. Examples of SGML Declarations

This section contains three examples of SGML Declarations. The first illustrates using various parts of the SGML Declaration, the second illustrates supporting character sets other than ASCII, and the third illustrates the flexibility of SGML as a grammar-definition tool.

4.1. Putting It All Together

The following example SGML Declaration illustrates many of the things that can be done. It is a deliberate "hodgepodge" of definitions and is intended for illustration only.

Practical SGML Declarations vary as little as possible from the reference versions and use only the features and capabilities needed for a particular application.

<!SGML "ISO 8879:1986"
   CHARSET
      BASESET "ISO 646-1983//CHARSET International Reference
               Version (IRV)//ESC 2/5 4/0"
      DESCSET
          0    9  UNUSED
          9    2    9
         11    2  UNUSED
         13    1   13
         14   18  UNUSED
         32   96   32
              -- Allow character 127 in documents as well.--
        128  127  "High-order characters"
                  -- High-order characters are data. --
        255    1  UNUSED -- Except that #255 is non-SGML. --

   CAPACITY SGMLREF
      TOTALCAP  50000 -- Set all the capacities to 50000. --
      ENTCAP    50000
      ENTCHCAP  50000
      ELEMCAP   50000
      GRPCAP    50000
      EXGRPCAP  50000
      EXNMCAP   50000
      ATTCAP    50000
      ATTCHCAP  50000
      AVGRPCAP  50000
      NOTCAP    50000
      NOTCHCAP  50000
      IDCAP     50000
      IDREFCAP  50000
      MAPCAP    50000
      LKSETCAP  50000
      LKNMCAP   50000

   SCOPE INSTANCE -- The concrete syntax only applies
                     to the document instance. --

   SYNTAX
      SHUNCHAR 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
         18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
         -- Shun just the above base characters. --
      BASESET "ISO 646-1983//CHARSET International Reference
               Version (IRV)//ESC 2/5 4/0"
      DESCSET
          0  128   0
      FUNCTION
         -- All the reference function characters. --
         RE            13
         RS            10
         SPACE         32
         TAB SEPCHAR    9
         FF  SEPCHAR   12
         -- Except that form-feed is also white-space. --
         DEL FUNCHAR  127
         -- And DEL is an inert function. --
      NAMING
         LCNMSTRT "@$"
      -- @ and $ can appear in names and can start them. --
         UCNMSTRT "@$"
      -- They are their own upper-case forms. --
         LCNMCHAR ""
      -- No other name characters, not even . and -. --
         UCNMCHAR ""
         NAMECASE GENERAL NO
                  ENTITY  NO
                  -- Case is significant in all names. --
      DELIM
         GENERAL SGMLREF
 -- Redefine three General Delimiters.  With the following
    definitions, declarations are entered as <!!! ... !!!>
    instead of the usual <! ... > and comment declarations are
    entered as <!!!*** ... ***!!!>.
 --
            COM    "***"
            MDO    "<!!!"
            MDC   "!!!>"
         SHORTREF NONE
 -- Define only two Short Reference Delimiters.  They could
    be used as escape sequences for < and &.
 --
            "\<"
            "\&"
      NAMES SGMLREF
    -- Change two of the keywords used in marked sections. --
         IGNORE  SKIP
         INCLUDE DONTSKIP
      QUANTITY SGMLREF -- Change NAMELEN and LITLEN. --
         NAMELEN   32
         LITLEN  2048

   FEATURES
      MINIMIZE  DATATAG  YES  OMITTAG  YES   RANK     YES
                SHORTTAG YES
      LINK      SIMPLE   NO   IMPLICIT NO    EXPLICIT NO
      OTHER     CONCUR   NO   SUBDOC   YES 1 FORMAL   YES

   APPINFO "WARNINGS YES"
            -- Pass "WARNINGS YES" to the application. --
>
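
The way the DESCSET triples in the CHARSET clause above are read (starting character number, number of characters, then either a base set character number, UNUSED, or a literal description) can be sketched in Python. This is only an illustration: `expand_descset` and `DESCSET_4_1` are hypothetical names, not part of any SGML tool, and the check for gaps and overlaps is an assumption about what a careful author would want to verify.

```python
# Hypothetical helper: expand DESCSET triples (start, count, target) into a
# per-character table. UNUSED is modelled as None; a literal description is
# kept as a string; an integer target maps onto the base character set.
UNUSED = None

def expand_descset(triples, size=256):
    table = {}
    for start, count, target in triples:
        for offset in range(count):
            code = start + offset
            if code in table:
                raise ValueError(f"character {code} described twice")
            if isinstance(target, int):
                table[code] = target + offset  # consecutive base characters
            else:
                table[code] = target           # UNUSED or literal description
    missing = [c for c in range(size) if c not in table]
    return table, missing

# The CHARSET description from the declaration above.
DESCSET_4_1 = [
    (0, 9, UNUSED), (9, 2, 9), (11, 2, UNUSED), (13, 1, 13),
    (14, 18, UNUSED), (32, 96, 32),
    (128, 127, "High-order characters"), (255, 1, UNUSED),
]

table, missing = expand_descset(DESCSET_4_1)
print(missing)      # characters 0..255 that no triple describes
print(table[127])   # base character assigned to document character 127
```

Run against the declaration above, the sketch reports no undescribed characters, and character 127 maps to base character 127, as the comment in the declaration says.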

4.2. Supporting non-ISO 646 (ASCII) Character Sets

Support of a non-ISO 646 character set does not require a change in the concrete syntax used: only the document character set definition needs to be changed. For example, the following document character set would serve for parsing a document coded in the EBCDIC character set.

EBCDIC has no "[" and "]" characters and has two extra characters: the cents symbol and the "not" symbol. The document character set solves this difficulty by simply assigning the EBCDIC cents symbol the meaning of "[" and assigning the EBCDIC "not" symbol the meaning of "]". All other EBCDIC characters are assigned the meanings of the corresponding characters from ISO 646.

   CHARSET
      BASESET "ISO 646-1983//CHARSET International Reference
               Version (IRV)//ESC 2/5 4/0"
      DESCSET
          0    5 UNUSED
          5    1   9     -- TAB (EBCDIC HT) --
          6    7 UNUSED
         13    1  13     -- RE (EBCDIC CR) --
         14   23 UNUSED
         37    1  10     -- RS (EBCDIC LF) --
         38   26 UNUSED
         64    1  32     -- SPACE --
         65    9 UNUSED
         74    1  91     -- [ (EBCDIC "cents" symbol) --
         75    1  46     -- . --
         76    1  60     -- < --
         77    1  40     -- ( --
         78    1  43     -- + --
         79    1 124     -- | --
         80    1  38     -- & --
         81    9 UNUSED
         90    1  33     -- ! --
         91    1  36     -- $ --
         92    1  42     -- * --
         93    1  41     -- ) --
         94    1  59     -- ; --
         95    1  93     -- ] (EBCDIC "not" symbol) --
         96    1  45     -- - --
         97    1  47     -- / --
         98    9 UNUSED
        107    1  44     -- , --
        108    1  37     -- % --
        109    1  95     -- _ --
        110    1  62     -- > --
        111    1  63     -- ? --
        112    9 UNUSED
        121    1  96     -- ` --
        122    1  58     -- : --
        123    1  35     -- # --
        124    1  64     -- @ --
        125    1  39     -- ' --
        126    1  61     -- = --
        127    1  34     -- " --
        128    1 UNUSED
        129    9  97     -- abcdefghi --
        138    7 UNUSED
        145    9 106     -- jklmnopqr --
        154    7 UNUSED
        161    1 126     -- ~ --
        162    8 115     -- stuvwxyz --
        170   22 UNUSED
        192    1 123     -- { --
        193    9  65     -- ABCDEFGHI --
        202    6 UNUSED
        208    1 125     -- } --
        209    9  74     -- JKLMNOPQR --
        218    6 UNUSED
        224    1  92     -- \ --
        225    1 UNUSED
        226    8  83     -- STUVWXYZ --
        234    6 UNUSED
        240   10  48     -- 0123456789 --
        250    6 UNUSED
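
The effect of this document character set can be sketched as a translation table from EBCDIC code points to ISO 646 code points. The following Python is an illustration only: the names `MAPPED`, `build_table` and `to_iso646` are hypothetical, the ranges are copied from the mapped (non-UNUSED) lines of the DESCSET above, and raising an error on an UNUSED character is an assumption about how an application might react.

```python
# Mapped ranges from the DESCSET above, as
# (EBCDIC start, count, ISO 646 start). UNUSED ranges are left out.
MAPPED = [
    (5, 1, 9), (13, 1, 13), (37, 1, 10), (64, 1, 32), (74, 1, 91),
    (75, 1, 46), (76, 1, 60), (77, 1, 40), (78, 1, 43), (79, 1, 124),
    (80, 1, 38), (90, 1, 33), (91, 1, 36), (92, 1, 42), (93, 1, 41),
    (94, 1, 59), (95, 1, 93), (96, 1, 45), (97, 1, 47), (107, 1, 44),
    (108, 1, 37), (109, 1, 95), (110, 1, 62), (111, 1, 63), (121, 1, 96),
    (122, 1, 58), (123, 1, 35), (124, 1, 64), (125, 1, 39), (126, 1, 61),
    (127, 1, 34), (129, 9, 97), (145, 9, 106), (161, 1, 126),
    (162, 8, 115), (192, 1, 123), (193, 9, 65), (208, 1, 125),
    (209, 9, 74), (224, 1, 92), (226, 8, 83), (240, 10, 48),
]

def build_table(ranges):
    """Expand (start, count, base) ranges into a code-point dictionary."""
    table = {}
    for start, count, base in ranges:
        for offset in range(count):
            table[start + offset] = base + offset
    return table

def to_iso646(data, table):
    """Translate EBCDIC bytes; reject characters the DESCSET marks UNUSED."""
    out = []
    for byte in data:
        if byte not in table:
            raise ValueError(f"non-SGML character {byte}")
        out.append(chr(table[byte]))
    return "".join(out)

TABLE = build_table(MAPPED)
# EBCDIC bytes for: <  a  cents  1  not  >
sample = bytes([0x4C, 0x81, 0x4A, 0xF1, 0x5F, 0x6E])
print(to_iso646(sample, TABLE))  # prints <a[1]>
```

The cents and "not" symbols come out as "[" and "]", which is exactly the reassignment described above.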

4.3. Parsing RTF

The following is an SGML Document Entity that contains the grammar for, and an example of, a small subset of Rich Text Format, a word-processor interchange language developed by Microsoft.

<!SGML "ISO 8879:1986"
   CHARSET
      BASESET "ISO 646-1983//CHARSET International Reference
               Version (IRV)//ESC 2/5 4/0"
      DESCSET
          0    9  UNUSED
          9    2    9
         11    2  UNUSED
         13    1   13
         14   18  UNUSED
         32   95   32
        127    1  UNUSED

   CAPACITY PUBLIC "ISO 8879-1986//CAPACITY Reference//EN"
   SCOPE INSTANCE
   SYNTAX
      SHUNCHAR 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
               18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
      BASESET "ISO 646-1983//CHARSET International Reference
               Version (IRV)//ESC 2/5 4/0"
      DESCSET
          0  128   0
      FUNCTION
         RE            13
         RS            10
         SPACE         32
         TAB SEPCHAR    9
      NAMING
         LCNMSTRT ""
         UCNMSTRT ""
         LCNMCHAR "-."
         UCNMCHAR "-."
         NAMECASE GENERAL NO
                  ENTITY  NO
      DELIM
         GENERAL SGMLREF
         SHORTREF NONE
            "&#RE;" "&#SPACE" "\'" "{" "}" "\{" "\}"
            "0" "1" "2" "3" "4" "5" "6" "7"
            "8" "9" "a" "b" "c" "d" "e" "f"
            "\b" "\par" "\f" "\fs"
      NAMES SGMLREF
      QUANTITY SGMLREF
   FEATURES
      MINIMIZE  DATATAG  NO   OMITTAG  YES   RANK     NO
                SHORTTAG YES
      LINK      SIMPLE   NO   IMPLICIT NO    EXPLICIT NO
      OTHER     CONCUR   NO   SUBDOC   NO    FORMAL   YES
   APPINFO NONE
>
<!DOCTYPE rtfdoc [
<!ELEMENT rtfdoc      - O   (rtf)>
<!ENTITY % command    "b | par | f | fs">
<!ELEMENT rtf         - -   (rtf | %command; | rtfchar | #PCDATA)*>
<!ELEMENT (%command;) - O   (#PCDATA)>
<!ELEMENT rtfchar     - -   (high, low)>
<!ELEMENT (high, low) - O   EMPTY>
<!ATTLIST (high, low) value (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
                             8 | 9 | A | B | C | D | E | F)
                            #REQUIRED>

<!SHORTREF rtfmap "\b" b "\par" par "\f" f "\fs" fs "\'" rtfchar
                  "\{" openbrc "\}" closebrc "{" s.rtf "}" e.rtf
                  "&#RE;" null>
<!SHORTREF cmdmap "\b" b "\par" par "\f" f "\fs" fs "\'" rtfchar
                  "\{" openbrc "\}" closebrc "{" s.rtf "}" e.rtf
                  "&#RE;" null "&#SPACE" e.tag>
<!ENTITY b        STARTTAG "b">
<!ENTITY par      STARTTAG "par">
<!ENTITY f        STARTTAG "f">
<!ENTITY fs       STARTTAG "fs">
<!ENTITY s.rtf    STARTTAG "rtf">
<!ENTITY e.rtf    ENDTAG   "rtf">
<!ENTITY rtfchar  STARTTAG "rtfchar">
<!ENTITY e.tag    ENDTAG   "">
<!ENTITY openbrc  CDATA    "{">
<!ENTITY closebrc CDATA    "}">
<!ENTITY null     "">
<!USEMAP rtfmap   rtfdoc>
<!USEMAP cmdmap   (%command;)>

<!SHORTREF high   "0" high0 "1" high1 "2" high2 "3" high3
                  "4" high4 "5" high5 "6" high6 "7" high7
                  "8" high8 "9" high9 "a" highA "b" highB
                  "c" highC "d" highD "e" highE "f" highF>
<!ENTITY high0    "<high 0><!USEMAP low>">
<!ENTITY high1    "<high 1><!USEMAP low>">
<!ENTITY high2    "<high 2><!USEMAP low>">
<!ENTITY high3    "<high 3><!USEMAP low>">
<!ENTITY high4    "<high 4><!USEMAP low>">
<!ENTITY high5    "<high 5><!USEMAP low>">
<!ENTITY high6    "<high 6><!USEMAP low>">
<!ENTITY high7    "<high 7><!USEMAP low>">
<!ENTITY high8    "<high 8><!USEMAP low>">
<!ENTITY high9    "<high 9><!USEMAP low>">
<!ENTITY highA    "<high A><!USEMAP low>">
<!ENTITY highB    "<high B><!USEMAP low>">
<!ENTITY highC    "<high C><!USEMAP low>">
<!ENTITY highD    "<high D><!USEMAP low>">
<!ENTITY highE    "<high E><!USEMAP low>">
<!ENTITY highF    "<high F><!USEMAP low>">
<!USEMAP high     rtfchar>

<!SHORTREF low    "0" low0 "1" low1 "2" low2 "3" low3
                  "4" low4 "5" low5 "6" low6 "7" low7
                  "8" low8 "9" low9 "a" lowA "b" lowB
                  "c" lowC "d" lowD "e" lowE "f" lowF>
<!ENTITY low0     "<low 0></rtfchar>">
<!ENTITY low1     "<low 1></rtfchar>">
<!ENTITY low2     "<low 2></rtfchar>">
<!ENTITY low3     "<low 3></rtfchar>">
<!ENTITY low4     "<low 4></rtfchar>">
<!ENTITY low5     "<low 5></rtfchar>">
<!ENTITY low6     "<low 6></rtfchar>">
<!ENTITY low7     "<low 7></rtfchar>">
<!ENTITY low8     "<low 8></rtfchar>">
<!ENTITY low9     "<low 9></rtfchar>">
<!ENTITY lowA     "<low A></rtfchar>">
<!ENTITY lowB     "<low B></rtfchar>">
<!ENTITY lowC     "<low C></rtfchar>">
<!ENTITY lowD     "<low D></rtfchar>">
<!ENTITY lowE     "<low E></rtfchar>">
<!ENTITY lowF     "<low F></rtfchar>">
]>
{\b\f2\fs24 One paragraph.\par 
Another paragraph with a {\b bold} wor
d, \{braces\}, and a special character
 at the end.\'a5\par}

The instance following the DTD is SGML even though it does not use the familiar angle brackets "<" and ">". Most uses of SGML do not require the definition of markup languages this far from the norm. On the other hand, this is a good illustration that markup languages can be tailored to individual needs.

The same sample document follows in a more familiar form, to illustrate the structure and information that the parse captures. Note that it is clumsier than the RTF.

<rtfdoc>
<rtf><b></b><f>
2
</f><fs>
24
</fs>One paragraph.<par></par>Another paragraph with a <rtf><b></b>bold
</rtf> word, {braces}, and a special character at the end.<rtfchar>
<high value="A">
<low value="5">
</rtfchar><par></par></rtf>
</rtfdoc>
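
The RTF example depends on the parser recognizing the longest matching delimiter, so that "\fs" is not mistaken for "\f" followed by data. A rough Python sketch of that recognition step follows. It is a simplification under stated assumptions: `DELIMS` and `tokenize` are hypothetical names, the hexadecimal digit delimiters and the RE and SPACE delimiters are left out, and the sketch ignores which short reference map is current, which a real SGML parser would not.

```python
# Short reference delimiters from the SYNTAX clause above
# (hex digits, RE and SPACE omitted to keep the sketch small).
DELIMS = ["\\par", "\\fs", "\\f", "\\b", "\\'", "\\{", "\\}", "{", "}"]

def tokenize(text, delims=DELIMS):
    """Split text into ('delim', s) and ('data', s) tokens.

    Longer delimiters are tried first, so the longest match wins.
    """
    ordered = sorted(delims, key=len, reverse=True)
    tokens, data, i = [], [], 0
    while i < len(text):
        match = next((d for d in ordered if text.startswith(d, i)), None)
        if match is None:
            data.append(text[i])       # ordinary data character
            i += 1
            continue
        if data:                       # flush pending data before a delimiter
            tokens.append(("data", "".join(data)))
            data = []
        tokens.append(("delim", match))
        i += len(match)
    if data:
        tokens.append(("data", "".join(data)))
    return tokens

for token in tokenize(r"{\b\f2\fs24 One}"):
    print(token)
```

In the parsed form shown above, each recognized delimiter is replaced by the entity that the current map assigns to it, which is how "\b" becomes a start-tag for the b element.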

©Copyright Exoterica Corporation, All rights reserved.