2.1 Character Set
1/2
{AI95-00285-01} {AI95-00395-01}
{character set} The
character repertoire for the text of an Ada program
consists of the entire coding space described by the ISO/IEC 10646:2003
Universal Multiple-Octet Coded Character Set. This coding space is organized
in planes, each plane comprising 65536 characters.
{plane (character)} {character plane}
1.a/2
This paragraph was deleted. Ramification: {AI95-00285-01}
Any character, including an other_control_function,
is allowed in a comment.
1.b/2
This paragraph was deleted. {AI95-00285-01}
Note that this rule doesn't really have much force,
since the implementation can represent characters in the source in any
way it sees fit. For example, an implementation could simply define that
what seems to be a non-graphic, non-format-effector character is actually
a representation of the space character.
1.c/2
Discussion: {
AI95-00285-01}
It is our intent to follow the terminology of
ISO/IEC 10646:2003 where appropriate,
and to remain compatible with the character classifications defined in
A.3, “Character Handling”.
Note that our definition for
graphic_character is more inclusive than that
of ISO 10646-1.
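As an illustration of the plane organization described in paragraph 1/2, the following minimal Python sketch splits a code position into its plane number and its relative position within that plane. The names `PLANE_SIZE` and `plane_and_position` are hypothetical and are not defined by this International Standard.

```python
PLANE_SIZE = 65536  # each plane comprises 65536 cells (paragraph 1/2)


def plane_and_position(code_position):
    """Return (plane number, relative code position in that plane)."""
    return divmod(code_position, PLANE_SIZE)


# LATIN CAPITAL LETTER A lies in plane 0; 16#1FFFE# lies in plane 1.
print(plane_and_position(0x41))
print(plane_and_position(0x1FFFE))
```

The relative position computed here is the value that later rules compare against 16#FFFE# and 16#FFFF#.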
Syntax
2/2
This paragraph was deleted. {AI95-00285-01}
character ::= graphic_character | format_effector | other_control_function
3/2
This paragraph was deleted. {AI95-00285-01}
graphic_character ::= identifier_letter | digit | space_character | special_character
3.1/2
{
AI95-00285-01}
{
AI95-00395-01}
A character is defined
by this International Standard for each cell in the coding space described
by ISO/IEC 10646:2003, regardless of whether or not ISO/IEC 10646:2003
allocates a character to that cell.
Static Semantics
4/2
{
AI95-00285-01}
{
AI95-00395-01}
The coded representation for characters
is implementation defined [(it need not be a representation defined within
ISO/IEC 10646:2003)].
A character whose relative code position in its plane is 16#FFFE# or
16#FFFF# is not allowed anywhere in the text of a program.
4.a
Implementation defined: The coded representation
for the text of an Ada program.
4.b/2
Ramification: {
AI95-00285-01}
Note that this rule doesn't really have much force,
since the implementation can represent characters in the source in any
way it sees fit. For example, an implementation could simply define that
what seems to be an other_private_use character
is actually a representation of the space character.
4.1/2
{
AI95-00285-01}
The semantics of an Ada program whose text is not
in Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003)
is implementation defined.
4.c/2
Implementation defined:
The semantics of an Ada program whose
text is not in Normalization Form KC.
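A tool that wants to warn about program text falling under paragraph 4.1/2 could test for Normalization Form KC before compilation. The sketch below is only an illustration: it assumes Python 3.8 or later (for `unicodedata.is_normalized`) and relies on Python's bundled Unicode data, which is newer than ISO/IEC 10646:2003. The function name `is_nfkc` is hypothetical.

```python
import unicodedata


def is_nfkc(text):
    """Return True if text is already in Normalization Form KC."""
    return unicodedata.is_normalized("NFKC", text)


# Plain Latin letters are unchanged by NFKC.
print(is_nfkc("Count"))
# U+FB01 (LATIN SMALL LIGATURE FI) has a compatibility decomposition,
# so text containing it is not in Normalization Form KC.
print(is_nfkc("\ufb01le"))
```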
5/2
{
AI95-00285-01}
The description of the language definition in this International Standard
uses the
character properties General Category,
Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition
of the documents referenced by the note in section 1 of ISO/IEC 10646:2003.
The actual set of
graphic symbols used by an implementation for the visual representation
of the text of an Ada program is not specified.
{unspecified
[partial]}
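To make the character properties named in paragraph 5/2 concrete, the following Python sketch looks up the General Category of a character and applies an uppercase mapping. It is a rough illustration under two assumptions: Python's `unicodedata` tables (a later Unicode version than the one referenced by ISO/IEC 10646:2003) are close enough for the characters shown, and Python's `str.upper` is taken as a stand-in for the Uppercase Mapping property. Python does not expose Simple Uppercase Mapping directly. The function name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return (General Category code, uppercase mapping) for one character."""
    return unicodedata.category(ch), ch.upper()


for ch in ("A", "a", "_", "\u00df"):
    category, upper = describe(ch)
    print(f"U+{ord(ch):04X}", category, upper)
```

U+00DF (LATIN SMALL LETTER SHARP S) is included because its uppercase mapping is two characters long, which is one reason more than one case-mapping property is listed.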
6/2
{
AI95-00285-01}
Characters are categorized
as follows:
6.a/2
Discussion: Our
character classification considers that the cells not allocated in ISO/IEC
10646:2003 are graphic characters, except for those whose relative code
position in their plane is 16#FFFE# or 16#FFFF#. This seems to provide
the best compatibility with future versions of ISO/IEC 10646, as future
characters can already be used in Ada character and string literals.
7/2
This paragraph was deleted. {AI95-00285-01}
{identifier_letter}
identifier_letter
upper_case_identifier_letter
| lower_case_identifier_letter
7.a/2
This paragraph was deleted. Discussion: {AI95-00285-01}
We use identifier_letter
instead of simply letter because ISO 10646
BMP includes many other characters that would generally be considered
"letters."
8/2
{
AI95-00285-01}
{letter_uppercase}
letter_uppercase
Any character whose General Category is defined
to be “Letter, Uppercase”.
9/2
{
AI95-00285-01}
{letter_lowercase}
letter_lowercase
Any character whose General Category is defined
to be “Letter, Lowercase”.
9.a/1
This paragraph was deleted. To be honest: {8652/0001} {AI95-00124-01}
The above rules do not include the ligatures Æ
and æ. However, the intent is to include these characters as identifier
letters. This problem was pointed out by a comment from the Netherlands.
9.1/2
letter_titlecase
Any character whose General Category is defined
to be “Letter, Titlecase”.
9.2/2
letter_modifier
Any character whose General Category is defined
to be “Letter, Modifier”.
9.3/2
letter_other
Any character whose General Category is defined
to be “Letter, Other”.
9.4/2
mark_non_spacing
Any character whose General Category is defined
to be “Mark, Non-Spacing”.
9.5/2
mark_spacing_combining
Any character whose General Category is defined
to be “Mark, Spacing Combining”.
10/2
number_decimal
Any character whose General Category is defined
to be “Number, Decimal”.
10.1/2
number_letter
Any character whose General Category is defined
to be “Number, Letter”.
10.2/2
punctuation_connector
Any character whose General Category is defined
to be “Punctuation, Connector”.
10.3/2
other_format
Any character whose General Category is defined
to be “Other, Format”.
11/2
{
AI95-00285-01}
{separator_space}
separator_space
Any character whose General Category is defined
to be “Separator, Space”.
12/2
{
AI95-00285-01}
{separator_line}
separator_line
Any character whose General Category is defined
to be “Separator, Line”.
12.a/2
This paragraph was deleted. Ramification: {AI95-00285-01}
Note that the no break space and soft hyphen are
special_characters, and therefore graphic_characters.
They are not the same characters as space and hyphen-minus.
12.1/2
separator_paragraph
Any character whose General Category is defined
to be “Separator, Paragraph”.
13/2
format_effector
The characters whose code positions are 16#09#
(CHARACTER TABULATION), 16#0A# (LINE FEED), 16#0B# (LINE TABULATION),
16#0C# (FORM FEED), 16#0D# (CARRIAGE RETURN), 16#85# (NEXT LINE), and
the characters in categories separator_line
and separator_paragraph.
{control character: See also format_effector}
13.a/2
Discussion: ISO/IEC
10646:2003 does not define the names of control characters, but rather
refers to the names defined by ISO/IEC 6429:1992. These are the names
that we use here.
13.1/2
other_control
Any character whose General Category is defined
to be “Other, Control”, and which is not defined to be a
format_effector.
13.2/2
other_private_use
Any character whose General Category is defined
to be “Other, Private Use”.
13.3/2
other_surrogate
Any character whose General Category is defined
to be “Other, Surrogate”.
14/2
graphic_character
Any character that is not in the categories other_control,
other_private_use, other_surrogate,
format_effector, and whose relative code position
in its plane is neither 16#FFFE# nor 16#FFFF#.
14.a/2
This paragraph was deleted. Implementation defined:
The control functions allowed in comments.
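The format_effector and graphic_character rules above can be approximated with a short Python sketch. It is an illustration only: it assumes that the General Category values exposed by Python's `unicodedata` module (which tracks a later Unicode version than ISO/IEC 10646:2003) are an acceptable stand-in, and the function names are hypothetical rather than anything defined by this International Standard.

```python
import unicodedata

# Code positions listed explicitly in the format_effector definition.
FORMAT_EFFECTOR_CODES = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}


def is_format_effector(ch):
    """Listed control codes, plus separator_line (Zl) and separator_paragraph (Zp)."""
    return (ord(ch) in FORMAT_EFFECTOR_CODES
            or unicodedata.category(ch) in ("Zl", "Zp"))


def is_excluded_position(ch):
    """Relative code position in the plane is 16#FFFE# or 16#FFFF#."""
    return (ord(ch) & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_graphic_character(ch):
    """Not other_control, other_private_use, other_surrogate or format_effector,
    and not at an excluded position in its plane."""
    if is_format_effector(ch) or is_excluded_position(ch):
        return False
    return unicodedata.category(ch) not in ("Cc", "Co", "Cs")


def allowed_outside_comments(ch):
    """Assumed reading of the Notes: other_control, other_private_use and
    other_surrogate characters are only allowed in comments."""
    return is_graphic_character(ch) or is_format_effector(ch)
```

Under this sketch, cells that Python reports as unassigned (category "Cn") count as graphic characters unless they sit at an excluded position, which mirrors the classification explained in the Discussion of paragraph 6.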
14.b/2
Discussion: {
AI95-00285-01}
We considered basing the definition of lexical
elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the
characters which should be supported in identifiers for all programming
languages, but we finally decided against this option. Note that it is
not our intent to diverge from ISO/IEC TR 10176, except to the extent
that ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which
is the case at the time of this writing [January 2005]).
14.c/2
More
precisely, we intend to align strictly with ISO/IEC 10646:2003. It must
be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC 10646:2003
is a Standard. If one has to make a choice, one should conform with the
Standard rather than with the Technical Report. And, it turns out that
one must make a choice because there are important differences
between the two:
14.d/2
- ISO/IEC
TR 10176 is still based on ISO/IEC 10646:2000 while ISO/IEC 10646:2003
has already been published for a year. We cannot afford to delay the
adoption of our amendment until ISO/IEC TR 10176 has been revised.
14.e/2
- There are
considerable differences between the two editions of ISO/IEC 10646, notably
in supporting characters beyond the BMP (this might be significant for
some languages, e.g. Korean).
14.f/2
- ISO/IEC
TR 10176 does not define case conversion tables, which are essential
for a case-insensitive language like Ada. To get case conversion tables,
we would have to reference either ISO/IEC 10646:2003 or Unicode, or we
would have to invent our own.
14.g/2
For
the purpose of defining the lexical elements of the language, we need
character properties like categorization, as well as case conversion
tables. These are mentioned in ISO/IEC 10646:2003 as useful for implementations,
with a reference to Unicode. Machine-readable tables are available on
the web at URLs:
14.h/2
14.i/2
with
an explanatory document found at URL:
14.j/2
14.k/2
The actual text of the
standard only makes specific references to the corresponding clauses
of ISO/IEC 10646:2003, not to Unicode.
15/2
{
AI95-00285-01}
{names of special_characters}
{special_character (names)}
The following names are
used when referring to certain
characters (the
first name is that given in ISO/IEC 10646:2003):
{quotation mark} {number
sign} {ampersand}
{apostrophe}
{tick}
{left parenthesis}
{right parenthesis}
{asterisk}
{multiply}
{plus sign}
{comma}
{hyphen-minus}
{minus}
{full stop}
{dot}
{point}
{solidus}
{divide}
{colon}
{semicolon}
{less-than sign}
{equals sign}
{greater-than sign}
{low line}
{underline}
{vertical line}
{exclamation
point} {percent
sign} {left
square bracket} {right
square bracket} {left
curly bracket} {right
curly bracket}
15.a/2
Discussion: {
AI95-00285-01}
{graphic symbols} {glyphs}
This table serves to show the correspondence between
ISO/IEC 10646:2003 names and the graphic symbols (glyphs) used in this
International Standard. These are the characters
that play a special role in the syntax of Ada;
we don't bother to define names for all characters.
The first name given is the name from ISO/IEC 10646:2003; the subsequent names,
if any, are those used within the standard, depending on context.
graphic symbol | name | graphic symbol | name
" | quotation mark | : | colon
# | number sign | ; | semicolon
& | ampersand | < | less-than sign
' | apostrophe, tick | = | equals sign
( | left parenthesis | > | greater-than sign
) | right parenthesis | _ | low line, underline
* | asterisk, multiply | | | vertical line
+ | plus sign | [ | left square bracket
, | comma | ] | right square bracket
- | hyphen-minus, minus | { | left curly bracket
. | full stop, dot, point | } | right curly bracket
/ | solidus, divide | |
Implementation Permissions
16/2
This paragraph was deleted. {AI95-00285-01}
In a nonstandard mode, the implementation may support
a different character repertoire[; in particular, the set of characters
that are considered identifier_letters can
be extended or changed to conform to local conventions].
16.a/2
This paragraph was deleted. Ramification: {AI95-00285-01}
If an implementation supports other character sets,
it defines which characters fall into each category, such as “identifier_letter,”
and what the corresponding rules of this section are, such as which characters
are allowed in the text of a program.
17/2
1 {AI95-00285-01}
The characters in categories other_control,
other_private_use, and other_surrogate
are only allowed in comments.
18
2 The language does not specify the source
representation of programs.
18.a/2
Discussion: Any source representation
is valid so long as the implementer can produce an (information-preserving)
algorithm for translating both directions between the representation
and the standard character set. (For example, every character in the
standard character set has to be representable, even if the output devices
attached to a given computer cannot print all of those characters properly.)
From a practical point of view, every implementer will have to provide
some way to process the ACATS.
It is the intent to allow source representations, such as parse trees,
that are not even linear sequences of characters. It is also the intent
to allow different fonts: reserved words might be in bold face, and that
should be irrelevant to the semantics.
Extensions to Ada 83
18.b
{
extensions to Ada 83}
Ada
95 allows 8-bit and 16-bit characters, as well as implementation-specified
character sets.
Wording Changes from Ada 83
18.c/2
{
AI95-00285-01}
The syntax rules in this clause are modified to remove the emphasis on
basic characters vs. others. (In this day and age, there is no need to
point out that you can write programs without using (for example) lower
case letters.) In particular,
character (representing
all characters usable outside comments) is added, and
basic_graphic_character,
other_special_character, and
basic_character
are removed.
Special_character is expanded
to include Ada 83's
other_special_character,
as well as new 8-bit characters not present in Ada 83.
Ada 2005 removes special_character altogether;
we want to stick to ISO/IEC 10646:2003 character classifications.
Note that the term “basic letter” is used in
A.3,
“
Character Handling” to refer to
letters without diacritical marks.
18.d/2
{
AI95-00285-01}
Character names now come from
ISO/IEC 10646:2003.
18.e/2
This paragraph was deleted. {AI95-00285-01}
We use identifier_letter
rather than letter since ISO 10646 BMP includes
many "letters" that are not permitted in identifiers (in the standard
mode).
Extensions to Ada 95
18.f/2
{
AI95-00285-01}
{
AI95-00395-01}
{extensions to Ada 95} Program
text can use most characters defined by ISO/IEC 10646:2003. This clause has
been rewritten to use the categories defined in that Standard. This should
ease programming in languages other than English.