This document is based on the definition found in the official Ada Reference Manual, chapter 2.
The lexer will read a file in one of the following five formats:
ISO-8859-1
UCS-2 Little Endian
UCS-2 Big Endian
UCS-4 Little Endian
UCS-4 Big Endian
The format will be inferred from the first bytes of the input file.
Note that once the input file has been read, everything is handled as
wide strings. Thus, internally, the compiler will only see UCS-2
characters in the byte order of the machine running the compiler.
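The exact detection algorithm isn't spelled out here; what follows is a
minimal sketch of one plausible scheme, assuming the format is inferred
from the pattern of zero bytes among the first four bytes of the file
(all names are hypothetical):

type encoding_type is (iso_8859_1, ucs2_le, ucs2_be, ucs4_le, ucs4_be);
type byte is mod 256;
type byte_quad is array (1 .. 4) of byte;

-- the first character of an Ada source is below 16#100#, so the
-- position of the zero bytes reveals the width and the byte order
function infer_encoding(head: byte_quad) return encoding_type is
begin
    if head(1) = 0 and head(2) = 0 and head(3) = 0 then
        return ucs4_be;       -- 00 00 00 xx
    elsif head(2) = 0 and head(3) = 0 and head(4) = 0 then
        return ucs4_le;       -- xx 00 00 00
    elsif head(1) = 0 then
        return ucs2_be;       -- 00 xx
    elsif head(2) = 0 then
        return ucs2_le;       -- xx 00
    else
        return iso_8859_1;    -- no leading zeroes at all
    end if;
end infer_encoding;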
There are several means which could be used to support additional
encodings. However, at this time it was decided that if additional
encodings have to be supported, these should be converted before the
lexer reads the input file. Such support would certainly be a very
valuable feature, but it won't be implemented in the lexer, which
shouldn't have to deal with character conversions.
The tool in question is quite simply iconv. Like the other parts of
this compiler, it can either be used in a pipe or to first transform a
set of characters into another. The output (-t option) should use the
native UCS-2 encoding of the CPU in use when running.
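For instance, on a little endian machine, something like the following
should produce the expected input (the -f/-t options are the standard
iconv options; the file names are examples):

iconv -f ISO-8859-1 -t UCS-2LE program.ada > program.ucs2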
The current Unix iconv is very well documented and complete. This is
what we need to duplicate in Ada.
Of course, the Ada implementation will include a library to do all
the conversions and a very small program which can be used as a command
line tool.
One way to allow for different encodings within a file is to use the
encoding pragma. The main problem is that storing, inside a file, the
description of how that file should be read is somewhat circular. Yet,
many encodings use the latin letters ('a-z', 'A-Z'), and most of the
special characters needed (parentheses, semicolon, equal sign,
greater-than sign) can be recognized easily enough to allow for an
early definition of such a pragma:
pragma encoding([name =>] encoding_identifier);
As we can see, the only characters necessary to encode this pragma
are [a-zA-Z()=>;0-9_]. The encoding identifier could include digits
and underscores. The underscore character, however, could be removed
from this identifier, since some encodings may lack it while still
having all the other characters available.
This pragma could be searched for by a tool running before the lexer
to determine the encoding, if we are to support it.
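As an illustration, such a file could start with (the encoding name
shown is purely hypothetical):

pragma encoding(name => shift_jis);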
This lexer will accept any format effector as a line separator
except the horizontal tabulation (HT) character.
Note that this is defined in this way because some compilers only
support a limited number of characters on a single line. This
lexer, and the entire compiler, isn't limited to any line length;
thus, knowing what marks the end of a line isn't of much importance
except to know where a comment terminates.
This compiler doesn't bound the length of a line nor the length of
an identifier. Obviously, your system may not support overly long
identifiers at link time if you are also using other compilers.
For compatibility reasons, you may also want to limit the length of
a line to 200 characters or fewer.
In order to enforce limits for compatibility with other Ada
compilers, a pragma could be introduced to limit the length of an
identifier, of a line, and also of strings and numbers. The following
is currently proposed. Until such a pragma is defined, the lengths are
not limited.
pragma length_limits([identifier =>] decimal_value,
                     [line =>] decimal_value,
                     [string =>] decimal_value,
                     [number =>] decimal_value);
With support for really large numbers (millions of digits), it could
be a good idea to limit their size in the text of an Ada program.
Note that it is possible to limit only some of the lengths. Use the
special unlimited keyword to get the default behavior back (note that
this is not a reserved keyword; it is part of the pragma semantics to
accept special keywords). The decimal_value must be a string of
digits [0-9] only, since these are parsed by the lexer and not by
the actual first level parser.
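For instance, a unit could bound identifiers and lines while leaving
strings and numbers unlimited (the values are arbitrary examples):

pragma length_limits(identifier => 63,
                     line       => 200,
                     string     => unlimited,
                     number     => unlimited);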
It is to be noted that only one of the following tokens* can be
used to start a valid Ada program:
This implies that the lexer can decide early on whether the input
file is worth dealing with. A proper error should however be printed,
and this may not be easy to implement in the lexer, which wouldn't
have a good view of what was intended and thus of what to print out.
* Separators are not viewed as tokens. These are read by the lexer and
not transferred to the next level.
The output of the lexer is a list of tokens of the parsed input.
All the tokens are written to a Direct_IO stream one after another.
The strings are saved using the UCS-2 encoding, little endian on
little endian machines and big endian on big endian machines.
This means compiling across nodes of a larger system will eventually
require a byte swapping tool if the systems differ in endianness.
The structure of one token is defined below. Note that only a few
tokens (such as the identifier) actually use the string parameter. The
system won't save a string in the output stream if that token doesn't
require it. The list of token identifiers is used to recognize the
token. Note that all the single character operators are kept as such in
the enumeration.
One token is defined as a token identifier and a string of wide
characters. Use the function has_string(t: token) return boolean; to
know whether a token has a valid string. Note that it was chosen to
always use a wide character string since, with modern computers,
transferring such strings is still very fast, and it makes it much
easier to deal with all cases without having to handle a multibyte
encoding all over the place.

type token_id is (
    unknown_tok,
    identifier_tok,
    integer_tok,
    ...
);
type token (length: natural := 0) is
    record
        id: token_id;
        -- a discriminant is needed here: a plain wide_string component
        -- would be unconstrained and thus illegal in Ada
        name: wide_string(1 .. length);
    end record;
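A minimal sketch of has_string follows; exactly which tokens carry a
string is an assumption here (identifiers, numbers kept as text per
footnote 2 below, and the filename/line tokens described later):

function has_string(t: token) return boolean is
begin
    case t.id is
        when identifier_tok | integer_tok | filename_tok | line_tok =>
            return true;
        when others =>
            return false;
    end case;
end has_string;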
The tokens will be used by the lexer and the 1st level parser. It
is therefore important to declare this structure in a package which
will later be available to the parser.
The output will be composed of one byte indicating the version and
endianness of the output file. Then all the tokens follow, each saved
with token_id'output(stream, t.id) and, when present,
wide_string'output(stream, t.name). Don't forget that strings are
saved only when necessary.
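A minimal sketch of writing one token, assuming a stream access as
required by the 'output attributes (write_token is a hypothetical
name):

with ada.streams.stream_io; use ada.streams.stream_io;

procedure write_token(s: stream_access; t: token) is
begin
    token_id'output(s, t.id);
    if has_string(t) then
        wide_string'output(s, t.name);  -- saved only when needed
    end if;
end write_token;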
Because it is very important, the filename and line numbers must
also be passed down to the parser. This information will make it all
the way to the final binary file, which will include it for later
reference to the source file via a debugger and exception handling.
This information is transferred as the two special tokens filename_tok
and line_tok. Note that, in order not to overload the pipe between the
lexer and the following level parsers, we will save the line and
column numbers in the first and second characters of the name
parameter (in effect, this isn't really a name*). This is valid since
it is really unlikely that anyone would have an input file of more
than 65536 lines or columns (I can't think of any compiler which could
compile such a large file without having some problems anyway!).
* Some character codes aren't valid by themselves (well, it's D800 to
DFFF; does that happen after 55295 lines of code?!?) and some are
invalid according to the Unicode organization. However, Ada 95 doesn't
forbid them; i.e. a wide_character is defined as all the values from
0 to FFFF.
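A minimal sketch of the packing described above (make_line_token is a
hypothetical name; wide_character'val only accepts values 0 to 65535,
hence the limit mentioned):

function make_line_token(line, column: natural) return token is
begin
    return (length => 2,
            id     => line_tok,
            name   => (1 => wide_character'val(line),
                       2 => wide_character'val(column)));
end make_line_token;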
1. Though the definition says that no pragma nor any attribute shall
be a reserved keyword, there are several attributes which use a
keyword. For this reason, the apostrophe character (') is not parsed
as a token by itself; instead it includes a string which is the
identifier following it.
2. Numbers are not converted by the lexer. In other words, numbers
will be passed to the 1st level parser (and possibly other levels) as
strings. This is important in order to keep numbers of any precision
and size until we clearly determine the destination type (also,
constants can remain strings and be automatically converted as
required when necessary).
Some programs will include the comments in the output file. I don't
see any interest in doing so since we have the line numbers; a good
debugger utility can show the source file and thus all the comments.
Therefore, all the comments found in the input files will be ignored
right away by the lexer.
Comments can include all characters except the invalid ones: FFFE,
FFFF and the nul character.
Note: the nul character could be supported. It was removed mainly
because it could interfere with the inference of the file encoding,
since zeroes are viewed as the leading zeroes of 8 or 16 bit extended
characters (a security feature).
The primary idea behind Ada 95 is to enable internationalization to
take place. The definition allows for implementation defined
encodings; however, it defines the valid identifier characters as
being only latin letters. There is really no reason to restrict the
encoding of identifiers this way except for compatibility between
systems and compilers (for instance, a C compiler is still limited to
[a-zA-Z0-9_]).
Thus, by default we will only accept latin letters, but the
international identifiers pragma can be used to allow any characters.
For this reason, the lexer will always accept all characters. This
pragma has to appear before any identifier using characters other than
latin letters.
IMPORTANT NOTE: this is done in this way to conform to the
specifications; I think the pragma shouldn't be necessary and all
graphic characters should always be valid in identifiers.
IMPORTANT NOTE 2: it is likely that we will look into allowing the
different characters which represent digits as digits (such as the
Arabic digits) and mathematical signs as mathematical signs (such as
the IN[cluded] and NOT IN[cluded] symbols); thus, only letters should
be used in identifiers to ensure compatibility with future versions of
this compiler.
pragma international_identifiers;
The following table shows all the characters in ISO-646-1 (row 00 of
ISO-10646-1). The cells marked in red present the characters which are
forbidden (such as controls). Forbidden characters are caught by the
lexer and the compilation will be stopped unless they were found in a
comment. [In regard to 2.1 (14)]
The characters marked in light yellow are formatting controls
(format_effector) which are allowed in the standard text of an Ada
program.
The cells marked in green are valid latin letters which can be used
as characters in an identifier. Note that an identifier can't start
with the Low_Line (_) nor a digit (0-9). Also, no two Low_Line (_) can
follow each other in an identifier.
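These underscore and digit rules translate into a small check; a
minimal sketch (valid_identifier is a hypothetical name, and the
letter check itself is left out):

function valid_identifier(name: wide_string) return boolean is
begin
    if name'length = 0
      or else name(name'first) = '_'
      or else name(name'first) in '0' .. '9'
    then
        return false;
    end if;
    for i in name'first + 1 .. name'last loop
        if name(i) = '_' and then name(i - 1) = '_' then
            return false;  -- no two low_lines may follow each other
        end if;
    end loop;
    return true;
end valid_identifier;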
The characters marked in purple are among the characters which can
be found in the different operators accepted in an Ada program.
00 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
0 | ^@ NUL | ^A SOH | ^B STX | ^C ETX | ^D EOT | ^E ENQ | ^F ACK | ^G BEL | ^H BS | ^I HT | ^J LF | ^K VT | ^L FF | ^M CR | ^N SO | ^O SI |
1 | ^P DLE | ^Q DC1 | ^R DC2 | ^S DC3 | ^T DC4 | ^U NAK | ^V SYN | ^W ETB | ^X CAN | ^Y EM | ^Z SUB | ^[ ESC | ^\ FS | ^] GS | ^^ RS | ^_ US |
2 | <space> | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | <del> |
8 | ~@ reserved | ~A reserved | ~B BPH | ~C NBH | ~D IND | ~E NEL | ~F SSA | ~G ESA | ~H HTS | ~I HTJ | ~J VTS | ~K PLD | ~L PLU | ~M RI | ~N SS2 | ~O SS3 |
9 | ~P DCS | ~Q PU1 | ~R PU2 | ~S STS | ~T CCH | ~U MW | ~V SPA | ~W EPA | ~X SOS | ~Y reserved | ~Z SCI | ~[ CSI | ~\ ST | ~] OSC | ~^ PM | ~_ APC |
A | <nbsp> | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | <shy> | ® | ¯ |
B | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
C | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
D | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
E | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
F | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
Nearly all the other characters (0100 to FFFD) are acceptable as
identifier letters as long as the pragma international_identifiers;
is used. Note that according to the Unicode consortium, the characters
FFFE and FFFF are forbidden. This lexer also forbids them, mainly
because their presence in the program text will be flagged as an error
like any other.
This lexer will understand the replacement characters for two
reasons: (1) for completeness and (2) because it doesn't prevent any
of the regular tokens from being used and defined properly. These
replacements are the ! (for |, the choice delimiter), % (for ", the
string delimiter) and : (for # in based numbers).
Note that the lexer will replace these characters right away so the
rest of the compiler doesn't have to deal with them.
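As an illustration, the following pairs are equivalent once the lexer
has performed the replacement (the declarations are arbitrary
examples):

flags : constant := 2#1010#;          -- usual based literal
flags2: constant := 2:1010:;          -- ':' replacing '#'
text  : constant string := "hello";   -- usual string delimiters
text2 : constant string := %hello%;   -- '%' replacing '"'
-- and in a case alternative, '!' may replace '|':
--     when 1 ! 2 =>     instead of     when 1 | 2 =>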
This is the list of reserved keywords recognized by the lexer. Some
of them (such as access and range) are also used as attributes.
abort    | else      | new       | return    |
abs      | elsif     | not       | reverse   |
abstract | end       | null      | select    |
accept   | entry     | of        | separate  |
access   | exception | or        | subtype   |
aliased  | exit      | others    | tagged    |
all      | for       | out       | task      |
and      | function  | package   | terminate |
array    | generic   | pragma    | then      |
at       | goto      | private   | type      |
begin    | if        | procedure | until     |
body     | in        | protected | use       |
case     | is        | raise     | when      |
constant | limited   | range     | while     |
declare  | loop      | record    | with      |
delay    | mod       | rem       | xor       |
delta    |           | renames   |           |
digits   |           | requeue   |           |
do       |           |           |           |
We support based numeric literals with the # and : characters. Note
that the base is limited by default to the range 2 to 16. Use the
pragma large_based_numeric_literal to change the maximum to 36
instead. When large based numeric literals are enabled, they can
include the latin letters A to Z in uppercase or lowercase.
pragma large_based_numeric_literal;
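For instance (the values are arbitrary examples):

mask: constant := 16#ff#;    -- always accepted: bases 2 to 16
pragma large_based_numeric_literal;
big : constant := 36#zz#;    -- now valid: digits 0-9 then the letters
                             -- a to z (this value is 1295)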