Ada Lexical Analyzer

This document is based on the definition found in chapter 2 of the official Ada Reference Manual.

The lexer will read a file in one of the following five encodings:

ISO-8859-1
UCS-2 Little Endian
UCS-2 Big Endian
UCS-4 Little Endian
UCS-4 Big Endian

The encoding will be inferred using the following algorithm:

1. If the file has an odd size, use ISO-8859-1

2. If the file has a size which is not a multiple of 4, test the first two bytes:

2.1 If they hold the values (FE, FF) or (00, ??), use UCS-2 Big Endian
2.2 If they hold the values (FF, FE) or (??, 00), use UCS-2 Little Endian
2.3 Otherwise use ISO-8859-1

3. If the file has a size which is a multiple of 4, test the first four bytes:

3.1 If they hold the values (00, 00, FE, FF) or (00, 00, 00, ??), use UCS-4 Big Endian
3.2 If they hold the values (FF, FE, 00, 00) or (??, 00, 00, 00), use UCS-4 Little Endian
3.3 Otherwise, test the first two bytes as in point (2).
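
Below is a minimal sketch of this inference in Ada, assuming the whole file has been read into a byte array indexed from 1; the names encoding, byte_array and detect_encoding are hypothetical. A ?? in the rules above stands for any non-zero byte, which is what the comparisons against 0 capture.

type encoding is (iso_8859_1, ucs2_le, ucs2_be, ucs4_le, ucs4_be);
type byte is mod 256;
type byte_array is array (positive range <>) of byte;

function detect_encoding(f: byte_array) return encoding is
begin
   -- (1) an odd size can only be ISO-8859-1
   if f'length mod 2 /= 0 then
      return iso_8859_1;
   end if;
   -- (3) a size which is a multiple of 4: test the first four bytes
   if f'length mod 4 = 0 and f'length >= 4 then
      if (f(1) = 0 and f(2) = 0 and f(3) = 16#FE# and f(4) = 16#FF#)
         or (f(1) = 0 and f(2) = 0 and f(3) = 0 and f(4) /= 0)
      then
         return ucs4_be;
      elsif (f(1) = 16#FF# and f(2) = 16#FE# and f(3) = 0 and f(4) = 0)
         or (f(1) /= 0 and f(2) = 0 and f(3) = 0 and f(4) = 0)
      then
         return ucs4_le;
      end if;
      -- (3.3) otherwise fall through to the two-byte test
   end if;
   -- (2) test the first two bytes
   if f'length >= 2 then
      if (f(1) = 16#FE# and f(2) = 16#FF#) or (f(1) = 0 and f(2) /= 0) then
         return ucs2_be;
      elsif (f(1) = 16#FF# and f(2) = 16#FE#) or (f(1) /= 0 and f(2) = 0) then
         return ucs2_le;
      end if;
   end if;
   -- (2.3) the fallback
   return iso_8859_1;
end detect_encoding;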

Note that once the input file has been read, all we deal with are wide strings. Thus, internally, the compiler will only see UCS-2 characters in the native endianness of the machine running the compiler.

There are several means by which additional encodings could be supported. However, at this time it was decided that if additional encodings have to be supported, they should be converted before the lexer reads the input file. This would certainly be viewed as a valuable feature, but it won't be implemented in the lexer, which shouldn't have to deal with such character conversions.

The tool in question is quite simply iconv. Like the other parts of this compiler, it can either be used in a pipe or to transform one set of characters into another up front. The output (-t option) should use the native UCS-2 encoding of the CPU in use when running.

The current Unix iconv is very well documented and complete. This is what we need to duplicate in Ada.

Of course, the Ada implementation will include a library to do all the conversions and a very small program which can be used as a command line tool.

One way to allow for different encodings within a file is to use an encoding pragma. The main problem is that a file declaring its own encoding is somewhat circular: the pragma has to be read before the encoding is known. Yet many encodings share the Latin letters ('a'-'z', 'A'-'Z'), and most of the special characters required (parentheses, semicolon, equals sign, greater-than sign) can be recognized easily enough to allow for an early definition of such a pragma:

pragma encoding([name =>] encoding_identifier);

As we can see, the only characters necessary to encode this pragma are [a-zA-Z()=>;0-9_]. The encoding identifier could include digits and underscores. The underscore character, however, could be removed from this identifier, since some encodings do not have it and yet have all the other characters available.

If we are to support it, this pragma could be searched for by a tool running before the lexer to determine the encoding.

The lexer will accept any format effector as a line separator, except the horizontal tabulation (HT) character.

Note that this is defined this way because some compilers only support a limited number of characters on a single line. This lexer, and the entire compiler, are not limited to any particular line length; thus knowing what marks the end of a line isn't of much importance, except to know where a comment terminates.
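
As a minimal sketch, assuming the Ada 95 set of format effectors (HT, VT, CR, LF and FF), the end-of-line test could look as follows; is_line_separator is a hypothetical name:

function is_line_separator(c: wide_character) return boolean is
   lf: constant wide_character := wide_character'val(16#0A#);
   vt: constant wide_character := wide_character'val(16#0B#);
   ff: constant wide_character := wide_character'val(16#0C#);
   cr: constant wide_character := wide_character'val(16#0D#);
begin
   -- every format effector except HT (16#09#) separates lines
   return c = lf or c = vt or c = ff or c = cr;
end is_line_separator;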

This compiler bounds neither the length of a line nor the length of an identifier. Obviously, your system may not support very long identifiers at link time if you are also using other compilers.

For compatibility reasons, you may also want to limit the length of a line to 200 characters or fewer.

In order to enforce limits for compatibility with other Ada compilers, a pragma could be introduced to limit the length of an identifier, of a line, and also of strings and numbers. The following is currently proposed. Until such a pragma is defined, the lengths are not limited.

pragma length_limits([identifier =>] decimal_value,
                     [line =>] decimal_value,
                     [string =>] decimal_value,
                     [number =>] decimal_value);

With support for really large numbers (millions of digits), it could be a good idea to limit their size in the text of an Ada program.

Note that only some of the lengths need be limited. Use the special unlimited keyword to restore the default behavior (note that this is not a reserved keyword; it is part of the pragma's semantics to accept special keywords). The decimal_value must be a string of digits [0-9] only, since these values are parsed by the lexer and not by the actual first level parser.
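
A minimal sketch of how the lexer could keep these limits internally; the type and component names are hypothetical, and zero stands in for the unlimited keyword:

type length_limits_type is
   record
      identifier_limit: natural := 0;  -- 0 means unlimited (the default)
      line_limit:       natural := 0;
      string_limit:     natural := 0;
      number_limit:     natural := 0;
   end record;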

It is to be noted that only one of the following tokens* can be used to start a valid Ada program:

-- comment
function
generic
package
pragma
private
procedure
separate
with

This implies that the lexer can decide early on whether the input file is worth dealing with. A proper error should still be printed, however, and this may not be easy to implement in the lexer, which doesn't have a good view of what was intended and thus of what to print out.

* Separators are not viewed as tokens. They are consumed by the lexer and not transferred to the next level.
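
A minimal sketch of this early check; the *_tok names below are hypothetical members of the token_id enumeration defined later in this document. Since comments never reach the parser, only the keyword tokens need to be tested:

function valid_first_token(id: token_id) return boolean is
begin
   case id is
      when function_tok | generic_tok | package_tok | pragma_tok
         | private_tok | procedure_tok | separate_tok | with_tok =>
         return true;
      when others =>
         return false;
   end case;
end valid_first_token;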

The output of the lexer is the list of tokens of the parsed input. All the tokens are written to a Direct_IO stream one after another. The strings are saved using the UCS-2 encoding: little endian on little endian machines and big endian on big endian machines. This means that compiling across nodes of a larger system may eventually require a byte-swapping tool if the systems involved have different endianness.

The structure of one token is defined below. Note that only a few tokens (such as the identifier) actually use the string parameter; the system won't save a string in the output stream if the token doesn't require one. The list of token identifiers is used to recognize each token. Note that all the single character operators are kept as such in the enumeration.

type token_id is (
   unknown_tok,
   identifier_tok,
   integer_tok,
   ...
);

One token is defined as a token identifier and a string of wide characters. Use the function has_string(t: token) return boolean; to know whether a token has a valid string. Note that it was chosen to always use a wide character string since, with modern computers, transferring such strings is still very fast, and it makes it much easier to deal with all cases without having to handle a multibyte encoding all over the place.

type token(length: natural := 0) is
   record
      id: token_id;
      -- a discriminant is required because a record component cannot be of
      -- the unconstrained array type wide_string on its own
      name: wide_string(1 .. length);
   end record;
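
A minimal sketch of has_string; the case below only lists the two string-bearing tokens shown in the (elided) enumeration above, so the real list would be longer:

function has_string(t: token) return boolean is
begin
   case t.id is
      when identifier_tok | integer_tok =>  -- plus strings, numbers, etc.
         return true;
      when others =>
         return false;
   end case;
end has_string;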

The tokens will be used by the lexer and the 1st level parser. It is therefore important to declare this structure in a package which will later be available to the parser.

The output will be composed of one byte indicating the version and endianness of the output file. Then all the tokens follow, each saved as token_id'output(stream, t.id) and, where applicable, wide_string'output(stream, t.name). Don't forget that strings are saved only if necessary.
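
A minimal sketch of the serialization; write_token and the stream parameter s are hypothetical names. The 'output attribute writes the value (and, for a wide_string, its bounds) to the stream:

with ada.streams;

procedure write_token(s: access ada.streams.root_stream_type'class;
                      t: token) is
begin
   token_id'output(s, t.id);
   if has_string(t) then
      wide_string'output(s, t.name);
   end if;
end write_token;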

Because it is very important, the filename and line numbers must also be passed down to the parser. This information will make it all the way to the final binary file, which will include it for later reference to the source file via a debugger and exception handling. It is transferred as the two special tokens filename_tok and line_tok. Note that, to avoid overloading the pipe between the lexer and the following level parsers, we save the line and column numbers in the first and second characters of the name parameter (in effect, this isn't really a name*). This is valid since it is really unlikely that anyone would have an input file of more than 65536 lines or columns (I can't think of any compiler which could compile such a large file without having some problems anyway!).

* Some character values aren't valid by themselves (well, it's 16#D800# to 16#DFFF#; that happens after 55295 lines of code?!?) and some are invalid according to the Unicode organization. However, Ada 95 doesn't forbid them, i.e. a wide_character is defined as all the values from 0 to 16#FFFF#.
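
A minimal sketch of the packing; make_line_token is a hypothetical helper. wide_character'val maps a value in 0 .. 65535 to the corresponding UCS-2 character (keeping in mind the 16#D800# .. 16#DFFF# caveat above):

function make_line_token(line, column: natural) return token is
begin
   return (length => 2,
           id     => line_tok,
           name   => wide_character'val(line) & wide_character'val(column));
end make_line_token;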

1. Though the definition says that no pragma nor any attribute shall be a reserved keyword, there are several attributes which use a keyword. For this reason, the apostrophe (') is not parsed as a token by itself; instead it includes a string which is the identifier following it.

2. Numbers are not converted by the lexer. In other words, numbers are passed to the 1st level parser (and possibly other levels) as strings. This is important in order to keep numbers of any precision and size until we clearly determine the destination type (also, constants can remain strings and be automatically converted as required when necessary).

Some compilers include the comments in the output file. I don't see any interest in doing so: since we have the line numbers, a good debugger utility can show the source file and thus all the comments. Therefore, all the comments found in the input files are simply ignored by the lexer.

Comments can include all characters except the invalid ones: FFFE, FFFF and the nul.

Note:
The NUL character could be supported. It was removed mainly because it could interfere with the inference of the file encoding, since zero bytes are viewed as the leading zeroes of characters extended to 16 or 32 bits. (A security feature.)


The primary idea behind Ada 95 is to enable internationalization. The definition allows for implementation-defined encodings; however, it defines the valid identifier characters as being only the Latin letters. There is really no reason to restrict identifiers in this way except for compatibility between systems and compilers (for instance, a C compiler is still limited to [a-zA-Z0-9_]).

Thus, by default we will only accept Latin letters, but the international_identifiers pragma can be used to allow any character. For this reason, the lexer will always accept all characters. This pragma has to appear before any identifier using characters other than Latin letters.

IMPORTANT NOTE: this is done this way to conform to the specification; I think the pragma shouldn't be necessary and all graphic characters should always be valid in identifiers.
IMPORTANT NOTE 2: it is likely that we will look into allowing the various characters which represent digits to act as digits (such as the Arabic digits) and mathematical signs to act as mathematical signs (such as IN[cluded] and NOT IN[cluded]); thus only letters should be used in identifiers to ensure compatibility with future versions of this compiler.

pragma international_identifiers;

The following table shows all the characters of row 00 of ISO-10646-1 (the ISO-8859-1 range; its lower half is ISO-646).

Some characters, such as most of the controls, are forbidden. Forbidden characters are caught by the lexer, and the compilation is stopped unless they were found in a comment. [In regard to 2.1 (14)]

The formatting controls (format_effector: HT, LF, VT, FF and CR) are allowed in the standard text of an Ada program.

The Latin letters are valid characters in an identifier. Note that an identifier can't start with Low_Line (_) or a digit (0-9). Also, no two Low_Line (_) characters can follow each other in an identifier.

Several of the remaining graphic characters appear in the various operators and delimiters accepted in an Ada program.

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
8    res  res  BPH  NBH  IND  NEL  SSA  ESA  HTS  HTJ  VTS  PLD  PLU  RI   SS2  SS3
9    DCS  PU1  PU2  STS  CCH  MW   SPA  EPA  SOS  res  SCI  CSI  ST   OSC  PM   APC
A    NBSP ¡    ¢    £    ¤    ¥    ¦    §    ¨    ©    ª    «    ¬    SHY  ®    ¯
B    °    ±    ²    ³    ´    µ    ¶    ·    ¸    ¹    º    »    ¼    ½    ¾    ¿
C    À    Á    Â    Ã    Ä    Å    Æ    Ç    È    É    Ê    Ë    Ì    Í    Î    Ï
D    Ð    Ñ    Ò    Ó    Ô    Õ    Ö    ×    Ø    Ù    Ú    Û    Ü    Ý    Þ    ß
E    à    á    â    ã    ä    å    æ    ç    è    é    ê    ë    ì    í    î    ï
F    ð    ñ    ò    ó    ô    õ    ö    ÷    ø    ù    ú    û    ü    ý    þ    ÿ

(res = reserved position, SP = space, NBSP = no-break space, SHY = soft hyphen)

Nearly all the other characters (0100 to FFFD) are acceptable as identifier letters as long as the pragma international_identifiers; is used. Note that according to the Unicode consortium, the characters FFFE and FFFF are forbidden. This lexer also forbids them, mainly because they will be reported as an error in the program text like any other error.
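
A minimal sketch of the identifier checks described with the table above; valid_identifier and is_identifier_letter are hypothetical names (the latter would consult the table, extended by the pragma). Per-character validation of the inner letters and digits is omitted; note that the Reference Manual also forbids a trailing Low_Line:

function valid_identifier(name: wide_string) return boolean is
begin
   if name'length = 0
      or else not is_identifier_letter(name(name'first))  -- no leading _ or digit
      or else name(name'last) = '_'                       -- no trailing _
   then
      return false;
   end if;
   for i in name'first + 1 .. name'last loop
      if name(i) = '_' and then name(i - 1) = '_' then    -- no two _ in a row
         return false;
      end if;
   end loop;
   return true;
end valid_identifier;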

This lexer will understand the replacement characters, for two reasons: (1) for completeness and (2) because they don't prevent any of the regular tokens from being used and defined properly. These replacements are ! (for |, the choice delimiter), % (for ", the string delimiter) and : (for # in based numbers).

Note that the lexer will replace these characters right away so the rest of the compiler doesn't have to deal with them.
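
A minimal sketch of the substitution; replace_delimiter is a hypothetical helper which the lexer would only apply where the replacement is legal (for instance, % only where a string delimiter is expected and : only within a based literal):

function replace_delimiter(c: wide_character) return wide_character is
begin
   case c is
      when '!' =>    return '|';  -- choice delimiter
      when '%' =>    return '"';  -- string delimiter
      when ':' =>    return '#';  -- based literal delimiter
      when others => return c;
   end case;
end replace_delimiter;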

This is the list of reserved keywords recognized by the lexer. Some of them (for instance range and access) are also used as attributes.

abort      do         new         requeue
abs        else       not         return
abstract   elsif      null        reverse
accept     end        of          select
access     entry      or          separate
aliased    exception  others      subtype
all        exit       out         tagged
and        for        package     task
array      function   pragma      terminate
at         generic    private     then
begin      goto       procedure   type
body       if         protected   until
case       in         raise       use
constant   is         range       when
declare    limited    record      while
delay      loop       rem         with
delta      mod        renames     xor
digits

We support based numeric literals with the # and : characters. Note that by default the base is limited to the range 2 to 16. Use the pragma large_based_numeric_literal to raise the maximum to 36 instead. When large based numeric literals are enabled, they can include the Latin letters A to Z, in uppercase or lowercase.

pragma large_based_numeric_literal;
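
A minimal sketch of how a digit of a based literal could be valued once the pragma is active; digit_value is a hypothetical helper:

function digit_value(c: wide_character) return natural is
begin
   case c is
      when '0' .. '9' =>
         return wide_character'pos(c) - wide_character'pos('0');
      when 'A' .. 'Z' =>  -- only 'A' .. 'F' without the pragma
         return wide_character'pos(c) - wide_character'pos('A') + 10;
      when 'a' .. 'z' =>
         return wide_character'pos(c) - wide_character'pos('a') + 10;
      when others =>
         raise constraint_error;  -- not a digit in any supported base
   end case;
end digit_value;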