A.4.11 String Encoding
1/3
{AI05-0137-2}
Facilities for encoding, decoding, and converting
strings in various character encoding schemes are provided by packages
Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings,
Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
2/3
{AI05-0137-2}
The encoding library packages have the following
declarations:
3/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);
4/3
   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
5/3
   subtype UTF_String is String;
6/3
   subtype UTF_8_String is String;
7/3
   subtype UTF_16_Wide_String is Wide_String;
8/3
   Encoding_Error : exception;
9/3
   BOM_8 : constant UTF_8_String :=
             Character'Val(16#EF#) &
             Character'Val(16#BB#) &
             Character'Val(16#BF#);
10/3
   BOM_16BE : constant UTF_String :=
                Character'Val(16#FE#) &
                Character'Val(16#FF#);
11/3
   BOM_16LE : constant UTF_String :=
                Character'Val(16#FF#) &
                Character'Val(16#FE#);
12/3
   BOM_16 : constant UTF_16_Wide_String :=
              (1 => Wide_Character'Val(16#FEFF#));
13/3
   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;
14/3
end Ada.Strings.UTF_Encoding;
15/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);
16/3
   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
17/3
   function Convert (Item         : UTF_String;
                     Input_Scheme : Encoding_Scheme;
                     Output_BOM   : Boolean := False)
      return UTF_16_Wide_String;
18/3
   function Convert (Item       : UTF_8_String;
                     Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
19/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
20/3
   function Convert (Item       : UTF_16_Wide_String;
                     Output_BOM : Boolean := False) return UTF_8_String;
21/3
end Ada.Strings.UTF_Encoding.Conversions;
22/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure (Strings);
23/3
   -- Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
24/3
   function Encode (Item       : String;
                    Output_BOM : Boolean := False) return UTF_8_String;
25/3
   function Encode (Item       : String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
26/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return String;
27/3
   function Decode (Item : UTF_8_String) return String;
28/3
   function Decode (Item : UTF_16_Wide_String) return String;
29/3
end Ada.Strings.UTF_Encoding.Strings;
30/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Strings is
   pragma Pure (Wide_Strings);
31/3
   -- Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
32/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False) return UTF_8_String;
33/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
34/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_String;
35/3
   function Decode (Item : UTF_8_String) return Wide_String;
36/3
   function Decode (Item : UTF_16_Wide_String) return Wide_String;
37/3
end Ada.Strings.UTF_Encoding.Wide_Strings;
38/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
   pragma Pure (Wide_Wide_Strings);
39/3
   -- Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False) return UTF_String;
40/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False) return UTF_8_String;
41/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
42/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
43/3
   function Decode (Item : UTF_8_String) return Wide_Wide_String;
44/3
   function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
45/3
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
46/3
{AI05-0137-2} {AI05-0262-1}
The type Encoding_Scheme defines encoding schemes.
UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of
ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined
by Annex C of ISO/IEC 10646 in 8-bit, big-endian order; and UTF_16LE
corresponds to the UTF-16 encoding scheme in 8-bit, little-endian order.
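As a non-normative illustration of the three schemes, the following sketch encodes a single Latin-1 character (position 16#E9#) with each value of Encoding_Scheme and prints the position values of the resulting elements. The procedure name Show_Schemes and the helper Dump are hypothetical names chosen for this example only.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Schemes is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  Hypothetical helper: print each element of an encoded string
   --  as its position value.
   procedure Dump (Label : String; Item : UTF_String) is
   begin
      Ada.Text_IO.Put (Label & ":");
      for C of Item loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;

   --  One unencoded character, code point 16#E9#.
   E_Acute : constant String := (1 => Character'Val (16#E9#));
begin
   Dump ("UTF_8   ", Enc.Encode (E_Acute, UTF_8));     --  195 169
   Dump ("UTF_16BE", Enc.Encode (E_Acute, UTF_16BE));  --  0 233
   Dump ("UTF_16LE", Enc.Encode (E_Acute, UTF_16LE));  --  233 0
end Show_Schemes;
```

The two UTF-16 results contain the same 16-bit code unit and differ only in the order of its two 8-bit halves.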
47/3
{AI05-0137-2}
The subtype UTF_String is used to represent a String
of 8-bit values containing a sequence of values encoded in one of three
ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used
to represent a String of 8-bit values containing a sequence of values
encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent
a Wide_String of 16-bit values containing a sequence of values encoded
in UTF-16.
48/3
{AI05-0137-2} {AI05-0262-1}
The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants
correspond to values used at the start of a string to indicate the encoding.
49/3
{AI05-0137-2} {AI05-0262-1}
Each of the Convert and Encode functions returns
a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose
characters have position values that correspond to the encoding of the
Item parameter according to the encoding scheme required by the function
or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding
is returned. A BOM is included at the start of the returned string if
the Output_BOM parameter is set to True. The lower bound of the returned
string is 1.
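The effect of Output_BOM and the lower-bound rule can be checked with a short sketch such as the one below. It is illustrative only: the procedure name Show_Output_BOM is hypothetical, and the local Assertion_Policy pragma is there so that the assertions are checked regardless of compiler defaults.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Output_BOM is
   pragma Assertion_Policy (Check);

   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  The UTF_8_String result type selects the UTF-8 overload of Encode.
   Plain    : constant UTF_8_String := Enc.Encode ("Ada");
   With_BOM : constant UTF_8_String :=
     Enc.Encode ("Ada", Output_BOM => True);
begin
   --  Pure ASCII input encodes to itself in UTF-8.
   pragma Assert (Plain = "Ada");

   --  With Output_BOM => True the result starts with BOM_8.
   pragma Assert (With_BOM = BOM_8 & "Ada");

   --  The lower bound of the returned string is 1.
   pragma Assert (Plain'First = 1 and With_BOM'First = 1);

   Ada.Text_IO.Put_Line ("All checks passed");
end Show_Output_BOM;
```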
50/3
{AI05-0262-1}
Each of the Encode functions takes a String, Wide_String,
or Wide_Wide_String Item parameter that is assumed to be an array of
unencoded characters. Each of the Convert functions takes a UTF_String
(respectively UTF_8_String and UTF_16_Wide_String) Item parameter that is
assumed to contain characters whose position values correspond to a valid
encoding sequence according to the encoding scheme required by the function
or specified by its Input_Scheme parameter.
51/3
{AI05-0137-2} {AI05-0262-1}
Each of the Decode functions takes a UTF_String
(respectively UTF_8_String and UTF_16_Wide_String) Item parameter that is
assumed to contain characters whose position values correspond to a valid
encoding sequence according to the encoding scheme required by the function
or specified by its Input_Scheme parameter, and returns the corresponding
String, Wide_String, or Wide_Wide_String value. The lower bound of the
returned string is 1.
52/3
{AI05-0137-2} {AI05-0262-1}
For each of the Convert and Decode functions, an
initial BOM in the input that matches the expected encoding scheme is
ignored, and a different initial BOM causes Encoding_Error to be propagated.
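The BOM rule for inputs can be illustrated as follows. This is a sketch under the assumption that the implementation follows the paragraph above exactly; the procedure name Show_Input_BOM is hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Input_BOM is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;
begin
   --  A UTF-8 BOM in front of UTF-8 input matches the expected
   --  scheme, so it is skipped and only "Ada" is decoded.
   Ada.Text_IO.Put_Line (Enc.Decode (BOM_8 & "Ada"));

   --  A UTF-16BE BOM in front of input declared as UTF-8 is a
   --  different initial BOM, so Encoding_Error is propagated.
   begin
      Ada.Text_IO.Put_Line (Enc.Decode (BOM_16BE & "Ada", UTF_8));
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: BOM does not match UTF_8");
   end;
end Show_Input_BOM;
```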
53/3
{AI05-0137-2}
The exception Encoding_Error is also propagated
in the following situations:
54/3
By a Decode function when a UTF-encoded string contains an invalid
encoding sequence.
55/3
By a Decode function when the expected encoding is UTF-16BE or UTF-16LE
and the input string has an odd length.
56/3
{AI05-0262-1}
By a Decode function yielding a String when the
decoding of a sequence results in a code point whose value exceeds 16#FF#.
56.a/3
Discussion: We use "code point" here as that is what ISO 10646:2011 does and
this text is directly referring to the contents of that standard; elsewhere
in this Standard we have used "code position" to represent the same concept.
57/3
By a Decode function yielding
a Wide_String when the decoding of a sequence results in a code point
whose value exceeds 16#FFFF#.
58/3
{AI05-0262-1}
By an Encode function taking a Wide_String as input
when an invalid character appears in the input. In particular, the characters
whose position is in the range 16#D800# .. 16#DFFF# are invalid because
they conflict with UTF-16 surrogate encodings, and the characters whose
position is 16#FFFE# or 16#FFFF# are also invalid because they conflict
with BOM codes.
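One of the situations listed above, a Decode function yielding a String for a code point above 16#FF#, can be sketched as follows. The example is non-normative; the procedure name Show_Range_Error is hypothetical, and the euro sign (code point 16#20AC#) is just a convenient sample value.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Range_Error is
   package S  renames Ada.Strings.UTF_Encoding.Strings;
   package WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Code point 16#20AC# encoded as UTF-8; it does not fit in Character.
   Euro : constant UTF_8_String :=
     WW.Encode (Wide_Wide_String'(1 => Wide_Wide_Character'Val (16#20AC#)));

   --  Decoding to Wide_Wide_String has room for the code point.
   Wide : constant Wide_Wide_String := WW.Decode (Euro);
begin
   Ada.Text_IO.Put_Line ("Decoded length:" & Integer'Image (Wide'Length));

   begin
      declare
         --  Decoding to String cannot represent 16#20AC#, so this
         --  declaration is expected to propagate Encoding_Error.
         Narrow : constant String := S.Decode (Euro);
      begin
         Ada.Text_IO.Put_Line ("Unexpected result: " & Narrow);
      end;
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: code point exceeds 16#FF#");
   end;
end Show_Range_Error;
```

The handler sits on the enclosing block because an exception raised while elaborating Narrow is not handled by the block that declares it.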
59/3
{AI05-0137-2}
function Encoding (Item    : UTF_String;
                   Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
60/3
Inspects a UTF_String value to determine whether it starts with a BOM for
UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the
BOM; returns the value of Default otherwise.
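A minimal sketch of calling Encoding, with and without a BOM and with an explicit Default, might look like this. The procedure name Show_Encoding and the sample data are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
   --  "A" in UTF-16LE, preceded by the little-endian BOM.
   Data : constant UTF_String :=
     BOM_16LE & Character'Val (16#41#) & Character'Val (0);
begin
   --  A leading BOM determines the result.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (Data)));                  --  UTF_16LE

   --  Without a BOM, the value of Default is returned.
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding ("no BOM")));              --  UTF_8
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding ("no BOM", UTF_16BE)));    --  UTF_16BE
end Show_Encoding;
```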
61/3
{AI05-0137-2}
function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
62/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in one of these three schemes as
specified by Output_Scheme.
63/3
{AI05-0137-2}
function Convert (Item         : UTF_String;
                  Input_Scheme : Encoding_Scheme;
                  Output_BOM   : Boolean := False)
   return UTF_16_Wide_String;
64/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in UTF-16.
65/3
{AI05-0137-2}
function Convert (Item       : UTF_8_String;
                  Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
66/3
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
67/3
{AI05-0137-2}
function Convert (Item          : UTF_16_Wide_String;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
68/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8,
UTF-16LE, or UTF-16BE as specified by Output_Scheme.
69/3
{AI05-0137-2}
function Convert (Item       : UTF_16_Wide_String;
                  Output_BOM : Boolean := False) return UTF_8_String;
70/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
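The Convert overloads can be combined into a round trip, as in the following non-normative sketch. The procedure name Show_Convert is hypothetical, and the local Assertion_Policy pragma only ensures that the assertions are checked.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Convert is
   pragma Assertion_Policy (Check);

   package Conv renames Ada.Strings.UTF_Encoding.Conversions;
   package Enc  renames Ada.Strings.UTF_Encoding.Strings;

   --  Four characters; the last one (16#E9#) needs two bytes in UTF-8.
   Source : constant UTF_8_String :=
     Enc.Encode ("caf" & Character'Val (16#E9#));

   --  UTF-8 to UTF-16 held in a Wide_String (one element per code unit).
   As_16 : constant UTF_16_Wide_String := Conv.Convert (Source);

   --  UTF-8 to UTF-16BE held in a String (two elements per code unit).
   As_BE : constant UTF_String := Conv.Convert (Source, UTF_8, UTF_16BE);

   --  UTF-16 back to UTF-8.
   Back : constant UTF_8_String := Conv.Convert (As_16);
begin
   pragma Assert (Source'Length = 5);
   pragma Assert (As_16'Length = 4);
   pragma Assert (As_BE'Length = 8);
   pragma Assert (Back = Source);

   Ada.Text_IO.Put_Line ("Round trip succeeded");
end Show_Convert;
```

In each declaration the expected result type and the parameter profile together select one Convert overload, so no qualification is needed.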
71/3
{AI05-0137-2}
function Encode (Item          : String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
72/3
{AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
73/3
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False) return UTF_8_String;
74/3
Returns the value of Item encoded in UTF-8.
75/3
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
76/3
Returns the value of Item encoded in UTF-16.
77/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return String;
78/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
79/3
{AI05-0137-2}
function Decode (Item : UTF_8_String) return String;
80/3
Returns the result of decoding Item, which is encoded in UTF-8.
81/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return String;
82/3
Returns the result of decoding Item, which is encoded in UTF-16.
83/3
{AI05-0137-2}
function Encode (Item          : Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
84/3
{AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
85/3
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False) return UTF_8_String;
86/3
Returns the value of Item encoded in UTF-8.
87/3
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
88/3
Returns the value of Item encoded in UTF-16.
89/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_String;
90/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
91/3
{AI05-0137-2}
function Decode (Item : UTF_8_String) return Wide_String;
92/3
Returns the result of decoding Item, which is encoded in UTF-8, as the
corresponding Wide_String value.
93/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return Wide_String;
94/3
Returns the result of decoding Item, which is encoded in UTF-16.
95/3
{AI05-0137-2}
function Encode (Item          : Wide_Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False) return UTF_String;
96/3
{AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Output_Scheme.
97/3
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False) return UTF_8_String;
98/3
Returns the value of Item encoded in UTF-8.
99/3
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False) return UTF_16_Wide_String;
100/3
Returns the value of Item encoded in UTF-16.
101/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
102/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
103/3
{AI05-0137-2}
function Decode (Item : UTF_8_String) return Wide_Wide_String;
104/3
Returns the result of decoding Item, which is encoded in UTF-8, as the
corresponding Wide_Wide_String value.
105/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
106/3
Returns the result of decoding Item, which is encoded in UTF-16.
Implementation Advice
107/3
{AI05-0137-2}
If an implementation supports other encoding schemes,
another similar child of Ada.Strings should be defined.
107.a.1/3
Implementation Advice:
If an implementation supports other
string encoding schemes, a child of Ada.Strings similar to UTF_Encoding
should be defined.
108/3
17 {AI05-0137-2}
A BOM (Byte-Order Mark, code position 16#FEFF#)
can be included in a file or other entity to indicate the encoding; it
is skipped when decoding. Typically, only the first line of a file or
other entity contains a BOM. When decoding, the Encoding function can
be called on the first line to determine the encoding; this encoding
will then be used in subsequent calls to Decode to convert all of the
lines to an internal format.
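The usage pattern described in this note can be sketched as follows. The two string constants stand in for lines read from a file (hypothetical data), Wide_String is assumed as the internal format, and the procedure name Show_Line_Decoding is hypothetical.

```ada
with Ada.Wide_Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Line_Decoding is
   package WS renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  Stand-ins for lines read from a file; only the first has a BOM.
   First_Line : constant UTF_String := BOM_8 & "first line";
   Next_Line  : constant UTF_String := "second line";

   --  Determine the scheme once, from the first line. UTF_8 is the
   --  default when no BOM is present.
   Scheme : constant Encoding_Scheme := Encoding (First_Line);
begin
   --  The matching BOM on the first line is skipped by Decode.
   Ada.Wide_Text_IO.Put_Line (WS.Decode (First_Line, Scheme));

   --  Later lines carry no BOM and are decoded with the same scheme.
   Ada.Wide_Text_IO.Put_Line (WS.Decode (Next_Line, Scheme));
end Show_Line_Decoding;
```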
Extensions to Ada 2005
108.a/3
{AI05-0137-2}
The packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings,
and Strings.UTF_Encoding.Wide_Wide_Strings are new.
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe