Annotated Ada Reference Manual

 A.4.11 String Encoding

1/3
{AI05-0137-2} Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.

Static Semantics

2/3
{AI05-0137-2} The encoding library packages have the following declarations:
3/3
{AI05-0137-2} package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);
4/3
   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
5/3
   subtype UTF_String is String;
6/3
   subtype UTF_8_String is String;
7/3
   subtype UTF_16_Wide_String is Wide_String;
8/3
   Encoding_Error : exception;
9/3
   BOM_8    : constant UTF_8_String :=
                Character'Val(16#EF#) &
                Character'Val(16#BB#) &
                Character'Val(16#BF#);
10/3
   BOM_16BE : constant UTF_String :=
                Character'Val(16#FE#) &
                Character'Val(16#FF#);
11/3
   BOM_16LE : constant UTF_String :=
                Character'Val(16#FF#) &
                Character'Val(16#FE#);
12/3
   BOM_16   : constant UTF_16_Wide_String :=
               (1 => Wide_Character'Val(16#FEFF#));
13/3
   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
14/3
end Ada.Strings.UTF_Encoding;
15/3
{AI05-0137-2} package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);
16/3
   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
17/3
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
18/3
   function Convert (Item          : UTF_8_String;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
19/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
20/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_BOM    : Boolean := False) return UTF_8_String;
21/3
end Ada.Strings.UTF_Encoding.Conversions;
22/3
{AI05-0137-2} package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure (Strings);
23/3
   -- Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
24/3
   function Encode (Item       : String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
25/3
   function Encode (Item       : String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
26/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return String;
27/3
   function Decode (Item : UTF_8_String) return String;
28/3
   function Decode (Item : UTF_16_Wide_String) return String;
29/3
end Ada.Strings.UTF_Encoding.Strings;
30/3
{AI05-0137-2} package Ada.Strings.UTF_Encoding.Wide_Strings is
   pragma Pure (Wide_Strings);
31/3
   -- Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
32/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
33/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
34/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_String;
35/3
   function Decode (Item : UTF_8_String) return Wide_String;
36/3
   function Decode (Item : UTF_16_Wide_String) return Wide_String;
37/3
end Ada.Strings.UTF_Encoding.Wide_Strings;
38/3
{AI05-0137-2} package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
   pragma Pure (Wide_Wide_Strings);
39/3
   -- Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
40/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
41/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
42/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
43/3
   function Decode (Item : UTF_8_String) return Wide_Wide_String;
44/3
   function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
45/3
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
46/3
 {AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 in 8-bit, big-endian order; and UTF_16LE corresponds to the UTF-16 encoding scheme in 8-bit, little-endian order.
47/3
 {AI05-0137-2} The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.
48/3
 {AI05-0137-2} {AI05-0262-1} The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.
49/3
 {AI05-0137-2} {AI05-0262-1} Each of the Convert and Encode functions returns a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. A BOM is included at the start of the returned string if the Output_BOM parameter is set to True. The lower bound of the returned string is 1.
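The rules in this paragraph can be illustrated with a minimal sketch. The procedure name Encode_Demo and the sample text are hypothetical, and the assertions only take effect when assertion checking is enabled in the implementation at hand.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Encode_Demo is  -- hypothetical name, for illustration only
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_Str renames Ada.Strings.UTF_Encoding.Strings;

   -- An unencoded String; position 16#E9# is Latin-1 'e' with acute accent.
   Plain : constant String := "caf" & Character'Val (16#E9#);

   -- The expected type selects the overload that returns UTF_8_String.
   Encoded : constant UTF.UTF_8_String :=
     UTF_Str.Encode (Plain, Output_BOM => True);
begin
   -- The lower bound of the result is 1.
   pragma Assert (Encoded'First = 1);

   -- Output_BOM => True places BOM_8 (3 bytes) at the start.
   pragma Assert (Encoded (1 .. 3) = UTF.BOM_8);

   -- 3 BOM bytes + "caf" + the 2-byte sequence C3 A9 for 16#E9#.
   pragma Assert (Encoded'Length = 8);
   pragma Assert (Encoded (7) = Character'Val (16#C3#));
   pragma Assert (Encoded (8) = Character'Val (16#A9#));

   Ada.Text_IO.Put_Line ("Encoded length:" & Integer'Image (Encoded'Length));
end Encode_Demo;
```

Leaving Output_BOM at its default of False would give the same bytes without the three-byte prefix.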
50/3
 {AI05-0262-1} Each of the Encode functions takes a String, Wide_String, or Wide_Wide_String Item parameter that is assumed to be an array of unencoded characters. Each of the Convert functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter that is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter.
51/3
 {AI05-0137-2} {AI05-0262-1} Each of the Decode functions takes a UTF_String (respectively UTF_8_String and UTF_16_Wide_String) Item parameter which is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String, or Wide_Wide_String value. The lower bound of the returned string is 1.
52/3
 {AI05-0137-2} {AI05-0262-1} For each of the Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.
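As a sketch of the BOM rule just stated, the following hypothetical procedure decodes one input whose BOM matches the expected scheme and one whose BOM does not. The names BOM_Demo, Matching, and Mismatched are assumptions made for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure BOM_Demo is  -- hypothetical name, for illustration only
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_Str renames Ada.Strings.UTF_Encoding.Strings;

   -- UTF-8 BOM followed by UTF-8 text.
   Matching : constant UTF.UTF_String := UTF.BOM_8 & "abc";

   -- UTF-16BE BOM followed by 'a' in UTF-16BE (bytes 00 61).
   Mismatched : constant UTF.UTF_String :=
     UTF.BOM_16BE & Character'Val (0) & 'a';
begin
   -- A BOM that matches the expected scheme is ignored.
   pragma Assert (UTF_Str.Decode (Matching, UTF.UTF_8) = "abc");

   -- A different initial BOM is expected to propagate Encoding_Error.
   declare
      Result : constant String := UTF_Str.Decode (Mismatched, UTF.UTF_8);
   begin
      Ada.Text_IO.Put_Line ("Not expected to be reached: " & Result);
   end;
exception
   when UTF.Encoding_Error =>
      Ada.Text_IO.Put_Line ("Encoding_Error: BOM does not match UTF-8");
end BOM_Demo;
```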
53/3
 {AI05-0137-2} The exception Encoding_Error is also propagated in the following situations: 
54/3
By a Decode function when a UTF encoded string contains an invalid encoding sequence.
55/3
By a Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.
56/3
{AI05-0262-1} By a Decode function yielding a String when the decoding of a sequence results in a code point whose value exceeds 16#FF#.
56.a/3
Discussion: We use "code point" here as that is what ISO 10646:2011 does and this text is directly referring to the contents of that standard; elsewhere in this Standard we have used "code position" to represent the same concept. 
57/3
By a Decode function yielding a Wide_String when the decoding of a sequence results in a code point whose value exceeds 16#FFFF#.
58/3
{AI05-0262-1} By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular, the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes. 
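One of the situations listed above, a decoded code point that does not fit the result type, might be exercised as follows. This is a minimal sketch; Range_Demo and Euro are hypothetical names, and the byte values are the UTF-8 encoding of U+20AC.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Range_Demo is  -- hypothetical name, for illustration only
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_Str renames Ada.Strings.UTF_Encoding.Strings;
   package UTF_WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- U+20AC (euro sign) encoded in UTF-8 as the bytes E2 82 AC.
   Euro : constant UTF.UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);

   -- Wide_Wide_String can hold the code point, so this decode succeeds.
   Wide : constant Wide_Wide_String := UTF_WWS.Decode (Euro);
begin
   pragma Assert (Wide'Length = 1);
   pragma Assert (Wide_Wide_Character'Pos (Wide (1)) = 16#20AC#);

   -- 16#20AC# exceeds 16#FF#, so decoding to String should fail.
   declare
      Narrow : constant String := UTF_Str.Decode (Euro);
   begin
      Ada.Text_IO.Put_Line
        ("Not expected to be reached:" & Integer'Image (Narrow'Length));
   end;
exception
   when UTF.Encoding_Error =>
      Ada.Text_IO.Put_Line ("Encoding_Error: code point exceeds 16#FF#");
end Range_Demo;
```

Choosing the Wide_Wide_Strings child avoids both range limits, at the cost of a wider character type.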
59/3
{AI05-0137-2} function Encoding (Item    : UTF_String;
                   Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
60/3
Inspects a UTF_String value to determine whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the BOM; returns the value of Default otherwise.
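A short sketch of how Encoding behaves with and without a leading BOM. The procedure name Detect_Encoding and the sample strings are hypothetical; the assertions only take effect when assertion checking is enabled.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;

procedure Detect_Encoding is  -- hypothetical name, for illustration only
   package UTF renames Ada.Strings.UTF_Encoding;
   use type UTF.Encoding_Scheme;  -- makes "=" on Encoding_Scheme visible

   -- "Hi" in UTF-16LE, preceded by the UTF-16LE BOM (bytes FF FE).
   With_BOM : constant UTF.UTF_String :=
     UTF.BOM_16LE & 'H' & Character'Val (0) & 'i' & Character'Val (0);

   -- No BOM at all: Encoding falls back to its Default parameter.
   Without_BOM : constant UTF.UTF_String := "plain";
begin
   pragma Assert (UTF.Encoding (With_BOM) = UTF.UTF_16LE);
   pragma Assert (UTF.Encoding (Without_BOM) = UTF.UTF_8);
   pragma Assert
     (UTF.Encoding (Without_BOM, Default => UTF.UTF_16BE) = UTF.UTF_16BE);

   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image (UTF.Encoding (With_BOM)));
end Detect_Encoding;
```

Note that a result equal to Default does not by itself show that the input is encoded in that scheme; it only shows that no BOM was found.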
61/3
{AI05-0137-2} function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
62/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme) encoded in one of these three schemes as specified by Output_Scheme.
63/3
{AI05-0137-2} function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_BOM    : Boolean := False)
   return UTF_16_Wide_String;
64/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme) encoded in UTF-16.
65/3
{AI05-0137-2} function Convert (Item          : UTF_8_String;
                  Output_BOM    : Boolean := False)
   return UTF_16_Wide_String;
66/3
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
67/3
{AI05-0137-2} function Convert (Item          : UTF_16_Wide_String;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
68/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
69/3
{AI05-0137-2} function Convert (Item          : UTF_16_Wide_String;
                  Output_BOM    : Boolean := False) return UTF_8_String;
70/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
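The Convert overloads above can be compared in a minimal sketch. Convert_Demo and the object names are hypothetical; the input is the two-byte UTF-8 encoding of the character at position 16#E9#.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Convert_Demo is  -- hypothetical name, for illustration only
   package UTF  renames Ada.Strings.UTF_Encoding;
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   -- Position 16#E9# encoded in UTF-8 (bytes C3 A9).
   Source : constant UTF.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   -- Byte-oriented result: UTF-16BE stored in a String of 8-bit values.
   As_BE : constant UTF.UTF_String :=
     Conv.Convert (Source,
                   Input_Scheme  => UTF.UTF_8,
                   Output_Scheme => UTF.UTF_16BE);

   -- 16-bit result: UTF-16 stored in a Wide_String.
   As_Wide : constant UTF.UTF_16_Wide_String := Conv.Convert (Source);

   -- Round trip from UTF-16 back to UTF-8.
   Back : constant UTF.UTF_8_String := Conv.Convert (As_Wide);
begin
   -- Big-endian order: high byte 00 first, then low byte E9.
   pragma Assert (As_BE'Length = 2);
   pragma Assert (As_BE (1) = Character'Val (0));
   pragma Assert (As_BE (2) = Character'Val (16#E9#));

   pragma Assert (As_Wide'Length = 1);
   pragma Assert (Wide_Character'Pos (As_Wide (1)) = 16#E9#);

   pragma Assert (Back = Source);

   Ada.Text_IO.Put_Line ("UTF-16BE bytes:" & Integer'Image (As_BE'Length));
end Convert_Demo;
```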
71/3
{AI05-0137-2} function Encode (Item          : String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
72/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
73/3
{AI05-0137-2} function Encode (Item       : String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
74/3
Returns the value of Item encoded in UTF-8.
75/3
{AI05-0137-2} function Encode (Item       : String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
76/3
Returns the value of Item encoded in UTF-16.
77/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return String;
78/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
79/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return String;
80/3
Returns the result of decoding Item, which is encoded in UTF-8.
81/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return String;
82/3
Returns the result of decoding Item, which is encoded in UTF-16.
83/3
{AI05-0137-2} function Encode (Item          : Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
84/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
85/3
{AI05-0137-2} function Encode (Item       : Wide_String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
86/3
Returns the value of Item encoded in UTF-8.
87/3
{AI05-0137-2} function Encode (Item       : Wide_String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
88/3
Returns the value of Item encoded in UTF-16.
89/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_String;
90/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
91/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return Wide_String;
92/3
Returns the result of decoding Item, which is encoded in UTF-8, as the corresponding Wide_String value.
93/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return Wide_String;
94/3
Returns the result of decoding Item, which is encoded in UTF-16.
95/3
{AI05-0137-2} function Encode (Item          : Wide_Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
96/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
97/3
{AI05-0137-2} function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
98/3
Returns the value of Item encoded in UTF-8.
99/3
{AI05-0137-2} function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
100/3
Returns the value of Item encoded in UTF-16.
101/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
102/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
103/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return Wide_Wide_String;
104/3
Returns the result of decoding Item, which is encoded in UTF-8, as the corresponding Wide_Wide_String value.
105/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
106/3
Returns the result of decoding Item, which is encoded in UTF-16.

Implementation Advice

107/3
  {AI05-0137-2} If an implementation supports other encoding schemes, another similar child of Ada.Strings should be defined. 
107.a.1/3
Implementation Advice: If an implementation supports other string encoding schemes, a child of Ada.Strings similar to UTF_Encoding should be defined.
NOTES
108/3
17  {AI05-0137-2} A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format. 
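The approach described in this note might look like the following sketch. The two constant "lines" stand in for text read from a file and are hypothetical, as is the procedure name Lines_Demo.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Lines_Demo is  -- hypothetical name, for illustration only
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   -- Hypothetical input: only the first line carries a BOM.
   Line_1 : constant UTF.UTF_String := UTF.BOM_8 & "first";
   Line_2 : constant UTF.UTF_String := "second";

   -- Determine the scheme once, from the first line.
   Scheme : constant UTF.Encoding_Scheme := UTF.Encoding (Line_1);

   -- Reuse that scheme for every line; the BOM on Line_1 is skipped.
   Text_1 : constant Wide_Wide_String := UTF_WWS.Decode (Line_1, Scheme);
   Text_2 : constant Wide_Wide_String := UTF_WWS.Decode (Line_2, Scheme);
begin
   pragma Assert (Text_1 = "first");
   pragma Assert (Text_2 = "second");

   Ada.Text_IO.Put_Line (UTF.Encoding_Scheme'Image (Scheme));
end Lines_Demo;
```

Wide_Wide_String is used here as the internal format because it can represent every code point; a program restricted to Latin-1 text could decode to String instead.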

Extensions to Ada 2005

108.a/3
{AI05-0137-2} The packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings are new. 

Ada 2005 and 2012 Editions sponsored in part by Ada-Europe