Is the default encoding of XML UTF-8 or UTF-16?

Published: 2020-12-16 08:00:46  Category: Encyclopedia  Source: compiled from the web


The OpenTag FAQ states:

If no encoding declaration is present
in the XML document (and no external
encoding declaration mechanism such as
the HTTP header is available), the
assumed encoding of an XML document
depends on the presence of the
Byte-Order-Mark (BOM).

The BOM is a Unicode special marker
placed at the top of the file that
indicate its encoding. The BOM is
optional for UTF-8.


Can anyone explain the paragraph above?

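Unless the document is in UTF-8 or UTF-16, you have to use a line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If no encoding is specified, a byte order mark (BOM) may be present. If a BOM for UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM is optional for UTF-8.)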
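The lookup order described above (BOM first, then the encoding declaration, then UTF-8 as the fallback) can be sketched in Python. This is only an illustration under stated assumptions: `guess_xml_encoding` is a hypothetical helper, not part of any XML library, and it does not handle BOM-less UTF-16 or UTF-32 input, which a real parser also has to detect.

```python
import codecs
import re

# BOM signatures. The UTF-32 entries come first because the UTF-32LE BOM
# (ff fe 00 00) starts with the same two bytes as the UTF-16LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches encoding="..." inside an XML declaration stored in an
# ASCII-compatible encoding.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: best-guess encoding for the start of an XML file."""
    # 1. A BOM, if present, decides the encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise use the encoding declaration, if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: assume UTF-8.
    return "utf-8"


print(guess_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1" ?><a/>'))
print(guess_xml_encoding(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")))
print(guess_xml_encoding(b"<a/>"))
```

In this sketch a BOM takes priority over the declaration; a stricter implementation might also check that the two agree.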

Edit:

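The BOM is an invisible character, but there is no need to see it: applications handle it automatically. When you use Windows Notepad, you can choose the encoding when you save a file. Notepad then inserts the BOM at the start of the file automatically. When you reopen the file later, Notepad recognizes the BOM and uses the correct encoding to read the file. There is no need to modify the BOM; if you do, the characters can take on a different meaning, so the text will no longer be the same.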

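I will try to explain with an example. Consider a text file containing only the characters "test". By default Notepad uses the ANSI encoding, and when you view the text file in hex mode it looks like this: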

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor such as Frhed to see this.)

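There is no BOM at the start of this file. It could not have one, because the character used for the BOM does not exist in the ANSI encoding. (Because there is no BOM, an editor that does not support the ANSI encoding will treat the file as UTF-8.)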
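A small Python check illustrates why an ANSI file cannot carry a BOM. It assumes that "ANSI" here means Windows-1252 (Python codec `cp1252`), which is the usual case on Western-language Windows but is not stated in the text; `can_encode` is a hypothetical helper written for this example.

```python
# Assumption: "ANSI" is Windows-1252 (Python codec "cp1252").
BOM_CHAR = "\ufeff"  # the Unicode character that serves as the byte order mark


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character of text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ansi_bytes = "test".encode("cp1252")
print(ansi_bytes.hex(" "))             # 74 65 73 74
print(can_encode(BOM_CHAR, "cp1252"))  # False: no BOM character in ANSI
print(ansi_bytes.decode("utf-8"))      # test (pure ASCII is also valid UTF-8)
```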

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you open this file with a text editor that does not support UTF-8, you will see these characters: "ï»¿".)
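The same three bytes can be reproduced in Python, as a rough sketch rather than a description of what Notepad does internally. The `utf-8-sig` codec writes and strips the UTF-8 BOM, and decoding the bytes as Latin-1 stands in for an editor that does not understand UTF-8 (that choice of fallback encoding is an assumption).

```python
import codecs

data = "test".encode("utf-8-sig")  # "utf-8-sig" prepends the UTF-8 BOM
print(data.hex(" "))               # ef bb bf 74 65 73 74
print(data.startswith(codecs.BOM_UTF8))  # True

# An editor that assumes Latin-1 shows the BOM bytes as three visible characters.
as_latin1 = data.decode("latin-1")
print(as_latin1)                   # ï»¿test

# Plain "utf-8" keeps the BOM as the character U+FEFF; "utf-8-sig" removes it.
kept = data.decode("utf-8")
stripped = data.decode("utf-8-sig")
print(repr(kept))                  # '\ufefftest'
print(repr(stripped))              # 'test'
```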

Notepad can also save the file as "Unicode", which means little-endian UTF-16 (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as "Unicode (big endian)" (UTF-16BE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
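Both UTF-16 dumps above can be rebuilt in a few lines of Python. This is a minimal sketch: the BOM is prepended by hand because the `utf-16-le` and `utf-16-be` codecs do not write one themselves.

```python
import codecs

text = "test"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(le.hex(" "))  # ff fe 74 00 65 00 73 00 74 00
print(be.hex(" "))  # fe ff 00 74 00 65 00 73 00 74

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
print(le.decode("utf-16"))  # test
print(be.decode("utf-16"))  # test
```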

Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save it as "Unicode (big endian)", the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

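As you can see, the word "test" in UTF-16LE is stored as exactly the same bytes as "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without the BOM, you would have to guess.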
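The ambiguity is easy to reproduce. The sketch below takes the eight payload bytes from the dumps above and decodes them both ways; the variable names are only illustrative.

```python
# The eight bytes that follow the BOM in both dumps.
payload = bytes.fromhex("74 00 65 00 73 00 74 00")

as_le = payload.decode("utf-16-le")
as_be = payload.decode("utf-16-be")
print(as_le)  # test
print(as_be)  # 琀攀猀琀

# With a BOM in front, the generic "utf-16" codec no longer has to guess.
with_le_bom = (b"\xff\xfe" + payload).decode("utf-16")
with_be_bom = (b"\xfe\xff" + payload).decode("utf-16")
print(with_le_bom)  # test
print(with_be_bom)  # 琀攀猀琀
```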

(Editor: 李大同)
