
Regular expressions - Unicode characters with asymmetric case conversion. Why?

Published: 2020-12-13 22:55:45 | Category: Encyclopedia | Source: compiled from the web
Why do the following three characters give asymmetric toLower / toUpper results?
/**
  * Written in the Scala programming language, typed into the Scala REPL.
  * Results commented accordingly.
  */
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true: "1e9e" == "1e9e"
'\u1e9e'.toLower.toHexString == "df" // true: "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // false: "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // true: "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // true: "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // false: "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // true: "130" == "130"
'\u0130'.toLower.toHexString == "69" // true: "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // false: "130" != "49"
For the first one, there is this explanation:

In the German language, the Sharp S ("ß" or U+00DF) is a lowercase letter, and it capitalizes to the letters "SS".

In other words, U+1E9E lowercases to U+00DF, but the uppercase of U+00DF is not U+1E9E.
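The difference between per-character and per-string case mapping can be sketched as follows. This is a minimal illustration, assuming a JVM-based Scala where `Char.toUpper` delegates to `java.lang.Character.toUpperCase` and `String.toUpperCase` applies the full mapping; the object name `SharpSDemo` is made up for this example.

```scala
import java.util.Locale

// Hypothetical demo object; the name is not from the original post.
object SharpSDemo {
  // Char-level mapping is one char in, one char out, so U+00DF cannot
  // turn into the two letters "SS" and is returned unchanged.
  val charUpper: Char = '\u00df'.toUpper

  // String-level mapping may change the length: "ß" becomes "SS".
  val stringUpper: String = "\u00df".toUpperCase(Locale.ROOT)

  // The capital sharp S still lowercases to the ordinary sharp S.
  val capitalSharpSLower: Char = '\u1e9e'.toLower
}
```

Neither path maps U+00DF back to U+1E9E, which is why the round trip in the question does not return to its starting point.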

For the second one, U+212A (KELVIN SIGN) lowercases to U+006B (LATIN SMALL LETTER K). The uppercase of U+006B is U+004B (LATIN CAPITAL LETTER K). That one makes sense to me.

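For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani letter that lowercases to U+0069 (LATIN SMALL LETTER I). I imagine that if you were somehow in a Turkish/Azerbaijani locale, you would get the proper uppercase version of U+0069, but that might not necessarily be universal.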
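The locale guess can be checked with the locale-aware String methods rather than the Char methods used in the question. This is a sketch assuming a JVM whose `String.toUpperCase(Locale)` and `String.toLowerCase(Locale)` implement the Turkish special casing described in their Javadoc; `DottedIDemo` is a hypothetical name.

```scala
import java.util.Locale

// Hypothetical demo object; the name is not from the original post.
object DottedIDemo {
  private val turkish: Locale = Locale.forLanguageTag("tr")

  // In a Turkish locale, "i" uppercases to U+0130 (dotted capital I).
  val upperTurkish: String = "i".toUpperCase(turkish)

  // In the locale-neutral ROOT locale, "i" uppercases to plain "I".
  val upperRoot: String = "i".toUpperCase(Locale.ROOT)

  // In a Turkish locale, U+0130 lowercases to "i", so the round trip
  // through lower case and back is symmetric there.
  val lowerTurkish: String = "\u0130".toLowerCase(turkish)
  val roundTripTurkish: String = lowerTurkish.toUpperCase(turkish)
}
```

The Char-level `toLower` and `toUpper` take no locale argument, so they cannot apply this language-specific rule.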

Characters do not necessarily have symmetric case mappings.

Edit: In response to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, "Unicode Normalization Forms," these three characters will be replaced by their regular equivalents.

In other words, you shouldn't really be using U+212A; you should use U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
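The replacement described in the spec excerpt can be observed with the JDK's `java.text.Normalizer`. The sketch below assumes NFC as the normalization form (the quoted text does not name one) and uses a hypothetical object name, `KelvinDemo`.

```scala
import java.text.Normalizer

// Hypothetical demo object; the name is not from the original post.
object KelvinDemo {
  private def nfc(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFC)

  // U+212A KELVIN SIGN normalizes to U+004B LATIN CAPITAL LETTER K.
  val kelvin: String = nfc("\u212a")

  // U+2126 OHM SIGN normalizes to U+03A9 GREEK CAPITAL LETTER OMEGA.
  val ohm: String = nfc("\u2126")

  // U+212B ANGSTROM SIGN normalizes to U+00C5
  // LATIN CAPITAL LETTER A WITH RING ABOVE.
  val angstrom: String = nfc("\u212b")
}
```

Normalizing before changing case sidesteps the Kelvin asymmetry, because the text then contains only the regular letter K.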

