python – What are the upper and lower bounds of Chinese characters in UTF-8?
Published: 2020-12-20 12:19:28 Category: Python Source: compiled from the web
I want to create a set in Python containing the ord() values of all Chinese characters.

For English, the equivalent would be:

    english = set(range(ord('a'), ord('z') + 1)) | set(range(ord('A'), ord('Z') + 1))

Solution
From the Unicode Standard (v6.0, section 12.1):
Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

In addition to these blocks, there are a few small extensions:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To build a set of these ordinal values, you can do the following. The ranges are passed to set().union() rather than concatenated with +, because range objects cannot be added together in Python 3:

    chinese = set().union(
        range(0x4E00, 0xA000),    # CJK Unified Ideographs
        range(0x3400, 0x4DC0),    # Extension A
        range(0x20000, 0x2A6E0),  # Extension B
        range(0x2A700, 0x2B740),  # Extension C
        range(0x2B740, 0x2B820),  # Extension D
        range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
        range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
        range(0x9FA6, 0x9FCC),    # small extensions to the URO (Table 12-3)
    )

Note, however, that this set contains more than 75,000 code points, so it may not be the most compact or efficient data structure.

Also, if you insist on using ord() on literal characters, you need the 32-bit Unicode literal form (\U followed by eight hex digits) for code points above U+FFFF:

    >>> ord(u'\U0002F800')
    194560
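Since the full set is large, one alternative is to keep only the block boundaries and test membership with a binary search. The following is a minimal sketch, assuming Python 3; the names HAN_RANGES and is_han are hypothetical and not part of the original answer. The small extensions in Table 12-3 all fall inside 4E00–9FFF, so they need no separate entry.

```python
from bisect import bisect_right

# Inclusive (first, last) code point ranges from Table 12-2, sorted by start.
# HAN_RANGES and is_han are illustrative names, not from the original answer.
HAN_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]
_STARTS = [first for first, _ in HAN_RANGES]


def is_han(ch):
    """Return True if the single character ch lies in one of HAN_RANGES."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= HAN_RANGES[i][1]
```

For example, [c for c in "abc中文" if is_han(c)] keeps only the two Han characters. This stores seven pairs instead of tens of thousands of integers, at the cost of a short search per lookup.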