Regular Expressions for Chinese Characters

Published: 2020-12-14 04:19:34  Category: Encyclopedia  Source: compiled from the web

Regular expression for Chinese text: [\u2E80-\uFE4F]+


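Two patterns are currently popular online:
/^[\u0391-\uFFE5]+$/
/^[\u4E00-\u9FA5]+$/

The second clearly covers the smaller range, and testing shows it is incomplete. '\u9FA6', the first code point past its upper bound, still lies inside the CJK Unified Ideographs block (U+4E00 -- U+9FFF). In my test it displayed as "囗", which is probably a missing-glyph box rather than the actual character 囗 (U+56D7). Either way, the second pattern does not cover everything it needs to.
The first pattern ends at '\uFFE5', the fullwidth yen sign '￥'; the next code point, '\uFFE6', is the fullwidth won sign '￦'. So I think the first pattern is roughly right at the upper end. Its lower bound, however, is '\u0391', which is 'Α': neither the halfwidth ASCII A (U+0041) nor the fullwidth Ａ (U+FF21), but the Greek capital letter alpha. The first range is therefore wider than necessary, especially at the start.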
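The observations above can be checked with a short script. This is only a sketch: the names `wide` and `narrow` and the sample list are chosen here for illustration and are not part of the original patterns.

```javascript
// The two widely circulated patterns, written with explicit \u escapes.
const wide = /^[\u0391-\uFFE5]+$/;
const narrow = /^[\u4E00-\u9FA5]+$/;

// Sample inputs; the labels are only used in the printed report.
const samples = [
  ["中文", "common Han characters"],
  ["\u9FA6", "U+9FA6, one past the narrow range"],
  ["\u0391", "U+0391, Greek capital alpha"],
  ["A", "U+0041, ASCII letter A"],
  ["\uFF21", "U+FF21, fullwidth letter A"],
  ["\uFFE5", "U+FFE5, fullwidth yen sign"],
  ["\uFFE6", "U+FFE6, one past the wide range"],
];

for (const [text, label] of samples) {
  console.log(label, "| wide:", wide.test(text), "| narrow:", narrow.test(text));
}
```

The wide pattern accepts the Greek alpha and the fullwidth A, neither of which is a Chinese character, while the narrow pattern rejects U+9FA6.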
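So I consulted the Unicode block table.
It turns out that Chinese characters are laid out rather unusually. They are not grouped into one contiguous block the way Hebrew is (U+0590 -- U+05FF). Instead they are split across many smaller blocks, and because many of the characters are shared by Chinese, Japanese, and Korean, the blocks that hold them are generally the ones labeled CJK.
Going through the table, the first CJK block starts at U+2E80 and the last one ends at U+FE4F. Note that this span also takes in the non-Han blocks that sit between them, such as Hiragana, Katakana, Hangul Syllables, and the Private Use Area, so it is a broad match rather than a Han-only one. The final conclusion is therefore:
/^[\u2E80-\uFE4F]+$/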
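As a rough check of that conclusion, the sketch below lists the blocks in this article's table whose names contain "CJK", takes the lowest start and highest end, and builds the pattern from them. The names `cjkBlocks`, `esc`, and `cjkRange` are illustrative assumptions, not anything from the original post.

```javascript
// Blocks from the article's table whose names contain "CJK": [start, end, name].
const cjkBlocks = [
  [0x2E80, 0x2EFF, "CJK Radicals Supplement"],
  [0x3000, 0x303F, "CJK Symbols and Punctuation"],
  [0x31C0, 0x31EF, "CJK Strokes"],
  [0x3200, 0x32FF, "Enclosed CJK Letters and Months"],
  [0x3300, 0x33FF, "CJK Compatibility"],
  [0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"],
  [0x4E00, 0x9FFF, "CJK Unified Ideographs"],
  [0xF900, 0xFAFF, "CJK Compatibility Ideographs"],
  [0xFE30, 0xFE4F, "CJK Compatibility Forms"],
];

// Lowest start and highest end across all CJK-named blocks.
const first = Math.min(...cjkBlocks.map(([start]) => start));
const last = Math.max(...cjkBlocks.map(([, end]) => end));

// Format a code point as a four-digit \uXXXX escape for use in a pattern.
const esc = (cp) => "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");

const source = `^[${esc(first)}-${esc(last)}]+$`;
const cjkRange = new RegExp(source);

console.log(source);                    // ^[\u2E80-\uFE4F]+$
console.log(cjkRange.test("中文"));     // true
console.log(cjkRange.test("\u9FA6"));   // true
console.log(cjkRange.test("\u0391"));   // false
console.log(cjkRange.test("ひらがな")); // true: Hiragana lies inside the range
```

The last line shows the trade-off of a single contiguous range: it also accepts Japanese kana, which may or may not be acceptable depending on the use case.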
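Finally, here is the Unicode block table for reference (these are Unicode code point ranges, not UTF-8 byte sequences):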
U+0000 -- U+007F: Basic Latin
U+0080 -- U+00FF: Latin-1 Supplement
U+0100 -- U+017F: Latin Extended-A
U+0180 -- U+024F: Latin Extended-B
U+0250 -- U+02AF: IPA Extensions
U+02B0 -- U+02FF: Spacing Modifier Letters
U+0300 -- U+036F: Combining Diacritical Marks
U+0370 -- U+03FF: Greek and Coptic
U+0400 -- U+04FF: Cyrillic
U+0500 -- U+052F: Cyrillic Supplement
U+0530 -- U+058F: Armenian
U+0590 -- U+05FF: Hebrew
U+0600 -- U+06FF: Arabic
U+0700 -- U+074F: Syriac
U+0750 -- U+077F: Arabic Supplement
U+0780 -- U+07BF: Thaana
U+07C0 -- U+07FF: NKo
U+0900 -- U+097F: Devanagari
U+0980 -- U+09FF: Bengali
U+0A00 -- U+0A7F: Gurmukhi
U+0A80 -- U+0AFF: Gujarati
U+0B00 -- U+0B7F: Oriya
U+0B80 -- U+0BFF: Tamil
U+0C00 -- U+0C7F: Telugu
U+0C80 -- U+0CFF: Kannada
U+0D00 -- U+0D7F: Malayalam
U+0D80 -- U+0DFF: Sinhala
U+0E00 -- U+0E7F: Thai
U+0E80 -- U+0EFF: Lao
U+0F00 -- U+0FFF: Tibetan
U+1000 -- U+109F: Myanmar
U+10A0 -- U+10FF: Georgian
U+1100 -- U+11FF: Hangul Jamo
U+1200 -- U+137F: Ethiopic
U+1380 -- U+139F: Ethiopic Supplement
U+13A0 -- U+13FF: Cherokee
U+1400 -- U+167F: Unified Canadian Aboriginal Syllabics
U+1680 -- U+169F: Ogham
U+16A0 -- U+16FF: Runic
U+1700 -- U+171F: Tagalog
U+1720 -- U+173F: Hanunoo
U+1740 -- U+175F: Buhid
U+1760 -- U+177F: Tagbanwa
U+1780 -- U+17FF: Khmer
U+1800 -- U+18AF: Mongolian
U+1900 -- U+194F: Limbu
U+1950 -- U+197F: Tai Le
U+1980 -- U+19DF: New Tai Lue
U+19E0 -- U+19FF: Khmer Symbols
U+1A00 -- U+1A1F: Buginese
U+1B00 -- U+1B7F: Balinese
U+1D00 -- U+1D7F: Phonetic Extensions
U+1D80 -- U+1DBF: Phonetic Extensions Supplement
U+1DC0 -- U+1DFF: Combining Diacritical Marks Supplement
U+1E00 -- U+1EFF: Latin Extended Additional
U+1F00 -- U+1FFF: Greek Extended
U+2000 -- U+206F: General Punctuation
U+2070 -- U+209F: Superscripts and Subscripts
U+20A0 -- U+20CF: Currency Symbols
U+20D0 -- U+20FF: Combining Diacritical Marks for Symbols
U+2100 -- U+214F: Letterlike Symbols
U+2150 -- U+218F: Number Forms
U+2190 -- U+21FF: Arrows
U+2200 -- U+22FF: Mathematical Operators
U+2300 -- U+23FF: Miscellaneous Technical
U+2400 -- U+243F: Control Pictures
U+2440 -- U+245F: Optical Character Recognition
U+2460 -- U+24FF: Enclosed Alphanumerics
U+2500 -- U+257F: Box Drawing
U+2580 -- U+259F: Block Elements
U+25A0 -- U+25FF: Geometric Shapes
U+2600 -- U+26FF: Miscellaneous Symbols
U+2700 -- U+27BF: Dingbats
U+27C0 -- U+27EF: Miscellaneous Mathematical Symbols-A
U+27F0 -- U+27FF: Supplemental Arrows-A
U+2800 -- U+28FF: Braille Patterns
U+2900 -- U+297F: Supplemental Arrows-B
U+2980 -- U+29FF: Miscellaneous Mathematical Symbols-B
U+2A00 -- U+2AFF: Supplemental Mathematical Operators
U+2B00 -- U+2BFF: Miscellaneous Symbols and Arrows
U+2C00 -- U+2C5F: Glagolitic
U+2C60 -- U+2C7F: Latin Extended-C
U+2C80 -- U+2CFF: Coptic
U+2D00 -- U+2D2F: Georgian Supplement
U+2D30 -- U+2D7F: Tifinagh
U+2D80 -- U+2DDF: Ethiopic Extended
U+2E00 -- U+2E7F: Supplemental Punctuation
U+2E80 -- U+2EFF: CJK Radicals Supplement
U+2F00 -- U+2FDF: Kangxi Radicals
U+2FF0 -- U+2FFF: Ideographic Description Characters
U+3000 -- U+303F: CJK Symbols and Punctuation
U+3040 -- U+309F: Hiragana
U+30A0 -- U+30FF: Katakana
U+3100 -- U+312F: Bopomofo
U+3130 -- U+318F: Hangul Compatibility Jamo
U+3190 -- U+319F: Kanbun
U+31A0 -- U+31BF: Bopomofo Extended
U+31C0 -- U+31EF: CJK Strokes
U+31F0 -- U+31FF: Katakana Phonetic Extensions
U+3200 -- U+32FF: Enclosed CJK Letters and Months
U+3300 -- U+33FF: CJK Compatibility
U+3400 -- U+4DBF: CJK Unified Ideographs Extension A
U+4DC0 -- U+4DFF: Yijing Hexagram Symbols
U+4E00 -- U+9FFF: CJK Unified Ideographs
U+A000 -- U+A48F: Yi Syllables
U+A490 -- U+A4CF: Yi Radicals
U+A700 -- U+A71F: Modifier Tone Letters
U+A720 -- U+A7FF: Latin Extended-D
U+A800 -- U+A82F: Syloti Nagri
U+A840 -- U+A87F: Phags-pa
U+AC00 -- U+D7AF: Hangul Syllables
U+D800 -- U+DB7F: High Surrogates
U+DB80 -- U+DBFF: High Private Use Surrogates
U+DC00 -- U+DFFF: Low Surrogates
U+E000 -- U+F8FF: Private Use Area
U+F900 -- U+FAFF: CJK Compatibility Ideographs
U+FB00 -- U+FB4F: Alphabetic Presentation Forms
U+FB50 -- U+FDFF: Arabic Presentation Forms-A
U+FE00 -- U+FE0F: Variation Selectors
U+FE10 -- U+FE1F: Vertical Forms
U+FE20 -- U+FE2F: Combining Half Marks
U+FE30 -- U+FE4F: CJK Compatibility Forms
U+FE50 -- U+FE6F: Small Form Variants
U+FE70 -- U+FEFF: Arabic Presentation Forms-B
U+FF00 -- U+FFEF: Halfwidth and Fullwidth Forms
U+FFF0 -- U+FFFF: Specials
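To see which block a given character falls into, a table like the one above can be turned into a small lookup. The sketch below uses only a handful of rows copied from the table; the function name `blockOf` and the choice of rows are assumptions made for illustration.

```javascript
// Excerpt of the block table above: [start, end, name].
// The rows are copied from the table; the selection itself is arbitrary.
const blocks = [
  [0x0000, 0x007F, "Basic Latin"],
  [0x0370, 0x03FF, "Greek and Coptic"],
  [0x3040, 0x309F, "Hiragana"],
  [0x30A0, 0x30FF, "Katakana"],
  [0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"],
  [0x4E00, 0x9FFF, "CJK Unified Ideographs"],
  [0xAC00, 0xD7AF, "Hangul Syllables"],
  [0xE000, 0xF8FF, "Private Use Area"],
  [0xF900, 0xFAFF, "CJK Compatibility Ideographs"],
  [0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"],
];

// Return the block name for the first code point of `ch`, or null when
// the code point is not covered by this excerpt.
function blockOf(ch) {
  const cp = ch.codePointAt(0);
  const row = blocks.find(([start, end]) => cp >= start && cp <= end);
  return row ? row[2] : null;
}

for (const ch of ["中", "\u9FA6", "\u0391", "あ", "한", "\uFFE5"]) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log("U+" + hex, blockOf(ch));
}
```

Running it on the characters discussed earlier places U+9FA6 in CJK Unified Ideographs, U+0391 in Greek and Coptic, and U+FFE5 in Halfwidth and Fullwidth Forms.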

(Editor: 李大同)
