Filtering illegal XML characters

Published: 2020-12-16 09:04:12  Category: Encyclopedia  Source: compiled from the web

Character range supported by XML

Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

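In other words, XML accepts any Unicode character except the surrogate blocks, FFFE, and FFFF. The production also leaves out every control character below 0x20 other than tab (0x9), line feed (0xA), and carriage return (0xD).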

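The ranges 0xD800 to 0xDBFF (high surrogates) and 0xDC00 to 0xDFFF (low surrogates) are called the surrogate blocks.

The surrogate blocks exist to represent supplementary characters, which are the characters in the range [#x10000-#x10FFFF].

Supplementary characters are the characters that 16-bit Unicode cannot represent. Unicode was originally designed as a fixed-width 16-bit character encoding, but the 65,536 values of a 16-bit encoding cannot cover every character in use, or once in use, around the world. The Unicode standard was therefore extended to hold up to 1,112,064 characters, and the characters added beyond the 16-bit range are the supplementary characters.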
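The relationship between a supplementary character and its surrogate pair is plain arithmetic, so a small sketch may help. The class and method names below (`SurrogateDemo`, `toSurrogates`, `toCodePoint`) are made up for illustration, and U+1F600 is just an arbitrary supplementary code point chosen as input.

```java
import java.util.Arrays;

// Illustrative sketch only: the names are hypothetical, not from the article.
final class SurrogateDemo {

    // Split a supplementary code point (0x10000 to 0x10FFFF) into two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Inverse operation: combine a high and a low surrogate into one code point.
    static int toCodePoint(char high, char low) {
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        System.out.printf("back -> U+%X%n", toCodePoint(pair[0], pair[1]));
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(cp)));
    }
}
```

For U+1F600 the pair works out to D83D DE00. Neither code unit is a legal XML character on its own, but the pair decodes to a code point inside [#x10000-#x10FFFF], which is legal.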

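The characters that need filtering in XML fall into two categories:

The first category is characters that may not appear in XML at all, because they lie outside the range XML defines.

The second category is characters that XML itself uses as markup; when they occur in content they must be replaced with something else.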

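Category 1 characters

By the Char production above, these are 0x00-0x08, 0x0B-0x0C, 0x0E-0x1F, the surrogate blocks 0xD800-0xDFFF, and 0xFFFE-0xFFFF. The .NET, PB8 and C++ snippets later in this article filter the first three of these ranges.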
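As a rough illustration of how category 1 characters could be removed rather than merely counted, the following sketch walks a string by code point and keeps only what the Char production allows. `XmlCharStripper`, `isXmlChar` and `stripInvalid` are hypothetical names introduced here. Dropping the offending characters silently is an assumption; replacing them with a placeholder may suit some applications better.

```java
// Illustrative sketch only: the names are hypothetical, not from the article.
final class XmlCharStripper {

    // True if the code point matches the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every disallowed code point removed.
    static String stripInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt decodes a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself and is rejected below.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "card" + (char) 0x1E + "number" + (char) 0xFFFE;
        System.out.println(stripInvalid(dirty)); // prints "cardnumber"
    }
}
```

Working on code points rather than on `char` values keeps valid surrogate pairs intact while still discarding unpaired surrogates.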

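Category 2 characters
There are five characters in the second category:
Name              Character   HTML entity   Character code
Ampersand (and)   &           &amp;         &#38;
Single quote      '           &apos;        &#39;
Double quote      "           &quot;        &#34;
Greater-than      >           &gt;          &#62;
Less-than         <           &lt;          &#60;
Replacing each of these five characters with its entity is all that is needed.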
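A minimal Java sketch of that replacement could look like the following. `XmlEscape` and `escapeXml` are hypothetical names chosen for this example. It handles only the five category 2 characters and assumes the category 1 characters have already been dealt with.

```java
// Illustrative sketch only: the names are hypothetical, not from the article.
final class XmlEscape {

    // Replace the five XML markup characters with their predefined entities.
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                case '>':  out.append("&gt;");   break;
                case '<':  out.append("&lt;");   break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("a < b && \"c\" > 'd'"));
    }
}
```

Doing the replacement in a single pass avoids the classic mistake of chained string replacements, where an `&` produced by an earlier step gets escaped a second time.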

Related code:

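In .NET, the Replace method of Regex can replace the characters in these three ranges, for example:
// requires: using System.Text.RegularExpressions;
string content = "as fas fasfadfasdfasdf<234234546456";
// A verbatim string (@"...") passes the \x escapes through to the regex engine unchanged.
content = Regex.Replace(content, @"[\x00-\x08\x0b-\x0c\x0e-\x1f]", "*");
Response.Write(content);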

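In PowerBuilder 8 (PB8), the characters in this range can be filtered as follows:
string content = "as fas fasfadfasdfasdf<234234546456"
long vi, vpos, vlen
int i_count_eliminate = 30
// Characters to eliminate; each one is replaced with an empty string.
// ~nnn is a decimal character code, so ~001 to ~031 covers 0x01 to 0x1F,
// skipping tab (~009), line feed (~010) and carriage return (~013).
// This list also strips the double quote and the backtick.
char i_spechar_eliminate[] = {"~001","~002",&
    "~003","~004","~005","~006","~007",&
    "~008","~011","~012","~014","~015",&
    "~016","~017","~018","~019","~020",&
    "~021","~022","~023","~024","~025",&
    "~026","~027","~028","~029","~030",&
    "~031",'"',"`"}
for vi = 1 to i_count_eliminate
    vpos = 1
    vlen = lenw(i_spechar_eliminate[vi])
    do while true
        vpos = posw(content, i_spechar_eliminate[vi], vpos)
        if vpos < 1 then exit
        content = replacew(content, vpos, vlen, "")
    loop
next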

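With the C++ standard library (STL) it can be handled like this:
#include <string>

// Escape the five XML markup characters and drop the control characters
// XML does not allow (0x00-0x08, 0x0B-0x0C, 0x0E-0x1F).
std::string filter_xml_marks(const std::string& in)
{
    std::string out;
    out.reserve(in.length());
    for (std::string::size_type i = 0; i < in.length(); i++)
    {
        // Compare as unsigned so bytes >= 0x80 are never mistaken for control characters.
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c == '&')
            out += "&amp;";
        else if (c == '\'')
            out += "&apos;";
        else if (c == '"')
            out += "&quot;";
        else if (c == '<')
            out += "&lt;";
        else if (c == '>')
            out += "&gt;";
        else if (c <= 0x08 || c == 0x0b || c == 0x0c || (c >= 0x0e && c <= 0x1f))
            continue; // illegal control character: skip it
        else
            out += in[i];
    }

    return out;
}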

XMLCheck counts the illegal XML characters contained in an XML file.

Usage: XMLCheck filename

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class XMLCheck {

    /**
     * @author lxn
     */
    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: XMLCheck filename");
            return;
        }

        File xmlFile = new File(args[0]);
        if (!xmlFile.exists()) {
            System.out.println("File does not exist");
            return;
        }

        // Read the XML file into a String (FileReader uses the platform default charset).
        StringBuilder xmlSb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(xmlFile))) {
            String s;
            while ((s = in.readLine()) != null) {
                xmlSb.append(s).append('\n');
            }
        }
        String xmlString = xmlSb.toString();

        // Without special characters:
        // int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"gbk\"?><CC>卡号</CC>");
        // With a special character:
        // int i = checkCharacterData("<?xml version=\"1.0\" encoding=\"gbk\"?><CC>\u001E卡号</CC>");

        int errorChar = checkCharacterData(xmlString);
        System.out.println("This XML file contains " + errorChar + " illegal character(s).");
    }

    // Count the characters in text that XML does not allow.
    public static int checkCharacterData(String text) {
        int errorChar = 0;
        if (text == null) {
            return errorChar;
        }
        char[] data = text.toCharArray();
        for (int i = 0, len = data.length; i < len; i++) {
            char c = data[i];
            int result = c;
            // First check for the surrogate blocks. A supplementary character is
            // encoded as two code units: the first from the high-surrogate range
            // (0xD800 to 0xDBFF), the second from the low-surrogate range
            // (0xDC00 to 0xDFFF).
            if (result >= 0xD800 && result <= 0xDBFF && i + 1 < len) {
                int high = c;
                int low = data[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    // Decode the surrogate pair with the algorithm defined by Unicode.
                    // The result lies in 0x10000 to 0x10FFFF, which isXMLCharacter accepts.
                    result = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                    i++;
                }
                // Otherwise the high surrogate is unpaired: result stays in the
                // surrogate range and is counted as illegal below.
            }
            if (!isXMLCharacter(result)) {
                errorChar++;
            }
        }
        return errorChar;
    }

    // Test a code point against the Character Range of the XML specification.
    private static boolean isXMLCharacter(int c) {
        if (c <= 0xD7FF) {
            if (c >= 0x20) return true;
            return c == '\n' || c == '\r' || c == '\t';
        }
        if (c < 0xE000) return false;
        if (c <= 0xFFFD) return true;
        if (c < 0x10000) return false;
        return c <= 0x10FFFF;
    }

}

(Editor: Li Datong)