
Writing an XML header with lxml

Published: 2020-12-16 23:14:18 | Category: Encyclopedia | Source: compiled from the web
I am currently writing a script that converts a batch of XML files from various encodings to UTF-8.

I first try to determine the encoding with lxml:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

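If that comes back empty, I try to get it from chardet:

def guess_source_encoding(self):
    # inputfile must be opened in binary mode: chardet works on bytes
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
    # chardet.detect returns a dict; its 'encoding' value may be None
    return (chardet.detect(chunk)['encoding'] or '').lower()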

I then convert the file's encoding with the codecs module:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break

                destination.write(chunk)
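
The chunked conversion above can also be sketched with `io.open` in text mode, which behaves the same way on Python 2.7 and Python 3. This is a minimal sketch, not the author's code: the function name `transcode_to_utf8` and its parameters are hypothetical, and it assumes the source encoding is already known and is one Python has a codec for.

```python
import io


def transcode_to_utf8(input_filename, output_filename, source_encoding,
                      chunk_size=16 * 1024):
    """Copy a text file to output_filename, re-encoding it as UTF-8.

    Hypothetical helper for illustration; it only transcodes the bytes and
    does not touch the XML declaration inside the file.
    """
    # newline="" disables newline translation, so \r\n and \n pass through unchanged.
    with io.open(input_filename, "r", encoding=source_encoding, newline="") as source:
        with io.open(output_filename, "w", encoding="utf-8", newline="") as destination:
            while True:
                # In text mode, read() returns up to chunk_size decoded characters.
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                destination.write(chunk)
```

Note that after this step the file content is UTF-8 but any `encoding="..."` attribute in its declaration still names the old encoding, which is why the header has to be rewritten as well.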

Finally, I am trying to rewrite the XML header. If the original XML header was

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I want to turn it into

<?xml version="1.0" encoding="UTF-8"?>
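
To make the intended transformation concrete, here is a rough sketch that rewrites the declaration at the byte level with a regular expression instead of lxml. It is only an illustration of the target behavior: the helper name `force_utf8_declaration` is hypothetical, and it assumes the data is already UTF-8, has no byte order mark, and uses XML version 1.0.

```python
import re

# Matches an XML declaration at the very start of the data.
# The \s after "xml" keeps it from matching e.g. <?xml-stylesheet ...?>.
_DECLARATION = re.compile(br'^<\?xml\s[^>]*\?>')

_UTF8_DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>'


def force_utf8_declaration(data):
    """Return data with its XML declaration replaced by a UTF-8 one.

    Hypothetical helper: assumes data is UTF-8 bytes without a BOM and
    that the document is XML 1.0.
    """
    if _DECLARATION.match(data):
        # Replace the existing declaration, with or without an encoding attribute.
        return _DECLARATION.sub(_UTF8_DECLARATION, data, count=1)
    # No declaration present: prepend one on its own line.
    return _UTF8_DECLARATION + b"\n" + data
```

Both header variants shown above, as well as a file with no declaration at all, end up with the same UTF-8 declaration.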

My current code does not seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

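The file I am testing with has a header that does not specify an encoding. How can I get the header written out correctly with the encoding specified?

Solution

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs: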

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

Example from IPython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>
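
The three settings described in the docs quote can be compared side by side. The sketch below uses `etree.tostring`, which accepts the same `encoding` and `xml_declaration` keyword arguments as `write`, so that the results are plain byte strings. The variable names are illustrative, and ISO-8859-1 is used as the non-UTF-8 example only because it is handled by libxml2 itself.

```python
from lxml import etree

root = etree.XML("<hi/>")

# True: always emit the declaration, even for UTF-8.
always = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

# False: never emit the declaration.
never = etree.tostring(root, encoding="UTF-8", xml_declaration=False)

# None (the default): no declaration for UTF-8 or US-ASCII output...
default_utf8 = etree.tostring(root, encoding="UTF-8")

# ...but a declaration is added for any other encoding.
default_latin1 = etree.tostring(root, encoding="ISO-8859-1")

for label, result in [("always", always), ("never", never),
                      ("default_utf8", default_utf8),
                      ("default_latin1", default_latin1)]:
    print(label, result)
```

This is why the original `tree.write(destination, encoding="UTF-8")` call produced no header: with the default of None, UTF-8 output gets no declaration unless it is requested explicitly.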

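On reflection, I think you are working too hard. lxml detects the encoding automatically and parses the file correctly based on it.

So all you really need to do (at least in Python 2.7) is:

def convert_encoding(self, input_filename, output_filename):
    tree = etree.parse(input_filename)
    # open in binary mode: tree.write emits encoded bytes
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)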
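
As a quick end-to-end check of this approach, the following sketch writes a small non-UTF-8 file, converts it, and reads the result back. The function name `convert_to_utf8`, the file names, and the sample document are all made up for illustration, and ISO-8859-1 stands in for the various source encodings mentioned in the question.

```python
import os
import tempfile

from lxml import etree


def convert_to_utf8(input_filename, output_filename):
    """Parse an XML file in its declared encoding and rewrite it as UTF-8."""
    tree = etree.parse(input_filename)
    # write() accepts a file name and serializes bytes in the requested encoding.
    tree.write(output_filename, encoding="UTF-8", xml_declaration=True)


workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.xml")
target_path = os.path.join(workdir, "target.xml")

# Sample input: a Latin-1 document whose header declares its encoding.
sample = u'<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\u00e9</name>'
with open(source_path, "wb") as handle:
    handle.write(sample.encode("iso-8859-1"))

convert_to_utf8(source_path, target_path)

with open(target_path, "rb") as handle:
    converted = handle.read()

print(converted)
```

The output file starts with a UTF-8 declaration and the non-ASCII character is stored as UTF-8 bytes, so no separate transcoding or header-editing pass is needed when the source file declares its encoding correctly.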

(Editor: 李大同)
