How do I create a string with "bad encoding" in Ruby?

Published: 2020-12-17 03:31:32 | Category: Encyclopedia | Source: compiled from the web


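There is a file somewhere in production that I cannot access. When a Ruby script loads it, a regular expression run against its contents fails with ArgumentError => invalid byte sequence in UTF-8.

I believe I have a fix, based on the top-voted answer to this question: ruby 1.9: invalid byte sequence in UTF-8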

# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str
  # edited based on matt's comment (thanks matt)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

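However, I now want to write an rspec test to verify that the code works. I cannot access the file that causes the problem, so I would like to create a string with a bad encoding programmatically.

I have tried variations of the following: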

bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length

or,

bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length

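But the lengths are always the same. I have also tried different character ranges, not only 100 to 1000.

Any suggestions on how to build a string with an invalid encoding in a Ruby 1.9.3 script?

Solution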

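Your safe_str method (as it stood when this answer was written, apparently before the edit mentioned in its code comment) never actually does anything to the string; it is a no-op. The documentation for String#encode on Ruby 1.9.3 says: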

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

This is still true of the current release, 2.0.0 (patch level 247), but a recent commit to Ruby trunk changes it, and also introduces a scrub method that does pretty much what you want.
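For readers on a newer interpreter, here is a minimal sketch of the scrub approach. It assumes Ruby 2.1 or later, where String#scrub is available. The helper name safe_str_scrub and the three-byte test string are made up for illustration and do not come from the original answer.

```ruby
# Assumes Ruby 2.1+ (String#scrub). safe_str_scrub is a hypothetical name.
def safe_str_scrub(str)
  # Replace every invalid byte sequence with the empty string,
  # i.e. drop it. Returns a new string; the receiver is untouched.
  str.scrub('')
end

# "a", a byte that can never occur in UTF-8 (0xFF), then "b".
bad = [0x61, 0xFF, 0x62].pack('C*').force_encoding('UTF-8')
clean = safe_str_scrub(bad)

puts bad.valid_encoding?
puts clean.valid_encoding?
puts clean
```

Calling scrub with no argument would instead substitute the Unicode replacement character U+FFFD for each bad sequence, which keeps the damage visible rather than silently removing it.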

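Until a new version of Ruby is released, you need to round-trip your string to another encoding and back to clean it, as in the second example of this answer to the question you linked to. Something like:

def safe_str str
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end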

Note that your first attempt at creating an invalid string will not work:

bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true

From the docs for <<:

If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.

So you will always end up with a valid string.
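A small sketch of that behaviour, assuming the receiver is a UTF-8 string. The value 233 is an arbitrary example, chosen because it is above 127:

```ruby
# Start from an empty, mutable string tagged as UTF-8.
s = ''.dup.force_encoding('UTF-8')

# 233 is treated as the codepoint U+00E9, not as the raw byte 0xE9,
# so Ruby appends its proper two-byte UTF-8 encoding.
s << 233

p s.bytes            # [195, 169]
p s.length           # 1
p s.valid_encoding?  # true
```

Because every integer is encoded as a well-formed character, no range of integers appended this way can produce an invalid byte sequence.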

Your second approach, using pack, creates a string whose encoding is ASCII-8BIT. If you then change that with force_encoding, you get a UTF-8 string with an invalid encoding:

bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
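Putting the pieces together, the check the question asks for could look roughly like this in plain Ruby, without RSpec. The three-byte fixture is an assumption made for illustration (0xFF can never appear in well-formed UTF-8), and safe_str is the round-trip version from the question:

```ruby
# safe_str as given in the question: round-trip through UTF-16,
# dropping invalid and undefined sequences on the way.
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# Hypothetical fixture: "a", one invalid byte (0xFF), "b".
bad_str = [0x61, 0xFF, 0x62].pack('C*').force_encoding('utf-8')
clean_str = safe_str(bad_str)

puts bad_str.valid_encoding?
puts clean_str.valid_encoding?
puts bad_str.length - clean_str.length  # number of characters removed
```

A short, hand-picked fixture makes the expected result easy to state exactly, which is harder with a long range such as (100..1000), where it is not obvious how many bytes should be removed.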

(Editor: Li Datong)
