
Error in regex pattern matching when splitting text into two columns of a data frame

Published: 2020-12-14 06:03:43 | Category: Encyclopedia | Source: compiled from the web
Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

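Notice that ":" appears at a different position in each text. For example:

>In 'x', the ":" comes after the first sentence.
>In 'y', the ":" comes after the fourth sentence.
>In 'z', it comes after the sixth sentence.
>In addition, each text has another ":" just before its last sentence.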

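What I want is to create two columns such that:

>Only the first ":" is considered, not the last one.
>If a ":" occurs within the first three sentences, the whole text is split into two columns at that ":"; otherwise all of the text goes into the second column and "NA" goes into the first.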
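
One way to read these two rules is to locate the first ":" and count how many periods come before it. The sketch below does that in base R on toy strings. The helper name `periods_before_first_colon` and the toy inputs are hypothetical and not part of the question, and counting periods is only an approximation of counting sentences (the period inside `row.names` would be counted too).

```r
# Hypothetical helper: number of literal periods before the first ":".
periods_before_first_colon <- function(text) {
  # Position of the first ":" in each string; -1 when there is none.
  pos <- as.integer(regexpr(":", text, fixed = TRUE))
  # Text before the first ":" (or the whole string if there is no ":").
  prefix <- ifelse(pos > 0, substr(text, 1, pos - 1), text)
  # Count literal "." characters in that prefix.
  lengths(regmatches(prefix, gregexpr(".", prefix, fixed = TRUE)))
}

# Toy inputs (assumed for illustration only).
toy <- c("One. : Two. Three : end",
         "One. Two. Three. Four. : Five : end")
counts <- periods_before_first_colon(toy)
print(counts)
```

A count below some threshold would then mean "split at the first colon", and anything else would mean "keep the whole text in the second column".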

Desired output for 'x':

Col1                                                        Col2 
 There is a horror movie running in the iNox theater.        If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Desired output for 'y' (because no ":" is found within the first three sentences):

Col1     Col2 
 NA       There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

As with 'y' above, the desired output for 'z' is:

Col1    Col2
  NA      all of the text from 'z'

What I tried:

resX <- data.frame(Col1 = gsub("\\s:.*$", "\\1", df$Text[[1]]), Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s:.*$", "\\1", df$Text[[2]]), Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s:.*$", "\\1", df$Text[[3]]), Col2 = gsub("^[^:]+(?:).\\s", "\\1", df$Text[[3]]))

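I then use rbind to combine the results above into a single data frame, "resDF".

My questions are:

>Can the above be done with a for() loop or some other method that makes the code simpler?
>The results for the 'y' and 'z' texts are not what I want (as shown above).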

Solution

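You can try this regex, which uses a negative lookahead (when written as an R string literal, the backslash must be doubled: "\\."):

^(?s)(?!(?:(?:[^:]*?\.){3,}))(.*?):(.*)$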

Regex Demo and Detailed explanation of the regex
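
To see what the lookahead does, the sketch below applies the pattern to two short toy strings with base R's `grepl()` and `regexec()` (with `perl = TRUE`). The strings `early` and `late` are hypothetical and not from the question. Note that the pattern counts literal periods rather than sentences, so a period inside a name such as `row.names` is counted as well.

```r
# The regex from the answer, as an R string (backslash doubled).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Toy inputs (assumed for illustration only).
early <- "One. : Two. Three. Four : end"   # one period before the first ":"
late  <- "One. Two. Three. : Four : end"   # three periods before the first ":"

# The negative lookahead rejects a string when three or more periods
# occur before the first ":".
early_ok <- grepl(patt, early, perl = TRUE)
late_ok  <- grepl(patt, late, perl = TRUE)

# For a matching string, element 2 is the text before the first ":"
# and element 3 is everything after it (surrounding spaces are kept).
m <- regmatches(early, regexec(patt, early, perl = TRUE))[[1]]
print(c(early_ok, late_ok))
print(m[2:3])
```

Because `(.*?)` is lazy it stops at the first ":", while the greedy `(.*)` takes the rest of the string, including any later ":".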

Updated:

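If your condition is met, the regex matches and you get two capture groups.

Group 1 contains the text before the first ":" and group 2 contains the text after it.

If the condition is not met, copy the whole string into column 2 and put whatever you need into column 1.

The updated sample snippet below contains a function named processData that does this for you. If the condition is met, it splits the data and puts the parts into Col1 and Col2. If the condition is not met, as with inputs y and z, it puts NA in Col1 and the whole value in Col2.

Runnable sample source (ideone):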

library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x, y, z), row.names = NULL, stringsAsFactors = FALSE)

resDF <- data.frame("Col1" = character(),"Col2" = character(),stringsAsFactors=FALSE)

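processData <- function(a) {
    # In an R string the backslash is doubled: "\\." matches a literal period.
    patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
    if (grepl(patt, a, perl = TRUE)) {
        # Fewer than three periods precede the first ":", so split there.
        result <- str_match(a, patt)
        col1 <- result[2]
        col2 <- result[3]
    } else {
        # Keep the whole text in Col2; "NA" here is a literal string.
        col1 <- "NA"
        col2 <- a
    }
    return(c(col1, col2))
}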



for (i in seq_len(nrow(df))) {
    tmp <- df$Text[i]
    resDF[nrow(resDF) + 1, ] <- processData(tmp)
}


print(resDF)

Sample output:

Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
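
The question also asked whether the explicit loop can be avoided. A minimal vectorized sketch in base R follows, assuming the same pattern as above. The function name `split_first_colon` and the toy input are hypothetical, and unlike the answer it stores a real missing value (`NA_character_`) rather than the string "NA" and trims surrounding whitespace from both parts.

```r
# Hypothetical vectorized variant of the split; not the answer's own code.
split_first_colon <- function(text) {
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  # One list element per input: full match plus two groups, or character(0).
  m <- regmatches(text, regexec(patt, text, perl = TRUE))
  hit <- lengths(m) == 3L
  # Default: nothing in Col1, whole text in Col2.
  col1 <- rep(NA_character_, length(text))
  col2 <- text
  # Where the pattern matched, fill in the two captured parts.
  col1[hit] <- trimws(vapply(m[hit], `[`, character(1), 2L))
  col2[hit] <- trimws(vapply(m[hit], `[`, character(1), 3L))
  data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
}

# Toy input (assumed for illustration only).
toy <- c("One. : Two. Three. Four : end",
         "One. Two. Three. : Four : end")
res <- split_first_colon(toy)
print(res)
```

With the question's data this would be called as `split_first_colon(df$Text)`, which returns the whole result in one step and makes the `rbind` stage unnecessary.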
