Error in regex pattern matching for retrieving text into two columns of a data frame

Published: 2020-12-14 06:03:43  Category: Encyclopedia  Source: compiled from the web


Consider the following hypothetical data:

x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"


y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : 
If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify 
the row names and not a column (by name or number) Can we go : Please"

df <- data.frame(Text = c(x,y,z),row.names = NULL,stringsAsFactors = F)

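Notice that ":" appears at a different position in each text. For example:

>In 'x' it (":") comes after the first sentence.
>In 'y' it (":") comes after the fourth sentence.
>And in 'z' it comes after the sixth sentence.
>In addition, each text has another ":" just before its last sentence.

What I want to do is create two columns such that:

>Only the first ":" is considered, not the last one.
>If a ":" occurs within the first three sentences, the whole text is split into two columns; otherwise all of the text stays in the second column and "NA" goes in the first column.

Desired output for 'x':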

Col1                                                        Col2 
 There is a horror movie running in the iNox theater.        If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

Desired output for 'y' (because the ":" is not found within the first three sentences):

Col1     Col2 
 NA       There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number) Can we go : Please

As with the result for 'y' above, the desired output for 'z' should be:

Col1    Col2
  NA      all of the text from 'z'

What I tried:

resX <- data.frame(Col1 = gsub("\\s:.*$","\\1",df$Text[[1]]),Col2 = gsub("^[^:]+(?:).\\s","\\1",df$Text[[1]]))

resY <- data.frame(Col1 = gsub("\\s:.*$","\\1",df$Text[[2]]),Col2 = gsub("^[^:]+(?:).\\s","\\1",df$Text[[2]]))

resZ <- data.frame(Col1 = gsub("\\s:.*$","\\1",df$Text[[3]]),Col2 = gsub("^[^:]+(?:).\\s","\\1",df$Text[[3]]))

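Then I combine the above into a result data frame "resDF" using rbind.

The problems are:

>Can the above be done with a "for()" loop or any other method that makes the code simpler?
>The results for the 'y' and 'z' texts are not what I want (as shown above).

Solution

You can try this regex with a negative lookahead: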

^(?s)(?!(?:(?:[^:]*?\.){3,}))(.*?):(.*)$

Regex Demo and Detailed explanation of the regex
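
To make the lookahead concrete, here is a minimal sketch in base R on two toy strings. The strings and the variable names `early`, `late`, and `m` are made up for illustration and are not part of the question's data. Note that the pattern counts literal periods before the first ":", not sentences, so a token such as row.names adds one to the count. Inside an R string the backslash has to be doubled.

```r
# Pattern from the answer, written as an R string (note the doubled backslash).
patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"

# Hypothetical toy inputs, not the texts from the question.
early <- "One. : Two. Three. Four : Five"   # one period before the first ":"
late  <- "One. Two. Three. Four. : Five"    # four periods before the first ":"

# The negative lookahead rejects strings that start with three or more
# period-terminated chunks containing no ":".
grepl(patt, c(early, late), perl = TRUE)

# For a matching string, element 2 is the text before the first ":" and
# element 3 is everything after it (including the later ":").
m <- regmatches(early, regexec(patt, early, perl = TRUE))[[1]]
trimws(m[2])
trimws(m[3])
```

The `trimws()` calls are an addition here; without them the captured groups keep the spaces around the ":".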

Updated:

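If your condition is met, the regex matches and you get two capture groups:

Group 1 contains the text before the first ":" and group 2 contains the text after it.

If the condition is not met, copy the whole string into column 2 and put whatever you need in column 1.

The updated sample snippet below contains a function named processData that does the trick. If the condition is met, it splits the data into col1 and col2. If the condition is not met, as for inputs y and z, it puts NA in col1 and the whole value in col2.

Running sample source -> ideone: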

library(stringr)

    x <- "There is a horror movie running in the iNox theater. : If row names are supplied of length one and the data 
    frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"


    y <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data 
    frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : 
    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"

    z <- "There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). 
    If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify 
    the row names and not a column (by name or number) Can we go : Please"             


df <- data.frame(Text = c(x,y,z),row.names = NULL,stringsAsFactors = F)

resDF <- data.frame("Col1" = character(),"Col2" = character(),stringsAsFactors=FALSE)

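processData <- function(a) {
    # Match only when fewer than three periods precede the first ":".
    patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
    if (grepl(patt, a, perl = TRUE)) {
        # str_match returns a 1 x 3 matrix: full match, group 1, group 2.
        result <- str_match(a, patt)
        col1 <- result[2]
        col2 <- result[3]
    } else {
        # No early ":": keep the whole text in the second column.
        col1 <- "NA"
        col2 <- a
    }
    return(c(col1, col2))
}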



for (i in seq_len(nrow(df))) {
    tmp <- df[i, ]
    resDF[nrow(resDF) + 1, ] <- processData(tmp)
}


print(resDF)

Sample output:

Col1
1 There is a horror movie running in the iNox theater. 
2                                                    NA
3                                                    NA
                                                                                                                                                                                                                                                                                                                                                                                                                              Col2
1                                                        If row names are supplied of length one and the data \n    frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
2 There is a horror movie running in the iNox theater. If row names are supplied of length one and the data \n    frame has a single row,the row.names is taken. To specify the row names and not a column. By name or number. : \n    If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
3      There is a horror movie running in the iNox theater. If row names are supplied of length one and the data frame has a single row,the row.names is taken to specify the row names and not a column (by name or number). \n    If row names are supplied of length one. : And the data frame has a single row,the row.names is taken to specify \n    the row names and not a column (by name or number) Can we go : Please
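
The question also asks whether the row-by-row handling can be simplified. One possible sketch, using only base R, applies the same pattern to a whole character vector at once. The function name `split_on_early_colon` and the toy input are hypothetical. As assumptions that differ from the answer's code, this version stores a real missing value (`NA_character_`) rather than the string "NA", and trims the whitespace around the split.

```r
# Hypothetical helper: split each string at its first ":" when fewer than
# three periods precede it; otherwise keep the whole string in Col2.
split_on_early_colon <- function(text) {
  patt <- "^(?s)(?!(?:(?:[^:]*?\\.){3,}))(.*?):(.*)$"
  # One list element per input string; character(0) where nothing matched.
  hits <- regmatches(text, regexec(patt, text, perl = TRUE))
  rows <- Map(function(m, original) {
    if (length(m) == 3) {
      c(Col1 = trimws(m[2]), Col2 = trimws(m[3]))
    } else {
      # Assumption: a real NA instead of the string "NA".
      c(Col1 = NA_character_, Col2 = original)
    }
  }, hits, text)
  as.data.frame(do.call(rbind, unname(rows)), stringsAsFactors = FALSE)
}

# Toy input (made up for illustration).
toy <- c("One. : Two. Three. Four : Five",
         "One. Two. Three. Four. : Five")
res <- split_on_early_colon(toy)
print(res)
```

With the data from the question, the call would be `split_on_early_colon(df$Text)`, which replaces both the loop and the pre-allocated resDF.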

(Editor: 李大同)
