

Published: 2020-12-14 18:26:56 | Category: Resources | Source: compiled from the web
This is my first question on SO, so please tell me if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table containing test cases. Here is a simplified example:
library(data.table)

texts.dt <- data.table(
  string = c("one", "two words", "three words here", "four useless words here",
             "five useless meaningless words here",
             "six useless meaningless words here just",
             "seven useless meaningless words here just to",
             "eigth useless meaningless words here just to fill",
             "nine useless meaningless words here just to fill up",
             "ten useless meaningless words here just to fill up space"),
  word.count = 1:10,
  stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5)
)


                                                      string word.count stop.at.word
 1:                                                      one          1            0
 2:                                                two words          2            1
 3:                                         three words here          3            2
 4:                                  four useless words here          4            2
 5:                      five useless meaningless words here          5            4
 6:                  six useless meaningless words here just          6            3
 7:             seven useless meaningless words here just to          7            3
 8:        eigth useless meaningless words here just to fill          8            6
 9:      nine useless meaningless words here just to fill up          9            7
10: ten useless meaningless words here just to fill up space         10            5

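In the real application, the values in the stop.at.word column are determined randomly (upper bound = word.count - 1). The strings are also not sorted by length, but that should not make a difference.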
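For illustration, random stop positions that respect this upper bound could be drawn as follows. This is only a minimal sketch: the lower bound of 0 and the seed are assumptions, since the question states only the upper bound.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

word.count <- 1:10

# sample.int(n, 1L) draws one value from 1..n, so subtracting 1 gives 0..(n - 1),
# i.e. a value that never exceeds word.count - 1.
stop.at.word <- vapply(word.count,
                       function(n) sample.int(n, 1L) - 1L,
                       integer(1))
```

Drawing per row with vapply() avoids the pitfall of sample(n) treating a single number n as the range 1..n in an unexpected place.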


Desired result:

                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just


Instead, my attempt (the code that follows) produces this:

                                                      string word.count stop.at.word input output
 1:                                                      one          1            0
 2:                                                two words          2            1    NA     NA
 3:                                         three words here          3            2    NA     NA
 4:                                  four useless words here          4            2    NA     NA
 5:                      five useless meaningless words here          5            4    NA     NA
 6:                  six useless meaningless words here just          6            3    NA     NA
 7:             seven useless meaningless words here just to          7            3    NA     NA
 8:        eigth useless meaningless words here just to fill          8            6    NA     NA
 9:      nine useless meaningless words here just to fill up          9            7    NA     NA
10: ten useless meaningless words here just to fill up space         10            5   ten     NA



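# Attempt: find the space positions with gregexpr() and cut with substr().
texts.dt[, c("input", "output") := .(
  substr(string, 1, sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
  substr(string, sapply(gregexpr(" ", string), "[", stop.at.word),
         sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
)]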
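A likely reason the attempt above returns mostly NA is that sapply() passes the entire stop.at.word vector to `[` for every string, instead of one value per row. The sketch below shows the difference on a hypothetical two-string example (the names strings, stops, bad and good are made up for illustration):

```r
strings <- c("two words", "three words here")
stops   <- c(1, 2)

# One integer vector of space positions per string.
spaces <- gregexpr(" ", strings)

# sapply() indexes every element of `spaces` with the whole `stops` vector,
# so the result is a matrix rather than one position per string.
bad <- sapply(spaces, "[", stops)

# mapply() walks both inputs in parallel: string i is indexed with stops[i].
good <- mapply(function(pos, n) pos[n], spaces, stops)

# Cut each string just before its selected space.
input <- substr(strings, 1, good - 1)
```

Pairing each row with its own index, whether through mapply() or by grouping per row, is the essential change; the rest of the substr() logic can stay as it was.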




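# Split each string into words, then take the first n words and word n + 1.
texts.dt[stop.at.word > 0, c("input", "output") := {
  sp = strsplit(string, " ")
  list(mapply(function(p, n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
       mapply(`[`, sp, stop.at.word + 1L))
}]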

# partial result

                    string word.count stop.at.word        input output
1:                     one          1            0           NA     NA
2:               two words          2            1          two  words
3:        three words here          3            2  three words   here
4: four useless words here          4            2 four useless  words


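library(stringi)  # provides stri_match()
texts.dt[stop.at.word > 0, c("input", "output") := {
  patt = paste0("((\\w+ ){", stop.at.word - 1, "}\\w+) (.*)")
  m    = stri_match(string, regex = patt)
  # The source is cut off here; returning capture groups 1 and 3 is an assumed
  # completion. Note that (.*) captures everything after word n, not one word.
  list(m[, 2], m[, 4])
}]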


