bash - How to run grep in parallel on single lines from a list

I am a beginner in bash and need some help to make this work more efficiently.
    while read -r line
    do
        echo "$line"
        file="Species.$line"
        grep -A 1 "$line" /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta > "$file"
    done < species1

The species file contains about 100,000 species names. The file I am searching is a 24 GB FASTA (text) file.

The format of the large file is:

    Domain;Phylum;Class;Order;Family;Genus;Species
    AGCT ---- AGCT (50,000 characters per line)

This is a sample of the species file (there are no empty lines in between):

    Alkanindiges_illinoisensis
    Alkanindiges_sp._JJ005
    Alligator_sinensis
    Allisonella_histaminiformans
    'Allium_cepa'
    Alloactinosynnema_album
    Alloactinosynnema_sp._Chem10
    Alloactinosynnema_sp._CNBC1
    Alloactinosynnema_sp._CNBC2
    Alloactinosynnema_sp._FMA
    Alloactinosynnema_sp._MN08-A0205
    Allobacillus_halotolerans
    Allochromatium_truperi
    Allochromatium_vinosum

This is the first line of the large file:

    HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina.............................................................................-UC-U-G--G-U---------------------------

(this goes on for 50,000 characters per line)

Here are some of the headers:

    >EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
    >X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
    >AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
    >AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
    >EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
    >AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
    >EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
    >JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
    >EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
    >JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
    >AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
    >JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
    >EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
    >U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
    >EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
    >U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
    >EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55

I need one file per species containing all matching sequences.

The code above works, but in 16 hours it managed to finish fewer than 2,000 species. I would like to run it in parallel to speed things up. Any other tips on making the search more efficient are also welcome.

Thanks

Solution
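This was a bit trickier than I expected, because the matching lines need to go to separate files. If you get the chance, please post the performance you see. This solution can also be run in parallel: the species list file can be split into chunks and/or the FASTA file can be split into chunks and fed to parallel runs of the script.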
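As a rough illustration of the chunking idea, the sketch below splits the species list into pieces and runs one fixed-string grep per piece in the background. It is not the answer's script: the function names `search_chunk` and `parallel_search`, the `chunk.` and `hits.` file prefixes, and the work-directory argument are all hypothetical, and the worker is a simplified stand-in that only collects the matching header/sequence pairs.

```shell
#!/usr/bin/env bash
# Sketch only: function names, chunk size and output names are hypothetical.

# One fixed-string grep pass over the FASTA file for one chunk of the list.
search_chunk() {
    local chunk=$1 fasta=$2 out=$3
    # -A1 keeps the sequence line after each header; drop the "--" separators
    grep -A1 -F -f "$chunk" "$fasta" | grep -v '^--$' > "$out"
}

# Split the species list into pieces of lines_per_chunk lines and start
# one background worker per piece, then wait for all of them.
parallel_search() {
    local specieslist=$1 fasta=$2 lines_per_chunk=$3 workdir=$4
    local chunk
    split -l "$lines_per_chunk" "$specieslist" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        search_chunk "$chunk" "$fasta" "$workdir/hits.${chunk##*/}" &
    done
    wait   # block until every background worker has finished
}
```

A call might look like `parallel_search species1 big.fasta 10000 /tmp/work` (with `/tmp/work` already created), leaving one `hits.chunk.*` file per chunk. Every worker reads the whole FASTA file, so the gain probably depends on whether the disk or the CPU is the bottleneck. The loop also starts one process per chunk without any limit, which is only reasonable for a small number of chunks.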
On an Intel Xeon E5 it takes about 1 minute to check 10,000 species against a 6 GB fake data file. Increasing the species list to 100,000 was problematic even in chunks of 10,000, because I ran into disk problems from the many files being created and appended to in a single directory. The problems began once the species list exceeded 50,000; that number will vary on other systems. I modified the script to create 100 subdirectories and limit each directory to 1,000 files. This worked well and produced all 100,000 files without having to chunk either the species list or the 6 GB data file.

Also, to give you an idea of how fast grep is: it takes 6 seconds to match 100,000 species in the 6 GB file.

    specieslist=$1
    nspecies=$(wc -l "$specieslist" | cut -f1 -d' ')
    echo -e "grep $nspecies species from $specieslist\n"

    # grep is given no file operand, so the FASTA data is read from standard input
    grep -A1 -F -f "$specieslist" |
    awk '
    # skip context marker
    /^--$/{next}

    # process pair of lines
    # first line is matching species header line
    # species is semicolon-delimited field 7 of first line
    # second line is sequence
    # the header of the first match and every sequence line are written
    # to a file named after the sanitized species name
    {
        split($0,flds,";")
        species=flds[7]
        filekey=gensub(/\W/,".","g",species)   # gensub requires GNU awk
        file="fastaout." filekey
        if(!(filekey in outfiles)) {
            outfiles[filekey]=file
            printf("species \"%s\" outfile \"%s\" first match line %d: \"%s\"\n",species,file,NR,$0)
            print >file
        }
        getline; print >>file
        # close may be needed on systems where awk cannot juggle too many open files
        close(file)
    }
    '

    outfiles=(fastaout.*)
    noutfiles=${#outfiles[*]}
    echo -e "\ncreated $noutfiles fastaout.* files"
    head -5 fastaout*

The output and the slightly modified test input follow. The species list contains a few actual matches. The sequence lines of the FASTA file are prefixed with the lowercase species name, to verify correctness and to avoid matching the species a second time.

Output

    $ head out.*
    ==> out.Brachyspira_innocens <==
    brachyspira_innocens.1:-UC-U-G--G-U---------------------------
    brachyspira_innocens.2:-UC-U-G--G-U---------------------------

    ==> out.Methanohalophilus_portucalensis_FDF-1 <==
    methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------

    ==> out.Pucciniomycotina <==
    pucciniomycotina:-UC-U-G--G-U---------------------------

Species list

    Allobacillus_halotolerans
    Allochromatium_truperi
    Allochromatium_vinosum
    Methanohalophilus_portucalensis_FDF-1
    Brachyspira_innocens
    Pucciniomycotina

FASTA file
    HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina;.............................................................................
    pucciniomycotina:-UC-U-G--G-U---------------------------
    >EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
    brachyspira_innocens.1:-UC-U-G--G-U---------------------------
    >X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
    plagiochila_adiantoides:-UC-U-G--G-U---------------------------
    >AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
    citeromyces_siamensis:-UC-U-G--G-U---------------------------
    >AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
    methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
    >EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
    brachyspira_pulli:-UC-U-G--G-U---------------------------
    >AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
    lampropedia_hyalina:-UC-U-G--G-U---------------------------
    >EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
    brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
    >JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
    streptomyces_sp._qls01:-UC-U-G--G-U---------------------------
    >EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
    brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
    >JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
    streptomyces_sp._qls04:-UC-U-G--G-U---------------------------
    >AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
    trichosporon_debeurmannianum:-UC-U-G--G-U---------------------------
    >JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
    streptomyces_sp._qls11:-UC-U-G--G-U---------------------------
    >EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
    clostridium_sp._pml3-1:-UC-U-G--G-U---------------------------
    >U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
    Marinobacter_sp.:-UC-U-G--G-U---------------------------
    >EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
    methylobacterium_sp._cbmb38:-UC-U-G--G-U---------------------------
    >U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
    pseudomonas_sp.:-UC-U-G--G-U---------------------------
    >EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
    pantoea_sp._cbmb55:-UC-U-G--G-U---------------------------
    >EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
    brachyspira_innocens.2:-UC-U-G--G-U---------------------------
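The answer mentions spreading the output over 100 subdirectories of 1,000 files each, but the modified script is not shown. The following is one possible sketch of that idea, not the author's version: the function name `bucket_by_species`, the numbered directory scheme, and the `per_dir` parameter are assumptions. Unlike the script in the answer, it writes the header line for every match, always appends (so it should start from an empty output directory), and sanitizes names with the portable `gsub` instead of the GNU-only `gensub`.

```shell
#!/usr/bin/env bash
# Sketch only: function name, directory layout and per_dir are hypothetical.

# Reads header/sequence line pairs on stdin (for example the output of
# grep -A1 -F -f specieslist) and writes one file per species, starting a
# new numbered subdirectory after every per_dir species.
bucket_by_species() {
    local outroot=$1 per_dir=$2
    awk -v root="$outroot" -v per_dir="$per_dir" '
        /^--$/ { next }                      # skip grep context separators
        {
            header = $0
            split(header, flds, ";")
            key = flds[7]                    # same field the answer script uses
            gsub(/[^A-Za-z0-9_]/, ".", key)  # sanitize for use as a file name
            if (!(key in outfile)) {
                dir = sprintf("%s/%03d", root, int(nfiles / per_dir))
                # create the directory when its first file is about to be written
                if (nfiles % per_dir == 0) system("mkdir -p \"" dir "\"")
                outfile[key] = dir "/fastaout." key
                nfiles++
            }
            file = outfile[key]
            print header >> file
            if ((getline seq) > 0) print seq >> file
            close(file)                      # keep the number of open files small
        }
    '
}
```

It could be used as `grep -A1 -F -f species1 big.fasta | bucket_by_species out 1000`, which would place the first 1,000 species under `out/000`, the next 1,000 under `out/001`, and so on. Closing the file after every pair trades some speed for a small number of open file descriptors.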