perl应用:从NCBI提供的信息中获取需要的序列(上)use Arrays
发布时间:2020-12-15 21:02:47 所属栏目:大数据 来源:网络整理
导读:我们首先来看一下GenBank中序列的格式,我从http://www.ncbi.nlm.nih.gov/nuccore/JX118024.1中复制了这个信息到strawberry.gb中 LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit
我们首先来看一下GenBank中序列的格式,我从http://www.ncbi.nlm.nih.gov/nuccore/JX118024.1中复制了这个信息到strawberry.gb中 LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012 DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1) gene,partial cds; plastid. ACCESSION JX118024 VERSION JX118024.1 GI:402238751 KEYWORDS . SOURCE plastid Fragaria vesca subsp. americana ORGANISM Fragaria vesca subsp. americana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae; Fragariinae; Fragaria. REFERENCE 1 (bases 1 to 460) AUTHORS Njuguna,W.,Liston,A.,Cronn,R.,Ashman,T.L. and Bassil,N. TITLE Insights into phylogeny,sex function and age of Fragaria based on whole chloroplast genome sequencing JOURNAL Mol. Phylogenet. Evol. (2012) In press PUBMED 22982444 REMARK Publication Status: Available-Online prior to print REFERENCE 2 (bases 1 to 460) AUTHORS Njuguna,T.-L. and Bassil,N. TITLE Direct Submission JOURNAL Submitted (16-MAY-2012) Horticulture,Oregon State University,4017 ALS,Corvallis,OR 97331,USA COMMENT ##Assembly-Data-START## Assembly Method :: yasra v. 2009; mulan v. 2005 Sequencing Technology :: Illumina ##Assembly-Data-END## FEATURES Location/Qualifiers source 1..460 /organism="Fragaria vesca subsp. americana" /organelle="plastid" /mol_type="genomic DNA" /sub_species="americana" /db_xref="taxon:101019" gene complement(<1..>460) /gene="rpoC1" CDS complement(<1..>460) /gene="rpoC1" /codon_start=2 /transl_table=11 /product="RNA polymerase beta subunit" /protein_id="AFQ39140.1" /db_xref="GI:402238752" /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV" ORIGIN 1 tactaaaatg ggctggaacg cctgtatgcc taatctatgc agagtaggcg ccctattcag 61 caatacggga tgcccttgca taacttcctg aagtatttcc catacaaccg gctctttttc 121 ccgaatttta ctcttagcca ctcctatatt cgaggcaaaa tgctgcctaa ttaaaccacg 181 aattacaaat gtctggaaaa gctctattgc tatttcgcga ggcaatccac atcgatgtaa 241 tgaaagtgaa gggcctacaa caatgacaga acgccccgaa taatcgaccc gtttgccaag 301 cagagtctca cggaatcttc cctcttttcc ttcaattaca tctgaaaacg acttgtaaac 361 cttattatga ccgtccctca ttggttgtcc acggattcca ttatcaagaa gcgtatccac 421 agcttcttgt accaatttct cctgacacat gactaattct // 这是一个很短的序列,我们来观察特征前面都是注释,不是我们想要的信息。从ORIGIN开始的下一行就是我们要获得的信息,但是每一行有数字标识,为了方便查看碱基的个数,这个也不是我们需要的。序列的最后是// ? 这个也就是结束的标识。 我们写这个程序的主要方法就是: 第一:忽略ORIGIN上面的所有内容 第二:当遇到//的时候停止 第三:把我们获取序列中的所有的数字替换为空 第四:注释和序列之间的区别,我们用了一个标志符in_sequence.在没有遇到ORIGIN的时候,值为0,当遇到ORIGIN是我们把它的值赋值为1. 第五:foreach中判断语句的顺序,我们按照从后往前的顺序,也就是也判读是不是最后的结束行,然后看如果是序列,然后ORI一行,然后是注释。 下面的程序我们把上边的所有注释内容页保存到@annotation中去了,并且也做了输出。 use strict; use warnings; my @annotation=( ); my $sequence =''; my $filename =''; parse1(@annotation,$sequence,$filename); print @annotation; print_sequence($sequence,50); #parse1 sub parse1 { my ($annotation,$dna,$filename) = @_; my $in_sequence = 0; my @GenBankFile =( ); @GenBankFile =get_file_data($filename); foreach my $line(@GenBankFile) { #用来匹配最后一行 if($line=~/^//n/) { last; } elsif($in_sequence) { $$dna .=$line; } elsif($line=~/^ORIGIN/) { $in_sequence = 1; } else { push(@annotation,$line); } } #remove whitespace adn line numbers from DNA sequence $$dna =~ s/[s0-9]//g; } sub get_file_data { # A subroutine to get data from a file given its filename #读取文件的子序列 my $dna_filename; my @filedata; print "please input the Path just like this f:\perl\data.txtn"; chomp($dna_filename=<STDIN>); open(DNAFILENAME,$dna_filename)||die("can not open the file!"); @filedata = <DNAFILENAME>; close DNAFILENAME; return @filedata;#子函数的返回值一定要记住写 } sub print_sequence { # A subroutine to format and print sequence data my ($sequence,$length) = @_; for (my $pos =0; $pos<length($sequence);$pos+=$length) { print substr($sequence,$pos,$length),"n"; } } 结果如下: F:&;perla.pl please input the Path just like this f:perldata.txt f:perlstrawberry.gb LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012 DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1) gene,USA COMMENT ##Assembly-Data-START## Assembly Method :: yasra v. 2009; mulan v. 2005 Sequencing Technology :: Illumina ##Assembly-Data-END## FEATURES Location/Qualifiers source 1..460 /organism="Fragaria vesca subsp. americana" /organelle="plastid" /mol_type="genomic DNA" /sub_species="americana" /db_xref="taxon:101019" gene complement(<1..>460) /gene="rpoC1" CDS complement(<1..>460) /gene="rpoC1" /codon_start=2 /transl_table=11 /product="RNA polymerase beta subunit" /protein_id="AFQ39140.1" /db_xref="GI:402238752" /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV" tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc catacaaccggctctttttcccgaattttactcttagccactcctatatt cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat gactaattct F:&; 如果我们把输入的@annotion删除了,我们就只得到我们想要的信息了。结果如下: F:&;perla.pl please input the Path just like this f:perldata.txt f:perlstrawberry.gb tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc catacaaccggctctttttcccgaattttactcttagccactcctatatt cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat gactaattct F:&; (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |