加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 大数据 > 正文

submit assembly to NCBI

发布时间:2020-12-16 00:27:59 所属栏目:大数据 来源:网络整理
导读:二、submit assembly to NCBI 1、prepare data 首先要具有fasta格式数据(NO .gz),这是处理的基础,具体格式如下: Scaffold633TCATTTCTCCACTCTCGATGAACAAATCTGGAGGGATTTTTTTTCATTCCACTCAATAGGTTGTCTATAAAGGTGTGATTCGTGGAACTTCTTCACACAGCAGCTAGTCTATATAATACA

二、submit assembly to NCBI

1、prepare data

首先要具有fasta格式数据(NO .gz),这是处理的基础,具体格式如下:

>Scaffold633

TCATTTCTCCACTCTCGATGAACAAATCTGGAGGGATTTTTTTTCATTCC

ACTCAATAGGTTGTCTATAAAGGTGTGATTCGTGGAACTTCTTCACACAG

CAGCTAGTCTATATAATACAGAAGATCG

>Scaffold553

AAAAAATTTTTTTTTTAAACTATCATCTCATGGATCAGCAGCAATTCTGA

GTGTAACGTCTTCATTAAATGCGTATATAAATTTGCATAAAGATATGCGA

CCAATATTGAGCCTGGAAATATATGCGCAGAGTGCAAAATTGTGTTTTTT

GATCGGTTAATTAAAGG

>Scaffold641

GTTTCCCAGTAGGTCTCTCCCGCTACGGCGTCCGCACGAACGCGATCTGC

CCTCGTGCCCGCACCGCCATGACGGCAGAAGCCTTCGGCGAGAACAACAC

CGGCGTCGTCGGCCTCGATCCGCTTGCACCCGAGCGCGTCGCGACCCTGG

TCAGCTACCTCGCATCCCCCGATTCCGACGAGATCAACGGACAGGTCTTC

GTCGTCTACGGCAAGATGGTGGCGTTGATGGAAGCACCCAAGGTCGAGAA

CCGTTTCGACGCAGCCGGATCCGCGTTCACCGTCGAAGAACTCGGTGGCC

AGCTCTCGTCTTACTTCTCCGGCCGTGGGCCGTACGAGACCTACTGGGAA

AC

2、处理数据

分为几步:

(1)生成.greater,short.list和ZERO_BASE_COUNT文件

perl ?../ scaf_filter_2k.pl? Ascaris_suum.scaf.fa

scaf_filter_2k.pl代码

#!/usr/bin/perl

use strict;

use warnings;

 

my $file=shift;

#my $cutoff=shift;

my $outfile="short.list";

my $outfile2="$file.greater";

my $outfile3="ZERO_BASE_COUNT";

 

open IN,"< $file" or die $!;

open OUT,"> $outfile" or die $!;

open OUT1,"> $outfile2" or die $!;

open OUT2,"> $outfile3" or die $!;

$/='>';<IN>;$/="n";

while(<IN>){

        chomp;

        my $id=$1 if(/^(S+)/);

        $/='>';

        my $seq=<IN>;

        chomp($seq);

        $/="n";

        $seq=~s/s//g;

        my $len=length($seq);

        if ($len < 200){

                print  OUT "$idt$lenn";

                next;

        }

        else{

                my $a=$seq=~tr/aA/aA/;

                my $t=$seq=~tr/tT/tT/;

                my $c=$seq=~tr/cC/cC/;

                my $g=$seq=~tr/gG/gG/;

                if($a==0 || $c==0 || $g==0 || $t==0){

                        print OUT2 "$idt$lenn";

                        next;

                }

                print OUT1 ">$idn$seqn";

        }

}

close IN;

close OUT;

close OUT1;

close OUT2;

(2)生成.Nchange文件。

perl ../ info_N_plus.pl? Ascaris_suum.scaf.fa.greater? > Ascaris_suum.scaf.fa.greater.Nchange

info_N_plus.pl代码:

#!/usr/bin/perl -w

use strict;

#use Getopt::Long;

sub usage{

  print STDERR <<USAGE;

      ############################################

            Version 1.1 by Wing-L 2011.07.15

 

      usage: $0 <sequence.fa> <len> >STDOUT

 

      ############################################

USAGE

exit;

}

&usage if(@ARGV <1);

my ($fa,$len)=@ARGV;

$len||=10;

open IN,"<$fa" or die("$!n");

$/='>';<IN>;$/="n";

while(my $line=<IN>){

    my @block;

  $line=~/^S+/;

  my $tag={1};

  $/='>';

  my $seq=<IN>;

  chomp $seq;

  $/="n";

  $seq=~s/s//g;

  my $chr_length=length $seq;

  while($seq=~/[^N]N{1,9}[^N]/g){

    substr ($seq,$-[0]+1,$len)=~s/S/N/g;

  }

  while($seq=~/N([^N]{1,49})N/g){

      my $tlen=length $1;

      substr ($seq,$tlen)=~s/S/N/g;

  }

  if($seq=~/^N?[^N]{0,49}N+/){

      print STDERR "$tagt1t{1}[0]n";

    substr($seq,{1}[0]-$-[0])='';

  }

  if($seq=~/N+[^N]{0,49}N{0,}$/){

    print STDERR "$tagt$-[0]t$chr_lengthn";

    substr($seq,$-[0],$chr_length-$-[0])='';

  }

  print ">$tagn$seqn";

}

close IN;

#open IN,"<" or die("$!n");

#while(my $line=<IN>){}

#foreach my $e (){}

#(split /s+/,$line)[0]

#open OUT,">" or die("$!n");

###############  sub  ###############

(3)生成分割文件

perl? ../get_scaftig.pl? Ascaris_suum.scaf.fa.greater.Nchange? > Ascaris_suum.scaf.fa.greater.Nchange.split

get_scaftig.pl代码:

#!/usr/bin/perl -w

#

#Author: Ruan Jue <ruanjue@genomics.org.cn>

#

use warnings;

use strict;

 

my $min_length = 0;

my $name = '';

my $seq = '';

 

while(<>){

   if(/^>(S+)/){

      &print_scafftig($name,$seq) if($seq);

      $name = $1;

      $seq  = '';

   } else {

      chomp;

      $seq .= 
{1}

; }}&print_scafftig($name,$seq) if($seq); 1; sub print_scafftig { my $name = shift; my $seq = shift; my $temp = $seq; my $id = 1; my $flag = 0; my $pos = 1; while($seq=~/([ATGCatgc]+)/g){ my $s = $1; if($flag==1){ if($temp=~/([ATGCatgc]+[Nn]+)/g){ my $g = $1; $pos+=length($g); } } else{$flag=1;} next if(length($s) < $min_length); my $length=length($s); my $end=$pos+$length-1;# print ">$name_$id start=$pos length=".length($s)."n"; print ">$name_$idt$post$endt$lengthn"; while($s=~/(.{1,60})/g){ print "$1n"; } $id++;}}

(4)生成.agp文件

perl ../AGP.pl Ascaris_suum.scaf.fa.greater.Nchange.split > Ascaris_suum.scaf.fa.agp

AGP.pl 代码:

#!/usr/bin/perl

use strict;

 

my $fasta=shift;

 

my %Scaf;

open IN,$fasta or die "$!";

while(<IN>){

        if (/^>(S+)(_d+)s+(d+)s+(d+)s+(d+)/){

                my $scaf=$1;

                my $contig=$1.$2;

                my $start=$3;

                my $end=$4;

                my $len=$5;

                push @{$Scaf{$scaf}},[$contig,$start,$end,$len];

        }

}

close IN;

 

foreach my $id ( sort keys %Scaf ){

        @{$Scaf{$id}} = sort {$a->[1]<=>$b->[1]}  @{$Scaf{$id}};

}

 

foreach my $id ( sort { $Scaf{$b}[-1][2] <=> $Scaf{$a}[-1][2] } keys %Scaf ){

        my $p=$Scaf{$id};

        my $idx=1;

        for(my $i=0;$i<@$p;$i++){

                if ($i>0){

                        print join("t",$id,$p->[$i-1][2]+1,$p->[$i][1]-1,$idx,'N',($p->[$i][1]-1)-($p->[$i-1][2]+1)+1,'fragment','yes',"t")."n";

                        $idx++;

                }

                print join("t",$p->[$i][1],$p->[$i][2],'W',$p->[$i][0],1,$p->[$i][3],'+')."n";

                $idx++;

        }

}

至此为止,第一步准备数据的过程已经完成。下一步就是向3811提交任务。

这个过程需要的东西:

l? Ascaris_suum.scaf.fa.greater.Nchange.gz

l? Ascaris_suum.agp.gz

和NCBI沟通的文件,主要是确定的Project ID的准确性。

未完待续.........

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读