高通量测序分析工具Bedtools使用介绍
http://blog.genesino.com/2018/04/bedtools/ Bedtools是处理基因组信息分析的强大工具集合,其主要功能如下: bedtools: flexible tools for genome arithmetic and DNA sequence analysis. usage: bedtools <subcommand> [options] The bedtools sub-commands include: [ Genome arithmetic ] intersect Find overlapping intervals in various ways. 求区域之间的交集,可以用来注释peak,计算reads比对到的基因组区域 不同样品的peak之间的peak重叠情况。 window Find overlapping intervals within a window around an interval. closest Find the closest,potentially non-overlapping interval. 寻找最近但可能不重叠的区域 coverage Compute the coverage over defined intervals. 计算区域覆盖度 map Apply a function to a column for each overlapping interval. genomecov Compute the coverage over an entire genome. merge Combine overlapping/nearby intervals into a single interval. 合并重叠或相接的区域 cluster Cluster (but don't merge) overlapping/nearby intervals. complement Extract intervals _not_ represented by an interval file. 获得互补区域 subtract Remove intervals based on overlaps b/w two files. 计算区域差集 slop Adjust the size of intervals. 调整区域大小,如获得转录起始位点上下游3 K的区域 flank Create new intervals from the flanks of existing intervals. sort Order the intervals in a file. 排序,部分命令需要排序过的bed文件 random Generate random intervals in a genome. 获得随机区域,作为背景集 shuffle Randomly redistrubute intervals in a genome. 根据给定的bed文件获得随机区域,作为背景集 sample Sample random records from file using reservoir sampling. spacing Report the gap lengths between intervals in a file. annotate Annotate coverage of features from multiple files. [ Multi-way file comparisons ] multiinter Identifies common intervals among multiple interval files. unionbedg Combines coverage intervals from multiple BEDGRAPH files. [ Paired-end manipulation ] pairtobed Find pairs that overlap intervals in various ways. pairtopair Find pairs that overlap other pairs in various ways. [ Format conversion ] bamtobed Convert BAM alignments to BED (& other) formats. bedtobam Convert intervals to BAM records. bamtofastq Convert BAM records to FASTQ records. bedpetobam Convert BEDPE intervals to BAM records. bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals. [ Fasta manipulation ] getfasta Use intervals to extract sequences from a FASTA file. 提取给定位置的FASTA序列 maskfasta Use intervals to mask sequences from a FASTA file. nuc Profile the nucleotide content of intervals in a FASTA file. [ BAM focused tools ] multicov Counts coverage from multiple BAMs at specific intervals. tag Tag BAM alignments based on overlaps with interval files. [ Statistical relationships ] jaccard Calculate the Jaccard statistic b/w two sets of intervals. 计算数据集相似性 reldist Calculate the distribution of relative distances b/w two files. fisher Calculate Fisher statistic b/w two feature files. [ Miscellaneous tools ] overlap Computes the amount of overlap from two intervals. igv Create an IGV snapshot batch script. 用于生成一个脚本,批量捕获IGV截图 links Create a HTML page of links to UCSC locations. makewindows Make interval "windows" across a genome. 把给定区域划分成指定大小和间隔的小区间 (bin) groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy") 分组结算,不只可以用于bed文件。 expand Replicate lines based on lists of values in columns. split Split a file into multiple files with equal records or base pairs. 安装bedtools(Linux - Conda软件安装方法) ct@ehbio:~$ conda install bedtools 获得测试数据集(http://quinlanlab.org/tutorials/bedtools/bedtools.html)ct@ehbio:~$ mkdir bedtools ct@ehbio:~$ cd bedtools ct@ehbio:~$ url=https://s3.amazonaws.com/bedtools-tutorials/web ct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgz ct@ehbio:~/bedtools$ curl -O ${url}/cpg.bed ct@ehbio:~/bedtools$ curl -O ${url}/exons.bed ct@ehbio:~/bedtools$ curl -O ${url}/gwas.bed ct@ehbio:~/bedtools$ curl -O ${url}/genome.txt ct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed 交集 (intersect)查看输入文件, 包括), (1-based,不包括)。第四列一般为区域名字,第五列一般为空,第六列为链的信息。更详细解释见http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1。 自己做研究CpG岛信息可以从UCSC的Table Browser获得,具体操作见http://blog.genesino.com/2013/05/ucsc-usages/。 ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed ==> cpg.bed <== chr1 28735 29810 CpG:_116 chr1 135124 135563 CpG:_30 chr1 327790 328229 CpG:_29 ==> exons.bed <== chr1 11873 12227 NR_046018_exon_0_0_chr1_11874_f 0 + chr1 12612 12721 NR_046018_exon_1_0_chr1_12613_f 0 + chr1 13220 14409 NR_046018_exon_2_0_chr1_13221_f 0 + 获得重叠区域(既是外显子,又是CpG岛的区域) ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5 chr1 29320 29370 CpG:_116 chr1 135124 135563 CpG:_30 chr1 327790 328229 CpG:_29 chr1 327790 328229 CpG:_29 chr1 327790 328229 CpG:_29 输出重叠区域对应的原始区域(与外显子存在交集的CpG岛) ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb > | head -5
计算重叠碱基数 ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10
计算第一个(-a)bed区域有多少个重叠的第二个(-b)bed文件中有多少个区域 ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | head chr1 28735 29810 CpG:_116 1 chr1 135124 135563 CpG:_30 1 chr1 327790 328229 CpG:_29 3 chr1 437151 438164 CpG:_84 0 chr1 449273 450544 CpG:_99 0 chr1 533219 534114 CpG:_94 0 chr1 544738 546649 CpG:_171 0 chr1 713984 714547 CpG:_60 1 chr1 762416 763445 CpG:_115 10 chr1 788863 789211 CpG:_28 9 另外还有
同时对多个区域求交集 (可以用于peak的多维注释) # -names标注注释来源 # -sorted: 如果使用了这个参数,提供的一定是排序好的bed文件 ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed -b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm | head -10000 | tail -10
合并区域
只需要输入一个排序好的bed文件,默认合并重叠或邻接区域。 ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5 chr1 11873 12227 chr1 12612 12721 chr1 13220 14829 chr1 14969 15038 chr1 15795 15947 合并区域并输出此合并后区域是由几个区域合并来的 ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5 chr1 11873 12227 1 chr1 12612 12721 1 chr1 13220 14829 2 chr1 14969 15038 1 chr1 15795 15947 1 合并相距 # -c: 指定对哪些列进行操作 # -o: 与-c对应,表示对指定列进行哪些操作 # 这里的用法是对第一列做计数操作,输出这个区域是由几个区域合并来的 # 对第4列做收集操作,记录合并的区域的名字,并逗号分隔显示出来 ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4 chr1 11873 12227 1 NR_046018_exon_0_0_chr1_11874_f chr1 12612 12721 1 NR_046018_exon_1_0_chr1_12613_f chr1 13220 15038 3 NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_r chr1 15795 15947 1 NR_024540_exon_2_0_chr1_15796_r 计算互补区域给定一个全集,再给定一个子集,求另一个子集。比如给定每条染色体长度和外显子区域,求非外显子区域。给定基因区,求非基因区。给定重复序列,求非重复序列等。 重复序列区域的获取也可以用上面提供的链接 http://blog.genesino.com/2013/05/ucsc-usages/。 ct@ehbio:~/bedtools$ head genome.txt chr1 249250621 chr10 135534747 chr11 135006516 chr11_gl000202_random 40103 chr12 133851895 chr13 115169878 chr14 107349540 chr15 102531392 ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5 chr1 0 11873 chr1 12227 12612 chr1 12721 13220 chr1 14829 14969 chr1 15038 15795 基因组覆盖广度和深度计算基因组某个区域是否被覆盖,覆盖深度多少。有下图多种输出格式,也支持RNA-seq数据,计算junction-reads覆盖。
# 对单行FASTA,可如此计算 # 如果是多行FASTA,则需要累加 ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="t"}{ if($0~/>/) {seq_name=$0;sub(">","",seq_name);} else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt chr1 60001 chr2 54001 chr3 54001 chr4 60001 ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga -g ../bio/genome.txt | head # 这个warning很有意思,因为BAM中已经有这个信息了,就不需要提供了 ***** *****WARNING: Genome (-g) files are ignored when BAM input is provided. ***** # bedgraph文件,前3列与bed相同,最后一列表示前3列指定的区域的覆盖度。 chr1 0 11 0 chr1 11 17 1 chr1 17 20 2 chr1 20 31 3 chr1 31 36 4 chr1 36 43 6 chr1 43 44 7 chr1 44 46 8 chr1 46 48 9 chr1 48 54 10 两个思考题:
ct@ehbio:~/bedtools$ bedtools jaccard -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed intersection union-intersection jaccard n_intersections 81269248 160493950 0.50637 130852 小思考:1. 如何用bedtools其它工具算出这个结果?2. 如果需要比较的文件很多,怎么充分利用计算资源? 一个办法是使用 双层嵌套。这种用法也很常见,不管是单层还是双层for循环,都有利于简化重复运算。 ct@ehbio:~/bedtools$ for i in *.merge.bed; do for j in *.merge.bed; do bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$it$jt/"; done; done >total.similarity 另一个办法是用 root@ehbio:~# yum install parallel.noarch # parallel 后面双引号("")内的内容为希望用parallel执行的命令, # 整体写法与Linux下命令写法一致。 # 双引号后面的 三个相邻冒号 (:::)默认用来传递参数的,可多个连写。 # 每个三冒号后面的参数会被循环调用,而在命令中的引用则是根据其出现的位置,分别用{1},{2} # 表示第一个三冒号后的参数,第二个三冒号后的参数。 # # 这个命令可以替换原文档里面的整合和替换,相比于原文命令生成多个文件,这里对每个输出结果 # 先进行了比对信息的增加,最后结果可以输入一个文件中。 # ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 | sed 's/^/{1}t{2}t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed` >totalSimilarity.2 # 上面的命令也有个小隐患,并行计算时的输出冲突问题,可以修改为输出到单个文件,再cat到一起 ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 | sed 's/^/{1}t{2}t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed` ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2 # 替换掉无关信息 ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//' -e 's/.hg19//' totalSimilarity.2 原文档的命令,稍微有些复杂,利于学习不同命令的组合。使用时推荐使用上面的命令。 ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR>1' | cut -f 3 > {1}.{2}.jaccard" ::: `ls *.merge.bed` ::: `ls *.merge.bed` This command will create a single file containing the pairwise Jaccard measurements from all 400 tests. find . | grep jaccard | xargs grep "" | sed -e s"/.///" | perl -pi -e "s/.bed./.bedt/" | perl -pi -e "s/.jaccard:/t/" > pairwise.dnase.txt A bit of cleanup to use more intelligible names for each of the samples. cat pairwise.dnase.txt | sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' | sed -e 's/.hg19//g' > pairwise.dnase.shortnames.txt Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R. awk 'NF==3' pairwise.dnase.shortnames.txt | awk '$1 ~ /^f/ && $2 ~ /^f/' | python make-matrix.py > dnase.shortnames.distance.matrix 在广大粉丝的期待下,《生信宝典》联合《宏基因组》于2018年4月14在北京鼓楼推出《ChIP-seq分析专题培训》,为大家提供一条走进生信大门的捷径、为同行提供一个ChIP-seq实战分析学习和交流的机会、助力学员真正理解分析原理和完成实战分析, 独创线下集中授课2天+自行练习5天+再集中讲解答疑2天+后期学习群的四段式教学,并提供学习视频,教、学、练、答结合,真正实现独立分析大数据。 关于学习生物信息学分析的重要性,请阅读《生物信息9天速成班—成为团队中不可或缺的人》。 ChIP-seq基本分析流程见流程。课程将在这个基础上,提供更深入地分析指导。 课程介绍
(编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |