bash – parallel while loops with split files as input

Published: 2020-12-15 21:39:44  Category: Security  Source: compiled from the web


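I'm stuck. My code has a while-read loop that takes a very long time, and I would like to run it on many processors. I want to split the input file and run 14 loops in parallel (because I have 14 threads), one per split file. The problem is that I don't know how to tell each while loop which file it should read.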

For example, in a regular while-read loop I would write:

while read line
do
   <some code>
done < input file or variable...

In this case, however, I want to split the input file above into 14 files and run 14 loops in parallel, one per split file.
I tried:

split -n 14 input_file
find . -name "xa*" | 
        parallel -j 14 | 
        while read line
        do
        <lot of stuff>
        done

I also tried:

split -n 14 input_file
function loop {
            while read line
            do
                <lot of stuff>
            done
}
export -f loop
parallel -j 14 ::: loop

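In neither case could I tell the loop which file is its input, so that parallel would understand "feed each xa* file into its own loop, in parallel".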
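One way to make that connection is to pass the chunk's file name to the function and redirect the loop's input from it. The following is only a sketch under assumptions: the function name `loop` is taken from the attempt above, but the `chunk_` and `out_` prefixes, the three-chunk count and the placeholder loop body are illustrative. `split -n l/N` is the GNU coreutils form that splits on line boundaries, whereas plain `-n N` splits by bytes and can cut a line in two.

```shell
# Demo input: the sample IDs from the question, in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' AEYS01000010.10484.12283 CVJT01000011.50.2173 \
    KF625180.1.1799 KT949922.1.1791 LOBZ01000025.54942.57580 > input_file

# The loop receives its input file as the first argument.
loop() {
    while IFS= read -r line
    do
        printf 'processed %s\n' "$line"   # placeholder for the real work
    done < "$1"
}

# l/3 = three chunks without breaking lines (l/14 would give 14 chunks).
split -n l/3 input_file chunk_

# One background loop per chunk, each writing to its own output file.
for f in chunk_*
do
    loop "$f" > "out_${f#chunk_}" &
done
wait    # block until every background loop has finished

# Merge the per-chunk outputs.
result=$(cat out_*)
printf '%s\n' "$result"
```

With GNU parallel installed and bash as the shell, the `for` loop could presumably be replaced by `export -f loop` followed by `parallel -j 14 loop ::: chunk_*`, which runs `loop` once per chunk file and passes the file name as its argument.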

Example of the input file (a list of strings):

AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

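Edit

Here is the code.
The output is a table (741100 lines) with some statistics about DNA sequence alignments that have already been made.
The loop takes an input_file (no broken lines, anywhere from 500 to ~45000 lines, 800Kb) of DNA sequence accessions, reads it line by line and looks up the complete taxonomy for each one in a databank (~45000 lines). Then it does a few sums and divisions. The output is a .tsv that looks like this (example for the sequence "KF625180.1.1799"):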

Rate of taxonomies for this sequence in %:        KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment     num_ocurrences_in_databank    %alingment/databank
D_6__Bacillus_atrophaeus   50%     1       20      5%
D_6__Bacillus_amyloliquefaciens    50%     1       154     0.649351%



$head input file  
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

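Two additional files are also used inside the loop. They are not the loop input.
1) A file called alnout_file, which only serves to find the hits (or alignments) of a given sequence against the databank. It is also produced outside this loop. Its length varies from hundreds to thousands of lines. Only columns 1 and 2 matter here: column 1 is the name of the sequence, and column 2 is the name of each databank sequence it matched. It looks like this: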

$head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0   431     0       0       1       431     1       431     -1      0
KF625180.1.1799 KP143082.1.1457 99.3    431     1       2       1       431     1       429     -1      0
KP143082.1.1457 KF625180.1.1799 99.3    431     1       2       1       429     1       431     -1      0

2) A databank .tsv file containing ~45000 taxonomies that correspond to the DNA sequences. Each taxonomy is on one line:

$head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
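To make the relationship between the two auxiliary files concrete, here is a minimal sketch that lists the taxonomy of every databank sequence hit by one query. It assumes whitespace-separated columns and an exact match on column 1 of alnout_file, which is stricter than the substring match a plain `grep` performs. The sample rows are copied from the excerpts above, and the scratch directory and the `query` and `taxa` variable names are illustrative.

```shell
workdir=$(mktemp -d)

# Sample rows from alnout_file (column 1 = query, column 2 = databank hit).
cat > "$workdir/alnout_file" <<'EOF'
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
EOF

# Sample rows from the taxonomy databank (column 1 = ID, column 2 = taxonomy).
cat > "$workdir/taxonomy.file.tsv" <<'EOF'
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
EOF

query=KF625180.1.1799

# First file (taxonomy): remember the taxonomy of every databank ID.
# Second file (alignments): for rows whose column 1 is the query,
# print the hit ID and its taxonomy.
taxa=$(awk -v q="$query" '
    NR == FNR { tax[$1] = $2; next }
    $1 == q   { print $2 "\t" tax[$2] }
' "$workdir/taxonomy.file.tsv" "$workdir/alnout_file")

printf '%s\n' "$taxa"
rm -rf "$workdir"
```

For this query the sketch reports two hits, one per taxonomy, which is consistent with the 50% / 50% split in the example output.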

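So, take the sequence KF625180.1.1799. I previously aligned it against a databank of ~45000 other DNA sequences and got an output listing all the sequences it matched. What the loop does is find the taxonomies of all those sequences and compute the "statistics" I mentioned earlier. The code does this for every one of my DNA sequences.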

TAXONOMY=path/taxonomy.file.tsv
while read -r line
do
        # find hits
        hits=$(grep "$line" alnout_file | cut -f 2)
        completename=$(grep "$line" "$TAXONOMY" | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"

        # find hits and calculate the frequency (%) of each taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "${hits[@]}" "$TAXONOMY" |
                cut -f 2 |
                awk '{a[$0]++} END {for (i in a) {print i,"\t",a[i]/NR*100,a[i]}}' |
                sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' |
                sort -k2 -hr)

        # count the occurrences of each taxonomy in the databank
        freqBank=$(while read -r line; do grep -c "$line" "$TAXONOMY"; done < <(echo "$freqHits" | cut -f 1))
        # print cols with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,$2"%",$3,$4,$3/$4*100"%"}'

done < input_file

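That is a lot of greps and parsing, so on one processor it takes about 12 hours to get through all 45000 DNA sequence accessions. I would like to split input_file and run the loop on all the processors I have (14), since that would cut the time down.
Thank you all for being so patient with me =)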

Solution

As an alternative, I threw together a quick test.

#!/usr/bin/env bash
mkfifo PIPELINE             # create a single queue
cat "$1" > PIPELINE &       # supply it with records
{ declare -i cnt=0 max=14
  while (( ++cnt <= max ))  # spawn loop creates worker jobs
  do printf -v fn "%02d" $cnt
     while read -r line     # each work loop reads common stdin...
     do echo "$fn:[$line]"
        sleep 1
     done >$fn.log 2>&1 &   # these run in background in parallel
  done                      # this one exits
} < PIPELINE                # *all* read from the same queue
wait
cat [0-9][0-9].log

No split required, but it does need mkfifo.

Obviously, replace the code inside the inner loop with your own.
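As an illustration of that last point, here is a variation on the script above with the `echo`/`sleep` body moved into a function. It is a sketch rather than a drop-in replacement: `process_line`, the scratch directory, the four-worker count and the use of file descriptor 3 (so that each worker gets an explicit input redirection) are assumptions made for the example.

```shell
# Hypothetical stand-in for the real per-line work.
process_line() {
    printf 'done:%s\n' "$1"
}

workdir=$(mktemp -d)
printf '%s\n' AEYS01000010.10484.12283 CVJT01000011.50.2173 \
    KF625180.1.1799 KT949922.1.1791 LOBZ01000025.54942.57580 \
    > "$workdir/input_file"

mkfifo "$workdir/queue"                          # single shared queue
cat "$workdir/input_file" > "$workdir/queue" &   # feed it with records
exec 3< "$workdir/queue"                         # open the read end once

workers=4
i=1
while [ "$i" -le "$workers" ]
do
    # Each worker pulls the next unread line from the shared queue.
    while IFS= read -r line
    do
        process_line "$line"
    done <&3 > "$workdir/worker_$i.log" &
    i=$((i + 1))
done
exec 3<&-    # the parent no longer needs the read end
wait         # wait for the feeder and all workers

# Collect the per-worker logs; the order across workers is not deterministic.
result=$(cat "$workdir"/worker_*.log | sort)
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because all workers read from the same open pipe, each input line is consumed by exactly one of them, and a slow record does not hold up the lines queued behind it.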

(Editor: 李大同)
