bash – running a while-read loop in parallel over split input files
I'm stuck on this. I have a while-read loop in my code that takes a very long time, and I want to run it across many processors. So I want to split the input file and run 14 loops (because I have 14 threads), one per split file, in parallel. The thing is, I don't know how to tell each while loop which file to take and work on.
For example, in a regular while-read loop I would code:

    while read line
    do
        <some code>
    done < input_file   # or a variable...

But in this case I want to split the input file above into 14 files and run 14 loops in parallel, one per split file. I tried:

    split -n 14 input_file
    find . -name "xa*" | parallel -j 14 | while read line
    do
        <lot of stuff>
    done

and also:

    split -n 14 input_file
    function loop {
        while read line
        do
            <lot of stuff>
        done
    }
    export -f loop
    parallel -j 14 ::: loop

But in neither case could I tell the loop which file is its input, so that parallel would understand "feed each xa* file into its own loop instance".

An example of the input file (a list of strings):

    AEYS01000010.10484.12283
    CVJT01000011.50.2173
    KF625180.1.1799
    KT949922.1.1791
    LOBZ01000025.54942.57580

EDIT

Here is the code. For one sequence it prints output like:

    Rate of taxonomies for this sequence in %:    KF625180.1.1799    D_6__Bacillus_atrophaeus
    Taxonomy    %aligned    number_ocurrences_in_the_alignment    num_ocurrences_in_databank    %alingment/databank
    D_6__Bacillus_atrophaeus          50%    1    20     5%
    D_6__Bacillus_amyloliquefaciens   50%    1    154    0.649351%

    $ head input_file
    AEYS01000010.10484.12283
    CVJT01000011.50.217
    KF625180.1.1799
    KT949922.1.1791
    LOBZ01000025.54942.57580

Two additional files are also used inside the loop. They are not the loop input:

1) an alignment output file:

    $ head alnout_file
    KF625180.1.1799    KF625180.1.1799    100.0    431    0    0    1    431    1    431    -1    0
    KF625180.1.1799    KP143082.1.1457    99.3     431    1    2    1    431    1    429    -1    0
    KP143082.1.1457    KF625180.1.1799    99.3     431    1    2    1    429    1    431    -1    0

2) a database .tsv file containing the taxonomies of ~45000 DNA sequences, one taxonomy per line:

    $ head taxonomy.file.tsv
    KP143082.1.1457    D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
    KF625180.1.1799    D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus

So, given a sequence such as KF625180.1.1799: I previously aligned it against a database of ~45000 other DNA sequences and got an output listing all the sequences that matched it. What the loop does is find the taxonomy of each of those matching sequences and calculate the "statistics" I mentioned above. The code does this for all of my DNA sequences.
    TAXONOMY=path/taxonomy.file.tsv
    while read line
    do
        #find hits
        hits=$(grep $line alnout_file | cut -f 2)
        completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"

        #find hits and calculate the frequence (%) of the taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "${hits[@]}" $TAXONOMY | cut -f 2 | awk '{a[$0]++} END {for (i in a) {print i,"\t",a[i]/NR*100,a[i]}}' | sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | sort -k2 -hr)

        # print frequence of each taxonomy in the databank
        freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))

        #print cols with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,$2"%",$3,$4,$3/$4*100"%"}'
    done < input_file

It's a lot of greps and parsing, so running it on one processor takes about 12 hours to get through all 45000 DNA sequences. I'd like to split input_file and run it on all 14 processors I have, because that would shorten the time.

Solution
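Before the FIFO approach below, here is a minimal sketch of the split-and-fan-out idea from the question, using plain background jobs instead of GNU parallel. Assumptions: GNU split (for `-n l/N`), and the `echo` inside `loop` is a stand-in for the real loop body; the key point is that each worker reads from its own file, passed in as "$1".

```shell
#!/usr/bin/env bash
# Sketch: split the input, start one background while-read loop per piece,
# then wait for all of them to finish.
set -e
cd "$(mktemp -d)"                  # work in a scratch directory
printf '%s\n' A B C D E F > input_file
split -n l/3 input_file part.      # line-based split into part.aa, part.ab, part.ac

loop() {
    while read -r line; do
        echo "$line" >> "$1.out"   # placeholder for the real work
    done < "$1"                    # each worker reads its OWN split file
}

for f in part.??; do
    loop "$f" &                    # one background job per split file
done
wait                               # block until every worker is done
cat part.??.out
```

With GNU parallel the same function can be driven as `export -f loop; parallel -j 14 loop ::: part.??` (export -f is bash-specific). That is the piece missing from the attempts above: parallel needs the split-file names as arguments after `:::`, and the function must read from its "$1".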
Just to offer an alternative, I put together a quick test.
    #! /bin/env bash
    mkfifo PIPELINE                    # create a single queue
    cat "$1" > PIPELINE &              # supply it with records
    { declare -i cnt=0 max=14
      while (( ++cnt <= max ))         # spawn loop creates worker jobs
      do printf -v fn "%02d" $cnt
         while read -r line            # each work loop reads common stdin...
         do echo "$fn:[$line]"
            sleep 1
         done >$fn.log 2>&1 &          # these run in background in parallel
      done                             # this one exits
    } < PIPELINE                       # *all* read from the same queue
    wait
    cat [0-9][0-9].log

No splitting is needed, but it does require mkfifo. Obviously, change the code inside the inner loop.
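A related pattern that also avoids splitting the input, for readers without mkfifo or GNU parallel, is xargs -P, which fans input lines out to N concurrent workers. A minimal sketch (assumptions: an xargs that supports -P, such as GNU xargs; the inline `echo` is a stand-in for the real loop body):

```shell
#!/usr/bin/env bash
# Sketch: xargs -P runs up to 3 workers at once, one input line per
# invocation (-n 1), with no FIFO and no pre-splitting.
set -e
cd "$(mktemp -d)"                  # work in a scratch directory
printf '%s\n' A B C D E F > input_file

# Each worker appends to its own log (results.$$, expanded inside the
# worker shell, so $$ differs per process) to avoid interleaved writes.
xargs -P 3 -n 1 sh -c 'echo "got $1" >> "results.$$"' _ < input_file

cat results.* | sort               # collect the per-worker logs
```

Like the FIFO version, the workers' outputs arrive in arbitrary order, so any per-worker output should go to separate files that are merged once everything has finished.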