Running a limited number of child processes in parallel in bash?

Published: 2020-12-15 19:32:13. Category: Security. Source: collected from the web.


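I have a large set of files that need some heavy processing.
The processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job), and takes a few minutes to run.
My current use case is starting a hadoop job on the input data, but I have run into the same problem in other situations before.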

To make full use of the available CPU power, I want to be able to run several of these processing tasks in parallel.

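However, a very simple shell script like the following starts every job at once and wrecks system performance through excessive load and swapping: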

find . -type f | while IFS= read -r name
do
   # every file gets its own background job immediately, with no upper limit
   some_heavy_processing_command "${name}" &
done

So what I want is essentially something similar to what "gmake -j4" does.

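I know bash supports the "wait" command, but that only waits until all child processes have finished. In the past I have written scripts that run a "ps" command and then grep for the child processes by name (yes, I know... ugly).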

What is the simplest/cleanest/best solution to do what I want?

Edit: Thanks to Frederik: yes, this is indeed a duplicate of "How to limit number of threads used in a function in bash".
"xargs --max-procs=4" works like a charm.
(So I am voting to close my own question.)

#! /usr/bin/env bash

set -o monitor
# enable job control: background jobs run in their own process groups
trap add_next_job CHLD
# run add_next_job each time a child process exits

todo_array=($(find . -type f))
# collect the file names into an array
# (note: names containing whitespace are split into several entries)

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, start the next one
    if [[ $index -lt ${#todo_array[*]} ]]
    # ${#todo_array[*]} is the number of array elements;
    # the hash inside the braces does not start a comment
    then
        echo "adding job ${todo_array[$index]}"
        do_job "${todo_array[$index]}" &
        # replace the line above with the command you want
        index=$((index + 1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
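# (the second test stops the loop when there are fewer files than max_jobs)
while [[ $index -lt $max_jobs && $index -lt ${#todo_array[*]} ]]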
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"
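
The same throttling idea can be sketched without a CHLD trap, assuming bash 4.3 or newer, where the `wait` builtin accepts `-n` and returns as soon as any one background job exits. The names `run_pool` and the stand-in `do_job` body below are hypothetical and only serve the illustration; this is a minimal sketch, not a drop-in replacement for the script above.

```shell
#!/usr/bin/env bash
# Requires bash 4.3+ for `wait -n`.

# Hypothetical stand-in for the real per-item work.
do_job() {
    sleep 0.1
    echo "finished: $1"
}

# run_pool MAX ITEM...
# Runs do_job once per ITEM, keeping at most MAX jobs in the background.
run_pool() {
    local max=$1
    shift
    local running=0 item
    for item in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n                      # block until any one job exits
            running=$((running - 1))
        fi
        do_job "$item" &
        running=$((running + 1))
    done
    wait                                 # wait for the remaining jobs
}
```

A call such as `run_pool 2 a b c d` would start two jobs, then start another each time one finishes. The counter is only an approximation of the number of live jobs, but it can never allow more than MAX new jobs to be started without a matching `wait -n`.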

That said, Fredrik makes the excellent point that xargs does exactly what you want...
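
As a rough sketch of that xargs approach: `-P` is the short form of GNU xargs' `--max-procs`, `-n 1` passes one file per invocation, and `-print0` together with `-0` keeps file names containing spaces intact. The wrapper name `run_limited` is hypothetical, and the command to run is whatever you pass in; this assumes a find and xargs that support `-print0`, `-0` and `-P` (GNU and BSD versions do).

```shell
#!/usr/bin/env bash

# run_limited DIR MAX COMMAND [ARG...]
# Runs COMMAND [ARG...] FILE for every regular file under DIR,
# with at most MAX invocations running at the same time.
run_limited() {
    local dir=$1 max=$2
    shift 2
    find "$dir" -type f -print0 |
        xargs -0 -n 1 -P "$max" "$@"
}
```

For the question's case this would be invoked as `run_limited . 4 some_heavy_processing_command`, which gives roughly the "gmake -j4" behaviour asked for. Output from concurrent jobs may appear in any order.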

(Editor: 李大同)
