uima-user mailing list archives

From Simon Hafner <reactorm...@gmail.com>
Subject Re: DUCC stuck at WaitingForResources on an Amazon Linux
Date Sat, 15 Nov 2014 01:11:49 GMT
So to run effectively, I would need more memory, because the job wants
two shares? ... Yes, with a larger node it works. What would be a
reasonable memory size for a DUCC node?

2014-11-14 9:38 GMT-06:00 Lou DeGenaro <lou.degenaro@gmail.com>:
> Simon,
>
> Congratulations!  You found a bug in DUCC's Web Server.  It was incorrectly
> rounding up when reporting the number of shares for a machine.  This issue
> is addressed by Jira 4104 <https://issues.apache.org/jira/browse/UIMA-4104>.
>
> Lou.
>
> On Fri, Nov 14, 2014 at 7:49 AM, Jim Challenger <challngr@gmail.com> wrote:
>
>> Simon,
>>     It looks like the problem is the amount of RAM on your machine. It's
>> going to be hard to get any meaningful work running on less than about 8G.
>>
>>     Here's what to do to get the test job to run on your 4G machine:
>>     1.  In the resources folder, edit ducc.properties and change this:
>>               ducc.jd.host.memory.size=2GB
>>          to this:
>>               ducc.jd.host.memory.size=1GB
>>
>>          This is the amount of RAM that DUCC reserves for itself to manage
>> its "head" processes.
>>
>>     2.  In the examples/simple folder, edit 1.job and change this:
>>              process_memory_size            2
>>          to this:
>>              process_memory_size            1
>>
>>          This is the amount of memory in GB that the sample 1.job is
>> requesting.
>>
>>      3.  Stop DUCC and restart it so the DUCC processes pick up the new
>> ducc.jd.host.memory.size value from the edited ducc.properties.
>>
>>      4.  Rerun 1.job and all should be well.
>>
>>       Here are the gory details from the RM log, if you're interested.  In
>> the RM log, I see these lines.
>>
>> 13 Nov 2014 22:04:14,909  INFO RM.NodePool - queryMachines     N/A
>>                  Name       Order Active Shares Unused Shares Memory (MB) Jobs
>> --------------------        ----- ------------- ------------- ----------- ------ ...
>> .us-west-2.compute.internal     3 2             1        3955 7 [1]
>>
>> This says you have 3GB of **usable-by-ducc** RAM, of which 2GB are used by
>> the reservation/job "7", and that you have 1GB free.  The reason you have
>> only 3GB **usable** is that the hardware/opsys usually reserves a small
>> part of the installed RAM for itself, so the reported RAM (3955 MB here)
>> is a tad smaller than the installed 4GB and yields only three whole
>> shares.  To avoid overcommitting the system, we use the reported value,
>> not the installed value.  Most or all of the jobs here will easily
>> overwhelm even the largest machines if we don't do this.
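A minimal sketch of the rounding described above. It assumes a share quantum of 1 GB, which is inferred from the log line (3955 MB reported, order 3) rather than stated in the thread; the function name is illustrative and not part of DUCC.

```python
def usable_shares(reported_mb: int, quantum_gb: int = 1) -> int:
    """Whole shares that fit in the reported memory, rounding down.

    quantum_gb is an assumed share size, not a value read from DUCC.
    """
    quantum_mb = quantum_gb * 1024
    return reported_mb // quantum_mb


# 3955 MB is what the node in the RM log reports.
print(usable_shares(3955))  # 3
```

Rounding down rather than up is what keeps the scheduler from promising a fourth share that the node cannot actually back with memory.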
>>
>> Next,  these lines show the actual schedule the RM is trying to build.
>> Dormant:
>>             ID                        JobName       User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst
>>    J_________8                     Test_job_1       ducc normal      0  2       0   2      2     15       15 true         8
>>
>> Reserved:
>>             ID                        JobName       User Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst
>>    R_________7                     Job_Driver     System JobDriver      1    2       2   0      2      0        0 0         1
>>
>> This confirms that the DUCC reservation "7" occupies 2GB, and that job "8"
>> is requesting 2GB but is "dormant", i.e. waiting for resources.  Since only
>> 1GB of the 3GB usable on this machine is free, job 8 will wait.
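The wait can be checked with simple arithmetic. The sketch below is not the RM's actual scheduling algorithm, only the fit test implied by the explanation; the function name is hypothetical, and the numbers come from the log and from the 1GB settings suggested in steps 1 and 2.

```python
def fits(total_shares: int, reserved: int, requested: int) -> bool:
    """True if the request fits in the shares left after reservations."""
    return requested <= total_shares - reserved


# From the log: 3 usable shares, the Job Driver holds 2, job 8 wants 2.
print(fits(3, 2, 2))  # False, so the job stays dormant

# With ducc.jd.host.memory.size=1GB and process_memory_size 1.
print(fits(3, 1, 1))  # True
```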
>>
>> Best,
>> Jim
