accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Losing tservers - Unusually high Last Contact times
Date Wed, 21 May 2014 17:09:50 GMT
On 5/21/14, 12:00 PM, thomasa wrote:
> Increasing the timeout settings helped a little, but when I tried to increase
> the number of map tasks for the workers I ran into instability issues.
>
> After re-reading my original post, I think I left out some important
> details. The type of job I am trying to run is a map reduce ingest that uses
> batch writers to populate an accumulo table. On previous, smaller clouds, I
> have had control of disk allocation and made sure to assign a disk per
> worker to avoid write conflicts. On this larger cloud, the disk management
> is transparent to me, but I believe the physical disks backing the vms are
> seen as one large virtual pool. Write times on the big, unstable cloud are
> very fast, 3-4 times that of our smaller clouds, but that is seen when I dd
> a file on just one vm. I think when all 150+ nodes are writing to disk, more
> than one node will try to write to the same physical disk and cause
> problematic iowait% (20-50% at least).

You could always try your `dd` trick across many nodes at once using 
pdsh or pssh. That may be a quick way to confirm your hypothesis.
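
A minimal sketch of that check in Python, assuming pdsh is installed and
can reach the workers. The hostnames, scratch path, and file size are
placeholders, not values from this thread. The helper only builds the
pdsh command line; running it is left commented out.

```python
import shlex


def build_pdsh_dd(hosts, path="/tmp/dd-test.bin", size_mb=1024):
    """Build a pdsh argv that runs the same dd write on every host at once."""
    # conv=fdatasync makes dd flush to disk before it reports throughput,
    # so the page cache does not hide contention on the shared disks.
    dd = "dd if=/dev/zero of={} bs=1M count={} conv=fdatasync".format(
        shlex.quote(path), size_mb
    )
    # -w takes a comma-separated host list; -f raises the fanout
    # (pdsh defaults to 32) so all hosts write concurrently.
    return ["pdsh", "-f", str(len(hosts)), "-w", ",".join(hosts), dd]


if __name__ == "__main__":
    hosts = ["worker001", "worker002", "worker003"]  # hypothetical hostnames
    cmd = build_pdsh_dd(hosts)
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually run it:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Comparing the throughput dd reports here against the single-vm number
should show whether the shared pool degrades under concurrent writes.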

> So, given my situation, what is the best way to configure accumulo knowing
> that the workers share disks and will have write conflicts? Do I just bump
> resources down for ingest for stability then ramp them up for non-ingest
> jobs?

The simplest change would be to reduce the amount of memory available
to each NodeManager (yarn.nodemanager.resource.memory-mb in
yarn-site.xml). That, in turn, reduces the number of concurrent
Containers run by the NodeManagers and, ultimately, the amount of data
being sent to Accumulo at any one time.
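
As a rough illustration of the arithmetic, assuming every map task
requests a fixed-size container and memory is the only limiting resource
(both simplifications, and all the numbers are made up):

```python
def max_concurrent_containers(nodemanager_memory_mb, container_memory_mb):
    """Upper bound on containers one NodeManager can run at once."""
    return nodemanager_memory_mb // container_memory_mb


def cluster_writers(nodes, nodemanager_memory_mb, container_memory_mb):
    """Upper bound on concurrent ingest tasks across the whole cluster."""
    return nodes * max_concurrent_containers(
        nodemanager_memory_mb, container_memory_mb
    )


# Hypothetical cluster: 150 nodes, 1024 MB per map container.
print(cluster_writers(150, 8192, 1024))  # 1200
print(cluster_writers(150, 4096, 1024))  # 600
```

Halving the NodeManager memory halves the number of tasks writing to
Accumulo at once, at the cost of a longer-running job.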

Depending on the data and your ingest process, there may be more you can 
do on each client, but that's getting a bit into the weeds.

>
> --
> View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Losing-tservers-Unusually-high-Last-Contact-times-tp9950p10005.html
> Sent from the Users mailing list archive at Nabble.com.
>
