accumulo-user mailing list archives

From "Cardon, Tejay E" <tejay.e.car...@lmco.com>
Subject RE: EXTERNAL: Re: Failing Tablet Servers
Date Thu, 20 Sep 2012 22:50:35 GMT
Sorry, yes, it's the AccumuloOutputFormat.  I do about 1,000,000 Mutation.put calls before I do a
context.write.  Any idea how many is safe?

Thanks,
Tejay
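
For reference: the thread doesn't settle on a hard number, and one reasonable approach (an assumption here, not something stated on the list) is to cap each Mutation by its estimated size rather than by a fixed put count. A rough sketch of the body of a map() call in a mapper whose output key is Text and output value is Mutation, where the table name, row handling, and 4 MB threshold are all placeholders:

// "row" is the Text row id for this map() call, "context" is the mapper Context,
// and the table name and 4 MB cap below are illustrative placeholders.
Mutation m = new Mutation(row);
long maxBytes = 4L * 1024 * 1024;

for (int col = 0; col < 1000000; col++) {        // stand-in for the real column loop
  m.put(new Text("fam" + (col % 160000)), new Text("qual" + col), new Value(new byte[0]));
  if (m.estimatedMemoryUsed() > maxBytes) {      // cap by size, not by a fixed put count
    context.write(new Text("wide_table"), m);    // ship the partial row
    m = new Mutation(row);                       // same row key, fresh Mutation object
  }
}
if (m.size() > 0) {
  context.write(new Text("wide_table"), m);      // flush whatever is left
}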

From: Jim Klucar [mailto:klucar@gmail.com]
Sent: Thursday, September 20, 2012 4:44 PM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Failing Tablet Servers

Do you mean AccumuloOutputFormat? Is the map failing or the reduce failing? How many Mutation.put
calls are you doing before a context.write? Too many puts will blow up the Mutation object's memory
footprint. You need to periodically call context.write and create a new Mutation object. At some point
I wrote a ContextFlushingMutation that handled this problem for you, but I'd have to dig around for
it or rewrite it.
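
A hypothetical reconstruction of that kind of helper (not the actual ContextFlushingMutation mentioned above; the class shape, flush threshold, and table handling are all assumptions) might look roughly like this:

import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;

// Buffers puts for one row and hands them to the MapReduce context in bounded chunks,
// so no single Mutation accumulates an unbounded number of puts.
public class ContextFlushingMutation {
  private final TaskInputOutputContext<?, ?, Text, Mutation> context;
  private final Text table;      // table-name key expected by AccumuloOutputFormat
  private final Text row;
  private final long maxBytes;   // illustrative flush threshold, e.g. a few MB
  private Mutation current;

  public ContextFlushingMutation(TaskInputOutputContext<?, ?, Text, Mutation> context,
      Text table, Text row, long maxBytes) {
    this.context = context;
    this.table = table;
    this.row = row;
    this.maxBytes = maxBytes;
    this.current = new Mutation(row);
  }

  // Adds one column; flushes the underlying Mutation if it has grown past the threshold.
  public void put(Text cf, Text cq, Value value) throws IOException, InterruptedException {
    current.put(cf, cq, value);
    if (current.estimatedMemoryUsed() > maxBytes) {
      flush();
    }
  }

  // Writes whatever has accumulated and starts a fresh Mutation for the same row.
  public void flush() throws IOException, InterruptedException {
    if (current.size() > 0) {
      context.write(table, current);
      current = new Mutation(row);
    }
  }
}

A mapper would create one of these per row, call put() in its column loop, and call flush() once at the end of the row.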


Sent from my iPhone

On Sep 20, 2012, at 5:29 PM, "Cardon, Tejay E" <tejay.e.cardon@lmco.com> wrote:
John,
Thanks for the quick response.  I'm not seeing any errors in the logger logs.  I am using
native maps, and I left the memory map size at 1GB.  I assume that's plenty large if I'm using
native maps, right?

Thanks,
Tejay

From: John Vines [mailto:vines@apache.org]
Sent: Thursday, September 20, 2012 3:20 PM
To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Failing Tablet Servers

Okay, so we know that you're killing servers, and we know that when you drop the amount of data
down you have no issues. Two immediate possibilities come to mind:
1. You modified the tserver opts to give them 10G of memory. Did you also raise the in-memory map
size in accumulo-site.xml, leave it alone, or raise it to match the 10G? If you raised it and aren't
using the native maps, that would be problematic, since the JVM heap needs space for other purposes
as well.

2. You seem to be making giant rows. Depending on your Key/Value size, it's possible to write a row
so large that it cannot be sent (especially if you're using a WholeRowIterator), which can cause a
cascading error during log recovery. Are you seeing any errors in your logger logs?

John
On Thu, Sep 20, 2012 at 5:05 PM, Cardon, Tejay E <tejay.e.cardon@lmco.com> wrote:
I'm seeing some strange behavior on a moderate (30 node) cluster.  I've got 27 tablet servers
on large Dell servers with 30GB of memory each.  I've set the TServer_OPTS to give them each
10G of memory.  I'm running an ingest process that uses AccumuloInputFormat in a MapReduce
job to write 1,000 rows, with each row containing ~1,000,000 columns in 160,000 families.
The MapReduce job initially runs quite quickly, and I can see the ingest rate peak on the monitor
page.  However, after about 30 seconds of high ingest, the ingest rate falls to 0.  It then stalls
out, and my map tasks are eventually killed.  In the end, the MapReduce job fails and I usually
end up with between 3 and 7 of my Tservers dead.
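
For concreteness, a rough sketch of the mapper shape being described (the input types, table name, and family/qualifier scheme are illustrative assumptions, not taken from the actual job); everything for a row is buffered in a single Mutation until one context.write at the end, which is the pattern the replies above focus on:

import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: one map call per row, ~1,000,000 puts across 160,000 families,
// and a single context.write once the whole row has been built up in memory.
public class WideRowIngestMapper extends Mapper<LongWritable, Text, Text, Mutation> {
  private static final Text TABLE = new Text("wide_table");   // hypothetical table name

  @Override
  protected void map(LongWritable offset, Text rowId, Context context)
      throws IOException, InterruptedException {
    Mutation m = new Mutation(rowId);
    for (int col = 0; col < 1000000; col++) {
      m.put(new Text("fam" + (col % 160000)),                 // ~160,000 families, as described
            new Text("qual" + col),
            new Value(new byte[0]));                          // payload omitted in this sketch
    }
    context.write(TABLE, m);   // the entire row is held in this one Mutation until here
  }
}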

Inspecting the tserver.err logs shows nothing, even on the nodes that fail.  The tserver.out
log shows a Java OutOfMemoryError and nothing else.  I've included a zip with the logs from
one of the failed tservers, and a second one with the logs from the master.  Other than the
out-of-memory error, I'm not seeing anything that stands out to me.

If I reduce the data size to only 100,000 columns, rather than 1,000,000, the process takes
about 4 minutes and completes without incident.

Am I just ingesting too quickly?

Thanks,
Tejay Cardon

