From: Cardon, Tejay E
Sent: Friday, September 21, 2012 8:50 AM
To: user@accumulo.apache.org
Subject: RE: EXTERNAL: Re: Failing Tablet Servers

Alright.  So I’m changing it to:

1. Moderate-size mutations: ~1,000 key/values per mutation.
2. TSERVER_OPTS = 5g
3. memory.maps = 3g
4. Swappiness = 0 (right now I’m at 20)

 

It sounds like those are all settings I should fix anyway, so we’ll do them all.  I’ll report back if that doesn’t fix the problem.
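
For my own notes, roughly where those settings live (a sketch based on the stock conf/ layout; our local files may differ).

In conf/accumulo-env.sh (tablet server JVM heap):

export ACCUMULO_TSERVER_OPTS="-Xmx5g -Xms5g"

In conf/accumulo-site.xml (in-memory map size; with native maps this memory is allocated outside the JVM heap, so the heap and the map both need to fit in physical RAM):

<property>
  <name>tserver.memory.maps.max</name>
  <value>3G</value>
</property>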

 

Thanks again for all the help

 

Tejay Cardon

From: Eric Newton [mailto:eric.newton@gmail.com]
Sent: Friday, September 21, 2012 8:33 AM
To: user@accumulo.apache.org
Subject: Re: EXTERNAL: Re: Failing Tablet Servers

 

We regularly send an overwhelming number of small key/value pairs to tablet servers (see the continuous ingest test).

 

If your servers are going down with smaller mutations, send the logs again.  I suspect that the tserver is being pushed into swap, and then the GC is taking too long.  That causes the tserver to lose its lock in zookeeper.

 

Make sure that swappiness is set to zero.
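
Something like this (standard Linux sysctl; exact steps vary by distro):

cat /proc/sys/vm/swappiness                                # check the current value
sudo sysctl -w vm.swappiness=0                             # change it for the running kernel
echo "vm.swappiness = 0" | sudo tee -a /etc/sysctl.conf    # make it stick across reboots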

 

-Eric

On Fri, Sep 21, 2012 at 10:12 AM, Cardon, Tejay E <tejay.e.cardon@lmco.com> wrote:

Jim, Eric, and Adam,

Thanks.  It sounds like you’re all saying the same thing.  Originally I was doing each key/value as its own mutation, and it was blowing up much faster (probably due to the volume/overhead of the mutation objects themselves.  I’ll try refactoring to break them up into something in-between.  My keys are small (<25 Bytes), and my values are empty, but I’ll aim for ~1,000 key/values per mutation and see how that works out for me.

 

Eric,

I was under the impression that the memory.maps setting was not very important when using native maps.  Apparently I’m mistaken there.  What does this setting control when in a native map setting?  And, in general, what’s the proper balance between tserver_opts and tserver.memory.maps?

 

With regards to the “Finished gathering information from 24 servers in 27.45 seconds”  Do you have any recommendations for how to chase down the bottleneck?  I’m pretty sure I’m having GC issues, but I’m not sure what is causing them on the server side.  I’m sending a fairly small number of very large mutation objects, which I’d expect to be a moderate problem for the GC, but not a huge one..

 

Thanks again to everyone for being so responsive and helpful.

 

Tejay Cardon

 

 

From: Eric Newton [mailto:eric.newton@gmail.com]
Sent: Friday, September 21, 2012 8:03 AM


To: user@accumulo.apache.org
Subject: EXTERNAL: Re: Failing Tablet Servers

 

A few items noted from your logs:

 

tserver.memory.maps.max = 1G

 

If you are giving your processes 10G, you might want to make the map larger, say 6G, and then reduce the JVM by 6G.

 

Write-Ahead Log recovery complete for rz<;zw== (8 mutations applied, 8000000 entries created)

 

You are creating rows with 1M columns.  This is ok, but you might want to write them out more incrementally.

 

WARN : Running low on memory

 

That's pretty self-explanatory.  I'm guessing that the very large mutations are causing the tablet servers to run out of memory before they are held waiting for minor compactions.

 

Finished gathering information from 24 servers in 27.45 seconds

 

Something is running slow, probably due to GC thrashing.

 

WARN : Lost servers [10.1.24.69:9997[139d46130344b98]]

 

And there's a server crashing, probably due to an OOM condition.

 

Send smaller mutations.  Maybe keep it to 200K column updates.  You can still have 1M wide rows, just send 5 mutations.

 

-Eric

 

On Thu, Sep 20, 2012 at 5:05 PM, Cardon, Tejay E <tejay.e.cardon@lmco.com> wrote:

I’m seeing some strange behavior on a moderate (30 node) cluster.  I’ve got 27 tablet servers on large dell servers with 30GB of memory each.  I’ve set the TServer_OPTS to give them each 10G of memory.  I’m running an ingest process that uses AccumuloInputFormat in a MapReduce job to write 1,000 rows with each row containing ~1,000,000 columns in 160,000 families.  The MapReduce initially runs quite quickly and I can see the ingest rate peak on the  monitor page.  However, after about 30 seconds of high ingest, the ingest falls to 0.  It then stalls out and my map task are eventually killed.  In the end, the map/reduce fails and I usually end up with between 3 and 7 of my Tservers dead.

 

Inspecting the tserver.err logs shows nothing, even on the nodes that fail.  The tserver.out log shows a java OutOfMemoryError, and nothing else.  I’ve included a zip with the logs from one of the failed tservers and a second one with the logs from the master.  Other than the out of memory, I’m not seeing anything that stands out to me.

 

If I reduce the data size to only 100,000 columns, rather than 1,000,000, the process takes about 4 minutes and completes without incident.

 

Am I just ingesting too quickly?

 

Thanks,

Tejay Cardon

 

 

--Boundary_(ID_kJr2Ipdtzam+b1pIx/ZIjw)--