accumulo-user mailing list archives

From "Ott, Charles H." <CHARLES.H....@saic.com>
Subject RE: Uneven distribute of Hosted Tablets?
Date Fri, 31 May 2013 17:37:19 GMT
2013-05-31 09:37:03,549 [tabletserver.TabletServer] DEBUG: Unassigning
12<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,667 [tabletserver.TabletServer] DEBUG: Unassigning
14<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,697 [tabletserver.TabletServer] DEBUG: Unassigning
16<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,751 [tabletserver.TabletServer] DEBUG: Unassigning
18<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,785 [tabletserver.TabletServer] DEBUG: Unassigning
1<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,824 [tabletserver.TabletServer] DEBUG: Unassigning
1b<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,868 [tabletserver.TabletServer] DEBUG: Unassigning
1c<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,893 [tabletserver.TabletServer] DEBUG: Unassigning
1d<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,919 [tabletserver.TabletServer] DEBUG: Unassigning
2<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,940 [tabletserver.TabletServer] DEBUG: Unassigning
4<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,969 [tabletserver.TabletServer] DEBUG: Unassigning
7<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:03,997 [tabletserver.TabletServer] DEBUG: Unassigning
9<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,014 [tabletserver.TabletServer] DEBUG: Unassigning
a<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,049 [tabletserver.TabletServer] DEBUG: Unassigning
d<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,071 [tabletserver.TabletServer] DEBUG: Unassigning
g<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,119 [tabletserver.TabletServer] DEBUG: Unassigning
i<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,145 [tabletserver.TabletServer] DEBUG: Unassigning
j<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,183 [tabletserver.TabletServer] DEBUG: Unassigning
k<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,210 [tabletserver.TabletServer] DEBUG: Unassigning
l<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,235 [tabletserver.TabletServer] DEBUG: Unassigning
o<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,260 [tabletserver.TabletServer] DEBUG: Unassigning
p<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,284 [tabletserver.TabletServer] DEBUG: Unassigning
u<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,306 [tabletserver.TabletServer] DEBUG: Unassigning
z<<@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:04,686 [tabletserver.TabletServer] DEBUG: Unassigning
!0<;~@(null,10.35.58.81:9997[13ec2e209c79745],null)

2013-05-31 09:37:48,675 [server.Accumulo] INFO :
tserver.bulk.assign.threads = 1

2013-05-31 09:37:57,765 [tabletserver.TabletServer] INFO :
1620-accumulo.dhcp.saic.com/10.35.58.81:9997: got assignment from
master: !0;!0<<

2013-05-31 09:37:57,775 [tabletserver.TabletServer] INFO : Reporting
tablet !0;!0<< assignment failure: unable to verify Tablet Information

2013-05-31 09:37:57,961 [tabletserver.TabletServer] INFO :
1620-accumulo.dhcp.saic.com/10.35.58.81:9997: got assignment from
master: !0;!0<<

 

From: user-return-2648-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2648-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of Billie Rinaldi
Sent: Friday, May 31, 2013 1:32 PM
To: user@accumulo.apache.org
Subject: Re: Uneven distribute of Hosted Tablets?

 

Hmm.  Anything on the one that reported assignment failed?

Billie

 

On Fri, May 31, 2013 at 9:53 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

2013-05-31 09:49:53,471 [tabletserver.TabletServer] DEBUG: Got
unloadTablet message from user: !SYSTEM

2013-05-31 09:49:53,471 [tabletserver.Tablet] DEBUG:
initiateClose(saveState=true queueMinC=false disableWrites=false)
!0;!0<<

2013-05-31 09:49:53,471 [tabletserver.TabletServer] DEBUG: Failed to
unload tablet !0;!0<<... it was alread closing or closed : Tablet
!0;!0<< already closing

 

The timestamp is 12 minutes off, since the clocks are out of sync,  but
there seems to be the same number of debug statements above as there
were errors in the master.
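
One way to check that the counts really line up is to tally the master's "reports ... failed" errors per tablet server and compare them with the number of matching DEBUG lines in each tserver log. The following is only a rough sketch: the regular expression is inferred from the master log lines quoted further down this thread (with the email line wrapping joined back into single lines), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches master log lines such as
#   "... [master.Master] ERROR: 10.35.56.92:9997 reports unload failed for tablet !0;!0<<"
# and tolerates an optional "master:<host> " prefix before the tserver address.
# The pattern is an assumption based on the lines pasted in this thread.
FAILURE = re.compile(
    r"ERROR:\s+(?:master:\S+\s+)?(\S+:\d+) reports (unload|assignment) failed for tablet"
)


def count_failures(lines):
    """Count 'reports ... failed' errors per (tablet server, kind)."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input: the three master errors quoted in this thread, unwrapped.
sample = [
    "2013-05-31 09:37:57,592 [master.Master] ERROR: 10.35.56.92:9997 reports unload failed for tablet !0;!0<<",
    "2013-05-31 09:37:57,795 [master.Master] ERROR: 10.35.58.81:9997 reports assignment failed for tablet !0;!0<<",
    "2013-05-31 09:37:05,784 [master.Master] ERROR: master:1620-accumulo.dhcp.saic.com 10.35.56.92:9997 reports unload failed for tablet !0;!0<<",
]

counts = count_failures(sample)
for (server, kind), n in sorted(counts.items()):
    print(server, kind, n)
```

To run it against a real log, pass the open file object to count_failures() instead of the sample list.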

 

From: user-return-2646-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2646-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of Billie Rinaldi
Sent: Friday, May 31, 2013 12:47 PM


To: user@accumulo.apache.org
Subject: Re: Uneven distribute of Hosted Tablets?

 

Can you go to one of those servers that is reporting unload / assignment
failed and check its tserver log to see why it failed?

Billie

 

On Fri, May 31, 2013 at 9:39 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

I am not sure if I am using one of the balancers that comes with
Accumulo.  There are some errors in my logs for the master since I did
the clean shutdown/startup this morning:

 

2013-05-31 09:37:57,592 [master.Master] ERROR: 10.35.56.92:9997 reports
unload failed for tablet !0;!0<< (A lot of these errors showed up)

 

2013-05-31 09:37:57,795 [master.Master] ERROR: 10.35.58.81:9997 reports
assignment failed for tablet !0;!0<< (only one of these)

 

2013-05-31 09:37:05,784 [master.Master] ERROR:
master:1620-accumulo.dhcp.saic.com 10.35.56.92:9997 reports unload
failed for tablet !0;!0<< (a lot of these)

 

The entire batch of errors occurred within 1 minute.  They have not
recurred since.

 

 

 

From: user-return-2644-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2644-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of Billie Rinaldi
Sent: Friday, May 31, 2013 12:14 PM


To: user@accumulo.apache.org
Subject: Re: Uneven distribute of Hosted Tablets?

 

So (at the risk of stating the obvious) it seems like your cluster is in
a funny state.  I would expect the counts in the "Hosted Tablets" column
to all be roughly the same, especially after restarting the master,
assuming you're using one of the balancers that comes with Accumulo.
It's possible the cluster has gotten into this state due to the clock
differences.  Accumulo has a mechanism called "logical time" to deal
with clock differences, but it is not enabled by default.  You can
enable it when you create a table.  If you don't enable this it is
recommended that you use NTP to synchronize the clocks on your cluster.
The !METADATA table has logical time by default, but your other tables
might not contain what you expect them to if you haven't enabled logical
time.

That said, I'm not sure why the clock issue would be affecting the
balancing.  You mentioned the new warnings you saw on the monitor page
after you restarted the system.  Could you see if there are any older
errors in your log files?

Billie

 

On Fri, May 31, 2013 at 8:10 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

-bash-4.1$ ssh 1620-accumulo

-bash-4.1$ date

Fri May 31 10:52:49 EDT 2013

 

-bash-4.1$ ssh 1620-Node1

-bash-4.1$ date

Fri May 31 11:05:48 EDT 2013

 

-bash-4.1$ ssh 1620-Node2

-bash-4.1$ date

Fri May 31 11:05:58 EDT 2013

 

-bash-4.1$ ssh 1620-Node3

-bash-4.1$ date

Fri May 31 11:05:58 EDT 2013

 

Looks like the master (1620-accumulo) and its tablet server are 12-13
minutes behind the nodes.  I'm not sure my
zookeeper+Hadoop+Accumulo+storm+Kafka stack will appreciate moving
forward in time 12 minutes.
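
The skew can be worked out from the `date` output above. This is only a sketch: the readings are copied from the session, the function names are made up for illustration, and because the commands were run one after another the figures include the few seconds that passed between them.

```python
from datetime import datetime

# `date` output copied from the session above, keyed by host name.
readings = {
    "1620-accumulo": "Fri May 31 10:52:49 EDT 2013",
    "1620-Node1": "Fri May 31 11:05:48 EDT 2013",
    "1620-Node2": "Fri May 31 11:05:58 EDT 2013",
    "1620-Node3": "Fri May 31 11:05:58 EDT 2013",
}


def parse_date(text):
    # Drop the zone abbreviation before parsing: %Z handling in strptime is
    # platform-dependent, and every host here reports the same zone (EDT).
    parts = text.split()
    return datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")


def skew_seconds(readings, reference):
    """Return each host's clock offset, in seconds, relative to `reference`."""
    base = parse_date(readings[reference])
    return {
        host: int((parse_date(text) - base).total_seconds())
        for host, text in readings.items()
        if host != reference
    }


skew = skew_seconds(readings, "1620-accumulo")
for host, seconds in sorted(skew.items()):
    print(f"{host}: {seconds // 60} min {seconds % 60} s ahead of 1620-accumulo")
```

For these readings the nodes come out roughly 13 minutes ahead of the master.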

 

From: user-return-2642-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2642-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of Billie Rinaldi
Sent: Friday, May 31, 2013 11:02 AM
To: user@accumulo.apache.org


Subject: Re: Uneven distribute of Hosted Tablets?

 

Those last contact times are concerning as well.  Have they always
looked like that?  I notice they were roughly the same on your first
screenshot.  Are your server clocks not in sync?

Billie

 

On Fri, May 31, 2013 at 7:00 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

I performed a clean shutdown and startup of all the processes using the
start-all.sh/stop-all.sh scripts.

 

The systems have only been online for about 5 minutes and everything is
working.  But I see the following Recent WARN in the Logs:

 

time              application            count  level  message
31 09:37:57,0774  tserver:1620-accumulo  1      WARN   Future location is not to this server for the root tablet

 

Hosted tablet distribution seems to be worse:

 

(Image Below Here)


(Image Above Here)

 

I am able to log in and scans seem to be responsive.  I noticed that
when our entry count reached ~20 M, our batch scans were taking much
longer.  I was hoping that by distributing the tablets evenly, and
splitting some of the bigger tables, we could get better performance.

As for splitting the bigger table, I received a message from a peer.  He
mentioned that I could create a new table and split it on the values I
want, then use a MapReduce job to move the data from the single-tablet
table to the split table.
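
If the split values are not known in advance, one option is to derive them from a sorted sample of existing row IDs. The sketch below is an illustration under assumptions, not something taken from the Accumulo distribution: the function name and the sample rows are hypothetical, and it assumes a split point acts as the inclusive end row of a tablet.

```python
def pick_split_points(sorted_rows, num_tablets):
    """Pick num_tablets - 1 row IDs that cut sorted_rows into near-equal runs.

    sorted_rows is assumed to be a sorted list of unique row IDs sampled from
    the existing table. Each returned row is treated as the last row of its
    tablet (assumption: split points are inclusive end rows).
    """
    if num_tablets < 2 or len(sorted_rows) < num_tablets:
        return []
    step = len(sorted_rows) / num_tablets
    return [sorted_rows[int(i * step) - 1] for i in range(1, num_tablets)]


# Hypothetical sample of row IDs, already sorted.
rows = ["a", "b", "c", "d", "e", "f", "g", "h"]
splits = pick_split_points(rows, 4)
print(splits)
```

The resulting values would then be supplied as the split points when the new table is created.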

 

From: user-return-2638-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2638-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of John Vines
Sent: Thursday, May 30, 2013 5:30 PM
To: user@accumulo.apache.org
Cc: Lahr-Vivaz, Emilio F.


Subject: Re: Uneven distribute of Hosted Tablets?

 

Your distribution is cause for concern. I thought we had resolved a lot
of the balancer issues in 1.4.1 or 1.4.2. Are you seeing any errors from
the master in your logs? Worst case scenario is you just have to kill
the master process and start it back up and you should see things
balancing out.

 

On Thu, May 30, 2013 at 4:40 PM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

Thanks for the feedback.  I will keep what you said in mind.

 

From: user-return-2636-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-2636-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of David Medinets
Sent: Thursday, May 30, 2013 4:34 PM
To: accumulo-user
Subject: Re: Uneven distribute of Hosted Tablets?

 

Don't worry about splits until you have a few billion entries and a lot
more servers. What you're seeing now is just a bad signal to noise
ratio.

 

On Thu, May 30, 2013 at 11:22 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

First I want to say thanks to you all.  The information provided by
this mailing list has been invaluable to me and I appreciate it.

 

My newest concern is the uneven allocation of hosted tablets across my
tablet servers:

 

(Image Pasted below here)

(Image Pasted above here)

 

I have been reading about pre-splitting tables in the Accumulo guide.
But I am not sure if that would be the 'fix' for this.  (Or even if this
needs fixing.)

 

I have 3 tables that could potentially grow to n number of records.
Currently all of those tables (and their single tablets) reside on the
1620-accumulo server (hosting 24 tablets).

 

Since there are already several entries in those tables, would splitting
them be appropriate?  Does splitting guarantee that the new tablets will
be allocated to Node1 instead of Node3? Or could I perhaps "re-balance"
the cluster so that all of the tablet servers host an approximately
equal number of tablets?

 

These tablet servers were all brought up at separate times and I have
not performed any optimizations or custom operations on them.

 

 

Thanks,

Charles

 

 

 

 

 

 

 

 

