accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
Date Wed, 22 Feb 2017 04:45:51 GMT
It should be safe to merge on the metadata table. That was one of the goals
of moving the root tablet into its own table. I'm pretty sure we have a
build test to ensure it works.

On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR <matt.dickson@defence.gov.au>
wrote:

> *UNOFFICIAL*
> Firstly, thank you for your advice; it's been very helpful.
>
> Increasing the tablet server memory has allowed the metadata table to come
> online.  From running rfile-info and looking at the splits for the
> metadata table, it appears that all the metadata table entries are in one
> tablet.  All tablet servers then query the one node hosting that tablet.
>
> I suspect the cause was a poorly designed table for which the Accumulo
> GUI at one point reported 1.02T tablets.  We've since deleted that table,
> but it may be that it left so many entries in the metadata table that all
> of the metadata table's splits fell within this one massive table, which
> had the table id 1vm.
>
> To rectify this, is it safe to run a merge on the metadata table to force
> it to redistribute?
>
> ------------------------------
> *From:* Michael Wall [mailto:mjwall@gmail.com]
> *Sent:* Wednesday, 22 February 2017 02:44
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> Matt,
>
> If I am reading this correctly, you have a tablet that is being loaded
> onto a tserver.  That tserver dies, so the tablet is then assigned to
> another tserver.  While the tablet is being loaded, that tserver dies, and
> so on.  Is that correct?
>
> Can you identify the tablet that is bouncing around?  If so, try using
> rfile-info -d to inspect the rfiles associated with that tablet.  Also look
> at the rfiles that compose that tablet to see if anything sticks out.
>
> Any logs that would help explain why the tablet server is dying?  Can you
> increase the memory of the tserver?
>
> Mike
>
> On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com> wrote:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely
> ones.
>
> * JVM gc pauses
> * ZooKeeper max client connections
> * Operating System/Hardware-level pauses
>
> The first should be noticeable in the Accumulo log. There is a daemon
> running which watches for pauses that happen and then reports them. If
> this is happening, you might have to give the process some more Java
> heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the last, swappiness is the most likely candidate (assuming this
> is hopping across different physical nodes), as are "transparent huge
> pages". If it is limited to a single host, things like bad NICs, hard
> drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> <matt.dickson@defence.gov.au> wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup
> > the server that hosts a particular metadata tablet gets scanned by all other
> > tablet servers in the cluster.  This then crashes that tablet server with an
> > error in the tserver log;
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> > communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for
> > /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host which then
> > fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following log
> > statement from all tserver.logs in the cluster:
> >
> > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> > org.apache.thrift.transport.TTransportException  null
> > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > Hope that helps complete the picture.
> >
> >
> > ________________________________
> > From: Christopher [mailto:ctubbsii@apache.org]
> > Sent: Tuesday, 21 February 2017 13:17
> >
> > To: user@accumulo.apache.org
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond to
> > split points in the metadata table. There is no need for the tables which
> > existed when the metadata table split to still exist for this to continue to
> > act as a valid split point.
> >
> > Would need to see the exception stack trace, or at least an error message,
> > to troubleshoot the shell scanning error you saw.
> >
> >
> > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <matt.dickson@defence.gov.au>
> > wrote:
> >>
> >> UNOFFICIAL
> >>
> >> In case it is ok to remove these from the root table, how can I scan the
> >> root table for rows with a rowid starting with !0;1vm?
> >>
> >> Running "scan -b !0;1vm" throws an exception and exits the shell.
> >>
> >>
> >> -----Original Message-----
> >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> >> Sent: Tuesday, 21 February 2017 09:30
> >> To: 'user@accumulo.apache.org'
> >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> UNOFFICIAL
> >>
> >>
> >> Does that mean I should have entries for 1vm in the metadata table
> >> corresponding to the root table?
> >>
> >> We are running 1.6.5
> >>
> >>
> >> -----Original Message-----
> >> From: Josh Elser [mailto:josh.elser@gmail.com]
> >> Sent: Tuesday, 21 February 2017 09:22
> >> To: user@accumulo.apache.org
> >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> The root table should only reference the tablets in the metadata table.
> >> It's a hierarchy: just as metadata tracks the tablets of the user tables,
> >> root tracks the tablets of the metadata table.
> >>
> >> What version are ya running, Matt?
> >>
> >> Dickson, Matt MR wrote:
> >> > *UNOFFICIAL*
> >> >
> >> > I have a situation where all tablet servers are progressively being
> >> > declared dead. From the logs the tservers report errors like:
> >> > 2017-02-.... DEBUG: Scan failed thrift error
> >> > org.apache.thrift.transport.TTransportException null
> >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> >> > 1vm was a table id that was deleted several months ago so it appears
> >> > there is some invalid reference somewhere.
> >> > Scanning the metadata table with "scan -b 1vm" returns no rows for
> >> > 1vm.
> >> > A scan of the accumulo.root table returns approximately 15 rows that
> >> > start with !0;1vm;<i/p addr>::2016103 blah. How are the root
> >> > table entries used, and would it be safe to remove these entries since
> >> > they reference a deleted table?
> >> > Thanks in advance,
> >> > Matt
> >
> > --
> > Christopher
>
> --
Christopher
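
For anyone reading this thread later, here is a minimal Python sketch of
how the root table rows discussed above can be read. It assumes the
metadata row layout "<tableId>;<endRow>" (or "<tableId><" for a table's
last tablet) and that "!0" is the metadata table's id, as the rows quoted
in the thread suggest. The function names are hypothetical and not part of
Accumulo, and the sample row is assembled from values quoted above purely
for illustration.

```python
# Hypothetical helpers, not part of Accumulo. They assume tablet rows look
# like "<tableId>;<endRow>", or "<tableId><" for a table's last tablet.

METADATA_TABLE_ID = "!0"  # assumed id of accumulo.metadata


def split_tablet_row(row):
    """Split a tablet row into (table_id, end_row).

    end_row is None when the row describes the table's last tablet.
    """
    table_id, sep, end_row = row.partition(";")
    if sep:
        return table_id, end_row
    if row.endswith("<"):
        return row[:-1], None
    raise ValueError(f"not a tablet row: {row!r}")


def user_table_behind_root_row(root_row):
    """Return (user_table_id, user_end_row) named by a root table row.

    A root table row describes a metadata tablet. That tablet's end row is
    itself a metadata row, so it names a user table even if the user table
    has since been deleted. Returns None for the last metadata tablet.
    """
    table_id, metadata_end_row = split_tablet_row(root_row)
    if table_id != METADATA_TABLE_ID:
        raise ValueError(f"not a metadata tablet row: {root_row!r}")
    if metadata_end_row is None:
        return None
    return split_tablet_row(metadata_end_row)


# Illustrative row built from the values quoted in this thread.
print(user_table_behind_root_row("!0;1vm;125.323.233.23::2016103"))
```

Read this way, a root table row that mentions 1vm is only recording where
the metadata table was split. It is not a live reference to the deleted
table, which is consistent with the advice above not to remove those rows.
A metadata split point that is not a tablet row would make this sketch
raise ValueError.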