accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
Date Wed, 22 Feb 2017 04:24:22 GMT
+1 to that. Great suggestion, Mike, and great find, Matt!

I think this would be a great thing to capture in the Accumulo User
Manual if you're interested:

http://accumulo.apache.org/1.8/accumulo_user_manual.html#_troubleshooting

Michael Wall wrote:
> Hi Matt,
>
> Glad you got the metadata table to come up.  So some more questions for you.
>
> How many nodes do you have?
> How many tservers?
> How many tablets are hosted per tserver across all tables?
>
> If you deleted a table, those entries in the metadata table should be
> gone.  Are you still seeing stuff from the deleted table in the metadata
> table?  If all metadata entries are in one tablet, then there are no
> splits for the metadata table and running merge will not help.  After we
> see the answers to the questions above, I will try to recommend
> something else.
>
> Mike
>
> On Tue, Feb 21, 2017 at 6:22 PM Dickson, Matt MR
> <matt.dickson@defence.gov.au> wrote:
>
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice; it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online.  Using rfile-info and looking at the splits for
>     the metadata table, it appears that all the metadata table entries
>     are in one tablet.  All tablet servers then query the one node
>     hosting that tablet.
>     I suspect the cause was a poorly designed table for which the
>     Accumulo GUI at one point reported 1.02T tablets.  We've since
>     deleted that table, but there may have been so many entries in the
>     metadata table that all of its splits were due to this massive
>     table, which had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table
>     to force it to redistribute?
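The check-then-merge being discussed could be sketched in the Accumulo
shell roughly as follows; this is 1.6-era syntax from memory, the -b/-e
bounds are illustrative, and a merge on the metadata table deserves
extra caution:

```
# list the metadata table's current split points
getsplits -t accumulo.metadata

# merge the metadata tablets whose rows cover the old table id 1vm;
# the bounds are illustrative row keys ('<' sorts just past ';')
merge -t accumulo.metadata -b '1vm;' -e '1vm<'
```

If all entries really are in one tablet already, getsplits should
return nothing, and (as Mike notes below) a merge would be a no-op.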
>
>     ------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver.  That tserver dies, so the tablet is then
>     assigned to another tserver.  While the tablet is being loaded,
>     that tserver dies too, and so on.  Is that correct?
>
>     Can you identify the tablet that is bouncing around?  If so, try
>     using rfile-info -d to inspect the rfiles that compose that tablet
>     and see if anything sticks out.
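A sketch of that inspection; the HDFS path is purely illustrative
(in 1.6, !0 is the metadata table's id and directory name):

```
# print index and summary information, plus the key/value data (-d),
# for one rfile belonging to the suspect tablet; path is illustrative
accumulo rfile-info -d \
    hdfs://namenode:8020/accumulo/tables/!0/table_info/A0000abc.rf
```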
>
>     Any logs that would help explain why the tablet server is dying?
>     Can you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser
>     <josh.elser@gmail.com> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable in the Accumulo log. There is
>         a daemon running which watches for pauses and reports them. If
>         this is happening, you might have to give the process some
>         more Java heap, tweak your CMS/G1 parameters, etc.
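As a sketch, the tserver heap is usually raised in accumulo-env.sh
(1.6 layout); the sizes and GC flags below are illustrative, not
recommendations:

```
# accumulo-env.sh: illustrative tserver heap and GC logging settings
test -z "$ACCUMULO_TSERVER_OPTS" && export ACCUMULO_TSERVER_OPTS="${POLICY} \
  -Xmx4g -Xms4g \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 \
  -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails"
```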
>
>         For maxClientConnections, see
>         https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
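Concretely, that limit is maxClientCnxns in zoo.cfg on each ZooKeeper
server; the value below is illustrative:

```
# zoo.cfg: per-host client connection cap (commonly defaults to 60;
# 0 disables the limit; the value here is illustrative)
maxClientCnxns=250
```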
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this is hopping across different physical nodes),
>         as are "transparent huge pages". If it is limited to a single
>         host, things like bad NICs, hard drives, and other hardware
>         issues might be a source of slowness.
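On Linux those two settings can be checked, and tentatively adjusted,
along these lines (paths vary by distribution):

```
# check swappiness, then lower it so the kernel avoids swapping tservers
cat /proc/sys/vm/swappiness
sudo sysctl -w vm.swappiness=1

# check transparent huge pages, then disable them
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```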
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <matt.dickson@defence.gov.au> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>          > tablets. On startup the server that hosts a particular
>          > metadata tablet gets scanned by all other tablet servers in
>          > the cluster.  This then crashes that tablet server with an
>          > error in the tserver log:
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          > /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>          > host, which then fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see
>          > the following log statement from all tserver logs in the
>          > cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException  null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>          >
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org
>          > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table
>          > entries correspond to split points in the metadata table.
>          > The tables which existed when the metadata table split do
>          > not need to still exist for those points to remain valid
>          > split points.
>          >
>          > We would need to see the exception stack trace, or at least
>          > an error message, to troubleshoot the shell scanning error
>          > you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>          > <matt.dickson@defence.gov.au> wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>          >> can I scan the root table for rows with a rowid starting
>          >> with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>          >> shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org'
>          >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the
>          >> metadata table corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org
>          >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>          >> metadata table.  It's a hierarchy: as the metadata table
>          >> is for the user tables, the root table is for the metadata
>          >> table.
>          >>
>          >> What version are ya running, Matt?
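That hierarchy is visible directly in the shell; as a sketch (1.6
naming, where the metadata table's id is !0):

```
# each row in accumulo.root is a metadata tablet extent, i.e. a
# metadata split point prefixed with the metadata table id (!0)
table accumulo.root
scan
```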
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>          >> > progressively being declared dead. From the logs the
>          >> > tservers report errors like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.transport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago,
>          >> > so it appears there is some invalid reference somewhere.
>          >> > Scanning the metadata table with "scan -b 1vm" returns
>          >> > no rows for 1vm.
>          >> > A scan of the accumulo.root table returns approximately
>          >> > 15 rows that start with: !0;1vm;<ip addr>::2016103 blah
>          >> > How are the root table entries used, and would it be
>          >> > safe to remove these entries since they reference a
>          >> > deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >
>          > --
>          > Christopher
>
