accumulo-user mailing list archives

From "mohit.kaushik" <mohit.kaus...@orkash.com>
Subject Re: Mutation Rejected exception with server Error 1
Date Wed, 23 Dec 2015 13:21:06 GMT

Thanks for the beautiful explanation, Eric. So this means that if I get a
MutationsRejectedException due to a tablet server failure, the
batch writer will resend the mutations to some other server and I do not have
to worry about them. Great...

But what about the case where we get a MutationsRejectedException and there
is no server failure? Today I again faced MutationsRejectedExceptions
with "server error 1", mainly for the two reasons below, while there is no
related exception in the tablet server logs.
(1) Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper 
timeout period 30000
(2) org.apache.accumulo.core.client.impl.AccumuloServerException: Error 
on server orkash3:9997

org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 0 # exceptions 1
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
    at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
    at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
    at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
    at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper timeout period 30000
    at org.apache.accumulo.fate.zookeeper.ZooSession.connect(ZooSession.java:117)

org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 2 # exceptions 2
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
    at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
    at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
    at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
    at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash3:9997
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)



Both exceptions appear on the client side. I have three zookeeper nodes
(version 3.4.6) deployed on the same nodes on which the tservers run. I got
these exceptions more than 12000 times, as I can see on the kibana dashboard.
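
When counting these exceptions on a dashboard, it may help to split them by the counters in the summary line rather than lumping them together. The following is only a sketch using the JDK: it parses the summary text exactly as it appears in the traces in this message. The class name `RejectionSummary` and the map keys are made up for illustration and are not part of the Accumulo API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RejectionSummary {
    // Matches the summary text seen in the traces, tolerating single or double spaces.
    private static final Pattern SUMMARY = Pattern.compile(
        "# constraint violations : (\\d+)\\s+security codes: \\{.*?\\}\\s+"
        + "# server errors (\\d+)\\s+# exceptions (\\d+)");

    /** Extracts the three counters from a MutationsRejectedException message. */
    public static Map<String, Integer> parse(String message) {
        Matcher m = SUMMARY.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("no summary found in: " + message);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("constraintViolations", Integer.parseInt(m.group(1)));
        counts.put("serverErrors", Integer.parseInt(m.group(2)));
        counts.put("unknownExceptions", Integer.parseInt(m.group(3)));
        return counts;
    }

    public static void main(String[] args) {
        String msg = "org.apache.accumulo.core.client.MutationsRejectedException: "
            + "# constraint violations : 0 security codes: {} # server errors 2 # exceptions 2";
        // Prints the counters in insertion order.
        System.out.println(parse(msg));
    }
}
```

With counters separated like this, the ZooKeeper connection failures (server errors 0) and the tablet server errors (server errors 2) can be tracked as two different series.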

Thanks
Mohit Kaushik


On 12/23/2015 06:22 PM, Eric Newton wrote:
> The accumulo batch writer will re-send mutations if a tablet server 
> fails, or rejects the mutations because the tablet has moved.  There's 
> nothing you have to do to recover from fail-overs and re-balancing.
>
> I'm not a kernel expert, but I believe that a swappiness setting of 
> "1" is equivalent to "0".
>
> The error you are seeing is part of the failing tablet server 
> scenario.  This is a bit complicated, so I'm going to name your three 
> tablet servers A, B and C.
>
> Tablet server A is hosting a tablet, let's call it a-tablet.
> Tablet server B is hosting a metadata tablet, let's call it m-tablet.
> m-tablet records the information about a-tablet:
>
>   * where it is hosted
>   * what files it has, and their approximate sizes
>   * book-keeping related to bulk ingest
>   * etc. I think the O'Reilly Accumulo book has some great details
>
> Now when A ingests some data, it eventually flushes the updates from 
> memory to a file.
> Tablet server A then writes this new information to m-tablet, on 
> Tablet server B.
>
> Now for the failure:
> Tablet server A does a java memory garbage collection, and starts 
> pulling data from swap. That makes it go really slow, and it loses 
> its zookeeper session.
>
> But it's running so slowly that it takes a moment to realize it 
> should die.
>
> In the meantime, the thread that is flushing memory attempts to 
> update m-tablet with the new file information.
>
> Fortunately there's a constraint on m-tablet. The constraint is that 
> mutations must contain a valid zookeeper session. This prevents tablet 
> server A from making updates to m-tablet when it no longer has the right 
> to host the tablet.
>
> Your initial error is from tablet server A making an update to tablet 
> server B's m-tablet.  It's getting a constraint violation: tablet 
> server A has lost its zookeeper session, and will fail momentarily.
>
> To make this extra confusing: A and B might be the same server.
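
The fencing idea described above can be illustrated with a toy model. This is only a sketch: `SessionRegistry` stands in for ZooKeeper and `MetadataTablet` for m-tablet, and both classes, their methods, and the file names are hypothetical and do not reflect Accumulo's actual classes or constraint implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FencingSketch {
    /** Stand-in for ZooKeeper: tracks which session ids are still live. */
    public static class SessionRegistry {
        private final Set<Long> live = new HashSet<>();
        public void open(long id) { live.add(id); }
        public void expire(long id) { live.remove(id); }
        public boolean isLive(long id) { return live.contains(id); }
    }

    /** Stand-in for m-tablet: accepts an update only from a writer with a live session. */
    public static class MetadataTablet {
        private final SessionRegistry sessions;
        private final Map<String, String> latestFile = new HashMap<>();

        public MetadataTablet(SessionRegistry sessions) { this.sessions = sessions; }

        /** Returns false (a "constraint violation") if the writer's session is gone. */
        public boolean recordFlush(String tablet, String file, long writerSession) {
            if (!sessions.isLive(writerSession)) {
                return false;
            }
            latestFile.put(tablet, file);
            return true;
        }

        public String latestFile(String tablet) { return latestFile.get(tablet); }
    }

    public static void main(String[] args) {
        SessionRegistry zk = new SessionRegistry();
        MetadataTablet mTablet = new MetadataTablet(zk);
        long sessionA = 1L;            // tablet server A's session
        zk.open(sessionA);

        // A flushes while its session is live: the update is accepted.
        System.out.println(mTablet.recordFlush("a-tablet", "file-1", sessionA));

        // A stalls and its session expires; the late flush update is rejected.
        zk.expire(sessionA);
        System.out.println(mTablet.recordFlush("a-tablet", "file-2", sessionA));

        // The metadata still points at the last file written under a valid session.
        System.out.println(mTablet.latestFile("a-tablet"));
    }
}
```

The point of the model is that the check happens on the receiving side, so a server that has not yet noticed its own session loss still cannot change the metadata.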
>
> -Eric
>
>
> On Tue, Dec 22, 2015 at 11:31 PM, mohit.kaushik 
> <mohit.kaushik@orkash.com <mailto:mohit.kaushik@orkash.com>> wrote:
>
>
>     I have 3 tablet servers hosting around 1.4K tablets. If a tablet
>     server loses its session with zookeeper and kills itself, the
>     system takes some time to move all hosted tablets to other servers.
>
>     In this case, if an ingest is in progress, what should happen to
>     the mutations going to tablets hosted by that tablet server?
>     Is this the reason for the first exception? Should they not be
>     redirected to other servers?
>     And I had set the system swappiness to 1. Should I keep it at 0 in
>     this case? I will check further.
>
>     Thanks for the reply
>
>     -Mohit Kaushik
>
>
>     On 12/22/2015 08:17 PM, Eric Newton wrote:
>>     A tablet server is given the rights to manage a tablet.
>>
>>     To maintain consistency, it is critical that no other server
>>     uses the tablet.
>>
>>     To maintain the right to access a tablet, it must maintain a
>>     zookeeper session. The zookeeper session periodically exchanges
>>     keep-alive messages. If either party fails to get a keep-alive,
>>     zookeeper will close the connection. The client can attempt to
>>     reconnect, but if it fails to do so, the session will time out.
>>
>>     If the tablet server loses its session with zookeeper, the rest
>>     of the system can take over its tablets.
>>
>>     When a tablet server detects that it lost its zookeeper session,
>>     it kills itself to avoid doing anything with the tablets it no
>>     longer has the right to host.
>>
>>     What you are seeing here is the first step in that process, and
>>     it is probably due to the tablet server not sending a keep-alive
>>     message to zookeeper in time.
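
To make the timing concrete, here is a deliberately simplified model of session expiry. It assumes a session expires once the gap between two consecutive keep-alives exceeds the timeout, and it ignores reconnect attempts; this is not ZooKeeper's actual protocol. The 30000 ms timeout is the value reported in the client error in this thread, the heartbeat timestamps are made up, and the class and method names are hypothetical.

```java
import java.util.OptionalLong;

public class SessionExpirySketch {
    /**
     * Returns the time (ms) at which the session would expire: the last
     * keep-alive before the first over-long gap, plus the timeout.
     * Returns empty if no gap between keep-alives exceeds the timeout.
     */
    public static OptionalLong expiryTime(long[] heartbeatsMs, long timeoutMs) {
        for (int i = 1; i < heartbeatsMs.length; i++) {
            long gap = heartbeatsMs[i] - heartbeatsMs[i - 1];
            if (gap > timeoutMs) {
                return OptionalLong.of(heartbeatsMs[i - 1] + timeoutMs);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        long timeoutMs = 30_000;

        // Keep-alives every 10 s: the session stays alive.
        long[] healthy = {0, 10_000, 20_000, 30_000};
        System.out.println(expiryTime(healthy, timeoutMs).isPresent());

        // A 45 s stall (for example, garbage collection touching swapped-out memory).
        long[] stalled = {0, 10_000, 55_000};
        OptionalLong expiry = expiryTime(stalled, timeoutMs);
        if (expiry.isPresent()) {
            System.out.println("session expired at ms " + expiry.getAsLong());
        }
    }
}
```

In the stalled case the keep-alive at 55 s arrives after the session was already considered dead at 40 s, which is the window in which the server is still running but no longer entitled to its tablets.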
>>
>>     There are many reasons for a tablet server to be delayed in
>>     sending a keep-alive message. By far the most common is that your
>>     system is over-subscribed for memory, and part of the tablet
>>     server's memory swapped out. Once the java garbage collection
>>     cycle swapped it back in, there was a considerable delay.
>>
>>     However, there can be other things going on. This is just a best
>>     guess.  Monitor swap usage, as a first diagnostic step.
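
As one way to monitor swap usage on Linux, the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` can be read periodically on each tablet server host. The sketch below uses only the JDK; the class name `SwapCheck` and the fallback sample values are made up for illustration, and tools such as `free` or `vmstat` report the same information.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SwapCheck {
    /** Returns the kB value of a /proc/meminfo field such as "SwapTotal", or -1 if absent. */
    public static long fieldKb(List<String> lines, String field) {
        for (String line : lines) {
            if (line.startsWith(field + ":")) {
                // Line looks like "SwapTotal:       2097148 kB"; take the number.
                String[] parts = line.substring(field.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    /** Swap in use, in kB, computed as SwapTotal minus SwapFree. */
    public static long swapUsedKb(List<String> lines) {
        long total = fieldKb(lines, "SwapTotal");
        long free = fieldKb(lines, "SwapFree");
        if (total < 0 || free < 0) {
            throw new IllegalArgumentException("SwapTotal/SwapFree not found");
        }
        return total - free;
    }

    public static void main(String[] args) throws IOException {
        Path meminfo = Paths.get("/proc/meminfo");
        // Fall back to made-up sample values when not running on Linux.
        List<String> lines = Files.exists(meminfo)
            ? Files.readAllLines(meminfo)
            : Arrays.asList("SwapTotal:  2097148 kB", "SwapFree:  1048574 kB");
        System.out.println("swap used (kB): " + swapUsedKb(lines));
    }
}
```

A non-zero and growing value on a tablet server host while ingest is running would be consistent with the over-subscribed memory scenario described above.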
>>
>>     -Eric
>>
>>
>>
>>     On Tue, Dec 22, 2015 at 8:30 AM, mohit.kaushik
>>     <mohit.kaushik@orkash.com <mailto:mohit.kaushik@orkash.com>> wrote:
>>
>>         Dear All,
>>
>>         The mutations rejected exception can be seen at client side
>>         with server error 1.
>>         org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 1 # exceptions 1
>>             at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>>             at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>>             at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>>             at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>>             at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>>             at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>>             at java.lang.Thread.run(Thread.java:745)
>>         Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash1:9997
>>             at
>>
>>         I also found exceptions in Monitor related to Tracing.
>>
>>         Tracing spans are being dropped because there are already 5000 spans queued for delivery.
>>         This does not affect performance, security or data integrity, but distributed tracing information is being lost.
>>
>>         and, 6458 times:
>>
>>         Got an IOException in internalRead!
>>             java.io.IOException: Connection reset by peer
>>                 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>                 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>                 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>                 at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>                 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>                 at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>                 at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:537)
>>                 at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
>>                 at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
>>                 at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.select(CustomNonBlockingServer.java:228)
>>                 at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.run
>>
>>
>>
>>         I am seeing the following exceptions in the tserver logs, and
>>         one tserver dies.
>>
>>         2015-12-22 09:37:27,173 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
>>         org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f8708e0d-9238-41f5-b948-8f435fd01207/tables/16/conf/table.split.threshold
>>             at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>             at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>             at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>>             at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
>>             at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
>>             at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
>>             at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
>>             at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
>>             at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
>>             at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
>>             at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:197)
>>             at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1604)
>>             at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1772)
>>             at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1853)
>>             at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>             at java.lang.Thread.run(Thread.java:745)
>>
>>         These are causing problems with continuous data ingestion,
>>         and I have also experienced delays in queries and table create
>>         commands.
>>         Could you please comment on what could be causing these exceptions?
>>
>>         Thanks
>>         Mohit Kaushik
>>
>>
>>
