accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Mutation Rejected exception with server Error 1
Date Wed, 23 Dec 2015 19:03:40 GMT
I was simplifying a bit too much. If an error propagates all the way to an
Accumulo client call, then it has stopped retrying for you.

An example:

   - create a batchwriter. this creates an update session within the tserver
   - mutations are sent against this session id
   - mutations are pushed with one-way rpc calls: they are streamed to the
   server with no status sent back to the client
   - what if your client swaps out?
   - the tablet server times out your update session
   - the next round of mutations will fail to apply
   - your call to addMutation will fail

There are some errors, like tablet-not-found, which can be attributed to
normal operations: balancing, splitting, tserver failure. But not showing
up to an update session for a long period is unexpected.  Not cleaning up
update sessions wastes resources in a server.  Round-trip RPC calls for
each update would be expensive, and require a more sophisticated RPC layer.

If you need to make sure your mutations went in, you will need to call
flush() or close() on your batchwriter.  If there's an error, you will need
to re-send all the mutations since the last flush or close.
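
A minimal sketch of that "re-send everything since the last flush" pattern
follows. To keep it runnable without a cluster it does not use the Accumulo
API: MutationSink is a hypothetical stand-in for the addMutation/flush/close
calls on a batchwriter, and ReplayingWriter, the factory and maxAttempts are
names made up for illustration. It also assumes that applying the same
mutation twice is harmless for your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the BatchWriter calls used below; not the Accumulo API. */
interface MutationSink<M> {
    void addMutation(M mutation) throws Exception;
    void flush() throws Exception;
    void close() throws Exception;
}

/** Keeps every mutation added since the last successful flush so it can be re-sent. */
final class ReplayingWriter<M> {
    private final Supplier<MutationSink<M>> factory; // creates a fresh writer (hypothetical)
    private final int maxAttempts;
    private final List<M> sinceLastFlush = new ArrayList<>();
    private MutationSink<M> sink;

    ReplayingWriter(Supplier<MutationSink<M>> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.sink = factory.get();
    }

    /** Only buffers locally; nothing is treated as written until flush() returns. */
    void add(M mutation) {
        sinceLastFlush.add(mutation);
    }

    /** Sends the buffer and flushes; on failure, replaces the writer and re-sends everything. */
    void flush() throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                for (M m : sinceLastFlush) {
                    sink.addMutation(m);
                }
                sink.flush();
                sinceLastFlush.clear(); // safe to forget only after a successful flush
                return;
            } catch (Exception e) {
                last = e;
                try {
                    sink.close(); // a failed writer must be closed; it may re-throw
                } catch (Exception rethrown) {
                    // expected: close() reports the same failure again
                }
                sink = factory.get(); // start over with a new update session
            }
        }
        throw last; // every attempt failed; the caller still holds the unflushed mutations
    }
}
```

In a real client the factory would presumably wrap a call such as
Connector.createBatchWriter(...), and M would be Mutation. The point is only
that the client keeps its own copy of the mutations until flush() succeeds.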

Given the large numbers of errors you are experiencing, I suspect you may
need to grow your cluster.  Fortunately, a 300 node accumulo cluster is
known to work, too. :-)

BTW, if you get a MutationsRejectedException, you will need to close the
batch writer, which will re-throw the MutationsRejectedException. I just
ran into this problem this week.

Failure to talk to zookeeper is *really* unexpected.

Have you noticed your nodes using any significant swap?
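
One rough way to check, assuming Linux nodes: /proc/meminfo reports SwapTotal
and SwapFree in kB, and the difference is the swap currently in use. The
sketch below only parses that text; the class and method names are made up
for illustration, and how much swap counts as "significant" is left to you.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

final class SwapCheck {
    /** Parses /proc/meminfo-style text into field name -> first numeric value. */
    static Map<String, Long> parseMeminfo(String text) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                fields.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // skip lines whose value is not a plain number
            }
        }
        return fields;
    }

    /** Swap in use, in kB: SwapTotal minus SwapFree (missing fields count as 0). */
    static long swapUsedKb(String meminfoText) {
        Map<String, Long> f = parseMeminfo(meminfoText);
        return f.getOrDefault("SwapTotal", 0L) - f.getOrDefault("SwapFree", 0L);
    }

    public static void main(String[] args) throws Exception {
        Path meminfo = Paths.get("/proc/meminfo"); // Linux only
        if (!Files.isReadable(meminfo)) {
            System.out.println("/proc/meminfo is not available on this system");
            return;
        }
        String text = new String(Files.readAllBytes(meminfo), StandardCharsets.UTF_8);
        System.out.println("swap used (kB): " + swapUsedKb(text));
    }
}
```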

-Eric


On Wed, Dec 23, 2015 at 8:21 AM, mohit.kaushik <mohit.kaushik@orkash.com>
wrote:

>
> Thanks for the beautiful explanation, Eric. So this means that if I get a
> MutationsRejectedException due to a tablet server failure, the batchwriter
> will resend the mutations to some other server and I do not have to worry
> about them. Great...
>
> But what about the case where we get a MutationsRejectedException and there
> is no server failure? Today I again hit MutationsRejectedExceptions with
> "server error 1", mainly for two reasons, while there is no related
> exception in the tablet server logs.
> (1) Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper
> timeout period 30000
> (2) org.apache.accumulo.core.client.impl.AccumuloServerException: Error
> on server orkash3:9997
>
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 0 # exceptions 1
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>     at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>     at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>     at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>     at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper timeout period 30000
>     at org.apache.accumulo.fate.zookeeper.ZooSession.connect(ZooSession.java:117)
>
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {} # server errors 2 # exceptions 2
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>     at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>     at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>     at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>     at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash3:9997
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)
>
>
>
> Both exceptions appear on the client side. I have three zookeeper nodes
> (version 3.4.6) deployed on the same nodes on which the tservers run. I got
> these exceptions more than 12000 times, as I can see on the Kibana dashboard.
>
> Thanks
> Mohit Kaushik
>
>
>
> On 12/23/2015 06:22 PM, Eric Newton wrote:
>
> The accumulo batch writer will re-send mutations if a tablet server fails,
> or rejects the mutations because the tablet has moved.  There's nothing you
> have to do to recover from fail-overs and re-balancing.
>
> I'm not a kernel expert, but I believe that a swappiness setting of "1" is
> equivalent to "0".
>
> The error you are seeing is part of the failing tablet server scenario.
> This is a bit complicated, so I'm going to name your three tablet servers
> A, B and C.
>
> Tablet server A is hosting a tablet, let's call it a-tablet.
> Tablet server B is hosting a metadata tablet, let's call it m-tablet.
> m-tablet records the information about a-tablet:
>
>    - where it is hosted
>    - what files it has, and their approximate sizes
>    - book-keeping related to bulk ingest
>    - etc. I think the O'Reilly Accumulo book has some great details
>
> Now when A ingests some data, it eventually flushes the updates from
> memory to a file.
> Tablet server A then writes this new information to m-tablet, on Tablet
> server B.
>
> Now for the failure:
> Tablet server A does a java memory garbage collection, and starts pulling
> data from swap. That makes it go really slow, and it loses its zookeeper
> session.
>
> But it's running so slowly that it takes a moment to realize it should
> die.
>
> In the meantime, the thread that is flushing memory attempts to update
> m-tablet with the new file information.
>
> Fortunately there's a constraint on m-tablet. The constraint is that
> mutations must contain a valid zookeeper session.  This prevents tablet
> server A from making updates to m-tablet when it no longer has the right to
> host the tablet.
>
> Your initial error is from tablet server A making an update to tablet
> server B's m-tablet.  It's getting a constraint violation: tablet server A
> has lost its zookeeper session, and will fail momentarily.
>
> To make this extra confusing: A and B might be the same server.
>
> -Eric
>
>
> On Tue, Dec 22, 2015 at 11:31 PM, mohit.kaushik <mohit.kaushik@orkash.com>
> wrote:
>
>>
>> I have 3 tablet servers hosting around 1.4K tablets. If a tablet server
>> loses its session with zookeeper and kills itself, the system takes some
>> time to move all of its hosted tablets to other servers.
>>
>> In this case, if an ingest is in progress, what should happen to the
>> mutations going to tablets hosted by that tablet server?
>> Is this the reason for the first exception? Should they not be redirected
>> to other servers?
>> And I had set the system swappiness to 1. Should I set it to 0 in this case?
>> I will check further.
>>
>> Thanks for the reply
>>
>> -Mohit Kaushik
>>
>>
>> On 12/22/2015 08:17 PM, Eric Newton wrote:
>>
>> A tablet server is given the rights to manage a tablet.
>>
>> To maintain consistency, it is critical that no other server uses the
>> tablet.
>>
>> To maintain the right to access a tablet, it must maintain a zookeeper
>> session. The zookeeper session periodically exchanges keep-alive messages.
>> If either party fails to get a keep-alive, zookeeper will close the
>> connection. The client can attempt to reconnect, but if it fails to do so,
>> the session will time out.
>>
>> If the tablet server loses its session with zookeeper, the rest of the
>> system can take over its tablets.
>>
>> When a tablet server detects that it lost its zookeeper session, it kills
>> itself to avoid doing anything with the tablets it no longer has the right
>> to host.
>>
>> What you are seeing here is the first step in that process, and it is
>> probably due to the tablet server not sending a keep-alive message to
>> zookeeper in time.
>>
>> There are many reasons for a tablet server to be delayed in sending a
>> keep-alive message. By far the most common is that your system is
>> over-subscribed for memory, and part of the tablet server's memory swapped
>> out. Once the java garbage collection cycle swapped it back in, there was a
>> considerable delay.
>>
>> However, there can be other things going on.  This is just a best guess.
>> Monitor swap usage, as a first diagnostic step.
>>
>> -Eric
>>
>>
>>
>> On Tue, Dec 22, 2015 at 8:30 AM, mohit.kaushik <mohit.kaushik@orkash.com>
>> wrote:
>>
>>> Dear All,
>>>
>>> A MutationsRejectedException reporting 1 server error can be seen on the
>>> client side:
>>> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 1 # exceptions 1
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>>>     at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
>>>     at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
>>>     at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
>>>     at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash1:9997
>>>     at
>>>
>>> I also found exceptions in Monitor related to Tracing.
>>>
>>> Tracing spans are being dropped because there are already 5000 spans queued for delivery.
>>> This does not affect performance, security or data integrity, but distributed tracing information is being lost.
>>>
>>> and 6458 times:
>>>
>>> Got an IOException in internalRead!
>>> 	java.io.IOException: Connection reset by peer
>>> 		at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> 		at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> 		at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> 		at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>> 		at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>> 		at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>>> 		at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:537)
>>> 		at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
>>> 		at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
>>> 		at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.select(CustomNonBlockingServer.java:228)
>>> 		at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.run
>>>
>>>
>>>
>>> I am also seeing the following exceptions in the tserver logs, and one
>>> tserver dies.
>>>
>>> 2015-12-22 09:37:27,173 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f8708e0d-9238-41f5-b948-8f435fd01207/tables/16/conf/table.split.threshold
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>>>         at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
>>>         at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
>>>         at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
>>>         at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
>>>         at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
>>>         at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
>>>         at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
>>>         at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:197)
>>>         at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1604)
>>>         at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1772)
>>>         at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1853)
>>>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> These exceptions are interfering with continuous data ingest, and I have
>>> also seen delays in queries and in table create commands.
>>> What could be the cause of these exceptions?
>>>
>>> Thanks
>>> Mohit Kaushik
>>>
>>>
>>
