accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Some warnings/erros before TServer shutdown.
Date Tue, 08 Dec 2015 12:51:04 GMT
The short answer: your tserver was probably paused by swapping to disk.  It
may have been some other delay, but swapping is almost always the cause. I
recommend monitoring your swap usage and determining your overall memory
footprint to ensure it does not happen again.
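
As a rough illustration of the swap monitoring suggested above, here is a
minimal Python sketch that parses Linux /proc/meminfo-style text and reports
how much swap is in use. The function names and the sample values are
hypothetical; only the SwapTotal and SwapFree field names come from Linux.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(text):
    """Return swap in use (kB), computed as SwapTotal - SwapFree."""
    fields = parse_meminfo(text)
    return fields["SwapTotal"] - fields["SwapFree"]


# Hypothetical sample values; on a real Linux host, read the text with
# open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       16384000 kB
MemFree:          204800 kB
SwapTotal:       4194300 kB
SwapFree:        3145724 kB
"""

if __name__ == "__main__":
    print(f"swap in use: {swap_used_kb(SAMPLE)} kB")
```

Running something like this periodically and alerting when the value grows
is one simple way to notice swapping before it pauses a tserver.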

The long answer:

Communication with zookeeper requires periodic keep-alive messages to
ensure that processes are alive and well. If a process fails to check in,
zookeeper removes its session and any ephemeral nodes it was using, in
this case a lock.  A tserver's lock in zookeeper is what gives it the
responsibility to manage its tablets.
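
To make the keep-alive mechanism concrete, here is a toy Python model of
session expiry. It is not the ZooKeeper API: the class, the node path, and
the 30-second timeout are all assumptions chosen for illustration.

```python
class ToyCoordinator:
    """Toy stand-in for a coordination service; NOT the ZooKeeper API."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # session id -> time of last keep-alive
        self.ephemeral = {}       # node path -> owning session id

    def heartbeat(self, session, now):
        """Record a keep-alive from a session."""
        self.last_heartbeat[session] = now

    def create_ephemeral(self, path, session):
        """Create a node that lives only as long as its session."""
        self.ephemeral[path] = session

    def expire_sessions(self, now):
        """Drop sessions that missed the timeout, plus their ephemeral nodes."""
        expired = {s for s, t in self.last_heartbeat.items()
                   if now - t > self.session_timeout}
        for s in expired:
            del self.last_heartbeat[s]
        self.ephemeral = {p: s for p, s in self.ephemeral.items()
                          if s not in expired}
        return expired

    def holds(self, path, session):
        """True if the session still owns the node at this path."""
        return self.ephemeral.get(path) == session


# Hypothetical session name, lock path, and timeout.
zk = ToyCoordinator(session_timeout=30.0)
zk.heartbeat("tserver-1", now=0.0)
zk.create_ephemeral("/tservers/tserver-1/lock", "tserver-1")

zk.expire_sessions(now=10.0)    # checked in recently: lock survives
still_held = zk.holds("/tservers/tserver-1/lock", "tserver-1")

zk.expire_sessions(now=128.9)   # silent for too long: session and lock removed
held_after_pause = zk.holds("/tservers/tserver-1/lock", "tserver-1")
```

In this model the lock disappears as a side effect of the session expiring,
which is the behavior described above.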

Updates to this tserver caused it to perform some metadata table action,
such as adding a recently written file to its list of files. That update
goes to the tserver holding the metadata tablet.  This other tablet server
will throw the constraint violation you see here if the writing tablet
server has lost its lock in zookeeper. Later, we see the tablet server
realize it has been paused and has lost its lock, and then it kills itself.
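
The rejection step can be sketched in a few lines of Python. This is only a
toy model of the idea, not Accumulo's MetadataConstraints implementation:
the function and server names are hypothetical, and only the violation code
and description are taken from the log quoted below.

```python
# Violation code and text as they appear in the quoted log.
LOCK_NOT_HELD = 7
LOCK_NOT_HELD_DESCRIPTION = "Lock not held in zookeeper by writer"


def check_metadata_update(writer, live_locks):
    """Return violation codes for a metadata update sent by `writer`.

    live_locks is the set of tserver ids whose lock still exists
    (a hypothetical simplification of looking the lock up in zookeeper).
    """
    return [] if writer in live_locks else [LOCK_NOT_HELD]


# Hypothetical server names: tserver-1 was paused and lost its lock.
live_locks = {"tserver-2", "tserver-3"}
accepted = check_metadata_update("tserver-2", live_locks)
rejected = check_metadata_update("tserver-1", live_locks)
```

Here `accepted` is an empty list, while `rejected` carries code 7, which
corresponds to the ConstraintViolationException in the log.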

A pause of about two minutes (the log reports 128.9 seconds between checks)
is substantial, so you can see why zookeeper dismissed the process and it
lost its session, locks, and right to manage tablets.
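
As a small worked example, the Python sketch below pulls the pause length
out of the GC pause checker line from the quoted log and compares it with a
session timeout. The 30-second timeout is an assumption: the thread does not
state the configured value (in Accumulo this is governed by the
instance.zookeeper.timeout property, so check your own configuration). The
function names are hypothetical, and the log line has been joined from the
wrapped quote.

```python
import re

# Matches the "Expected every X seconds but was Y seconds" part of the line.
PAUSE_RE = re.compile(
    r"Expected every (?P<expected>[\d.]+) seconds "
    r"but was (?P<actual>[\d.]+) seconds since last check"
)


def parse_pause(line):
    """Return (expected_seconds, actual_seconds), or None if no match."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    return float(m.group("expected")), float(m.group("actual"))


def exceeds_timeout(actual_seconds, session_timeout_seconds):
    """True if the process was silent longer than the session timeout."""
    return actual_seconds > session_timeout_seconds


LOG_LINE = (
    "2015-12-07 14:11:02,476 [server.GarbageCollectionLogger] WARN : "
    "GC pause checker not called in a timely fashion. Expected every "
    "30.0 seconds but was 128.9 seconds since last check"
)

ASSUMED_SESSION_TIMEOUT = 30.0  # assumption, not taken from the thread

expected, actual = parse_pause(LOG_LINE)
session_lost = exceeds_timeout(actual, ASSUMED_SESSION_TIMEOUT)
```

Under that assumed timeout, a 128.9-second gap is several times longer than
zookeeper would wait, so `session_lost` comes out true.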

-Eric


On Tue, Dec 8, 2015 at 3:52 AM, Martin Grimmer <Martin.Grimmer@mgm-tp.com>
wrote:

> Hello community,
>
>
>
> I am using Accumulo 1.7. Yesterday the following log messages appeared
> before one of the tservers went down.
>
> Could you give me some hints about their meaning, to better understand
> what was happening?
>
>
>
> 2015-12-07 14:11:02,414 [util.MetadataTableUtil] ERROR: null
>
> ConstraintViolationException(violationSummaries:[TConstraintViolationSummary(constrainClass:org.apache.accumulo.server.constraints.MetadataConstraints,
> violationCode:7, violationDescription:Lock not held in zookeeper by writer,
> numberOfViolatingMutations:1)])
>
>
>
> 2015-12-07 14:11:02,415 [log.TabletServerLogger] ERROR: Unexpected error
> writing to log, retrying attempt 2
>
> java.lang.RuntimeException:
> ConstraintViolationException(violationSummaries:[TConstraintViolationSummary(constrainClass:org.apache.accumulo.server.constraints.MetadataConstraints,
> violationCode:7, violationDescription:Lock not held in zookeeper by writer,
> numberOfViolatingMutations:1)])
>
>
>
> 2015-12-07 14:11:02,403 [impl.Writer] ERROR: error sending update to
> server2.cluster.org:9997:
> ConstraintViolationException(violationSummaries:[TConstraintViolationSummary(constrainClass:org.apache.accumulo.server.constraints.MetadataConstraints,
> violationCode:7, violationDescription:Lock not held in zookeeper by writer,
> numberOfViolatingMutations:1)])
>
>
>
> 2015-12-07 14:11:02,463 [zookeeper.DistributedWorkQueue] INFO : Got
> unexpected zookeeper event: None for
> /accumulo/8a7f6781-ae6e-44bc-a717-5b8cbd28d647/recovery
>
> 2015-12-07 14:11:02,461 [util.MetadataTableUtil] ERROR: null
>
> ConstraintViolationException(violationSummaries:[TConstraintViolationSummary(constrainClass:org.apache.accumulo.server.constraints.MetadataConstraints,
> violationCode:7, violationDescription:Lock not held in zookeeper by writer,
> numberOfViolatingMutations:1)])
>
>
>
> 2015-12-07 14:11:02,476 [tserver.TabletServer] ERROR: Lost tablet server
> lock (reason = SESSION_EXPIRED), exiting.
>
> 2015-12-07 14:11:02,476 [server.GarbageCollectionLogger] WARN : GC pause
> checker not called in a timely fashion. Expected every 30.0 seconds but was
> 128.9 seconds since last check
>
>
>
>
>
> I assume there was too much load on the server, so it had problems
> communicating with Zookeeper, am I right?
>
>
>
> Best regards
>
> Martin Grimmer
>
