drill-dev mailing list archives

From Sudheesh Katkam <skat...@maprtech.com>
Subject Re: ZK lost connectivity issue on large cluster
Date Wed, 14 Sep 2016 20:40:04 GMT
Hi Francois,

A few questions:
+ How many ZooKeeper servers are in the quorum?
+ What is the load on atsqa4-133.qa.lab when this happens? Are any other applications running
on that node? How many threads is the Drill process using?
+ When running the same query on 12 nodes, is the data size the same?
+ Can you share the query profile?

This may not be the right thing to do, but for now, if the cluster is heavily loaded, increase
the ZK session timeout.
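
One caveat when raising the timeout: the ZooKeeper server clamps whatever the client requests
to the range [minSessionTimeout, maxSessionTimeout], which default to 2x and 20x tickTime.
With the common tickTime of 2000 ms that gives a ceiling of 40 seconds, which matches the
negotiated value reported in the original message below. The sketch below only illustrates
that clamping rule; the function and parameter names are made up for illustration and are not
Drill or ZooKeeper APIs. On the Drill side I believe the setting is drill.exec.zk.timeout in
drill-override.conf, but please verify that against your version.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000,
                              min_session_timeout_ms=None,
                              max_session_timeout_ms=None):
    """Illustrative only: the session timeout a ZooKeeper server would grant.

    Assumes the documented defaults: minSessionTimeout = 2 * tickTime and
    maxSessionTimeout = 20 * tickTime when they are not set explicitly.
    """
    low = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    high = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # The server clamps the client's request into [low, high].
    return max(low, min(requested_ms, high))


if __name__ == "__main__":
    # Client asks for 120 s, server uses default limits.
    print(negotiate_session_timeout(120_000))
    # Same request after raising maxSessionTimeout on the server to 180 s.
    print(negotiate_session_timeout(120_000, max_session_timeout_ms=180_000))
```

In other words, asking for more than 40 seconds on the client has no effect unless
maxSessionTimeout (or tickTime) is also raised in the ZooKeeper server configuration.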

Thank you,
Sudheesh

> On Sep 14, 2016, at 11:53 AM, François Méthot <fmethot78@gmail.com> wrote:
> 
> We are running 1.7.
> The logs were taken from the JIRA tickets.
> 
> We will try out 1.8 soon.
> 
> 
> 
> 
> On Wed, Sep 14, 2016 at 2:52 PM, Chun Chang <cchang@maprtech.com> wrote:
> 
>> Looks like you are running 1.5. I believe some work has been done in that
>> area, and the newer releases should behave better.
>> 
>> On Wed, Sep 14, 2016 at 11:43 AM, François Méthot <fmethot78@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>>  We are trying to find a solution/workaround for this issue:
>>> 
>>> 2016-01-28 16:36:14,367 [Curator-ServiceCache-0] ERROR o.a.drill.exec.work.foreman.Foreman - SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query.  Identified nodes were [atsqa4-133.qa.lab:31010].
>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [atsqa4-133.qa.lab:31010].
>>>        at org.apache.drill.exec.work.foreman.Foreman$ForemanResult.close(Foreman.java:746) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:858) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:790) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:792) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:909) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:110) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>>        at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1183) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>>> 
>>> 
>>> DRILL-4325 <https://issues.apache.org/jira/browse/DRILL-4325>
>>> ForemanException: One or more nodes lost connectivity during query
>>> 
>>> 
>>> 
>>> Has anyone experienced this issue?
>>> 
>>> It happens when running a query involving many Parquet files on a cluster of
>>> 200 nodes. The same query on a smaller cluster of 12 nodes runs fine.
>>> 
>>> It is not caused by garbage collection (checked on both the ZK node and the
>>> involved Drillbit).
>>> 
>>> Negotiated max session timeout is 40 seconds.
>>> 
>>> The sequence seems to be:
>>> - The Drill query begins, using an existing ZK session.
>>> - The Drill ZK session times out
>>>       - perhaps it was writing something that took too long
>>> - Drill attempts to renew the session
>>>       - Drill believes that the write operation failed, so it attempts to
>>>         re-create the ZK node, which triggers another exception.
>>> 
>>> We are open to any suggestion. We will report any finding.
>>> 
>>> Thanks
>>> Francois
>>> 
>> 

