giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: Deadlock when running on Hadoop 1.0.4
Date Tue, 29 Jan 2013 14:39:16 GMT
Quite honestly, I do not believe this is connected with ZK. It is quite
weird that it does not pass the tests in pseudo-distributed mode, and I
think it is quite serious that we cannot run the tests on 1.0, not even in
pseudo-distributed mode. I honestly do not know when the bug was
introduced. Looking at the logs, I think it MIGHT be connected with
multi-threading, but I cannot say for sure. What happens is that one worker
dies due to a Child Error during the computation of the last superstep
(number 20), while the others succeed and idle at the barrier. The last log
entry for the failing worker is in the GraphMapper, when the worker
announces the number of threads and partitions it is going to compute. That
is right before the compute thread is created and started. But this is
mostly speculation, pending a thorough analysis.
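
A minimal sketch of the symptom described above, assuming nothing about
Giraph's internals: it is not Giraph code, and the class and method names
(BarrierHangSketch, runSuperstep) are hypothetical. It only illustrates why
the surviving workers idle at a barrier when one participant dies before
reaching it, and how a bounded wait exposes the missing worker where an
unbounded wait would hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not Giraph code. All names are hypothetical.
class BarrierHangSketch {

    /**
     * Starts `workers` threads that each count down a shared latch when done.
     * If `oneWorkerDies` is true, worker 0 exits without reaching the barrier.
     * Returns true if every worker reached the barrier within `timeoutMs`.
     */
    static boolean runSuperstep(int workers, boolean oneWorkerDies, long timeoutMs) {
        CountDownLatch barrier = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                if (oneWorkerDies && id == 0) {
                    // Simulated crash: this worker never signals the barrier.
                    return;
                }
                barrier.countDown();
            });
            t.setDaemon(true);
            t.start();
        }
        try {
            // A bounded wait lets the caller notice the missing worker.
            // An unbounded barrier.await() would block here indefinitely.
            return barrier.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers healthy: " + runSuperstep(3, false, 2000));
        System.out.println("one worker died:     " + runSuperstep(3, true, 200));
    }
}
```

In the second call the latch count never reaches zero, so the wait can only
end by timing out. Whether Giraph's barrier behaves like this for the
failing job is exactly what still needs to be analyzed.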


On Sat, Jan 26, 2013 at 12:02 AM, Eli Reisman <apache.mailbox@gmail.com> wrote:

> Interesting. A dedicated ZK instance doesn't work with hadoop-2.0.x or
> trunk either when running Giraph on YARN/MRv2. I would like to look into
> this more if I have time. Anyone have any ideas? And does anyone have a
> definite timeline on how long this has been broken? Most of my work with
> Giraph last summer was on a cluster with its own ZK, so I have not used
> the feature much. I do remember it working on a 1.0.something Hadoop
> profile at maybe Christmas of 2011? But that was a long time ago...
>
>
> On Fri, Jan 25, 2013 at 3:07 AM, Sebastian Schelter <ssc@apache.org> wrote:
>
>> Hi,
>>
>> I get exactly the same deadlock when using a dedicated (non-distributed)
>> ZK instance. I tried 3.3.6 and 3.4.5.
>>
>> I haven't used Giraph for a while, so I can't say whether this has
>> worked recently...
>>
>> Best,
>> Sebastian
>>
>>
>>
>> On 23.01.2013 05:14, Eli Reisman wrote:
>> > Hi Sebastian,
>> >
>> > This seems to be a new issue related to our recent upgrade to
>> > multithreading. I have not seen this before. All my other emails
>> > were related to the array index out of bounds error you found over
>> > the weekend.
>> >
>> > However, I have had trouble with the local ZK instance for some time
>> > now on a number of Giraph profiles, and I pretty much exclusively use
>> > a separate ZK instance of my own. Last summer I was running a lot of
>> > jobs on a 1.0.x Hadoop cluster with Giraph, and I was told to use the
>> > on-cluster dedicated ZK quorum due to "problems" with Giraph's local
>> > ZK instantiation. No one got more specific with me than that. I also
>> > can't get the local ZK instances to come up on Hadoop-2.0.x -- perhaps
>> > this feature of Giraph has had problems for a while. Was it working
>> > for you recently?
>> >
>> > If you notice any other clues as to the cause, please post them. I'm
>> > hoping to do some work around this soon.
>> >
>> > On Tue, Jan 22, 2013 at 11:05 AM, Claudio Martella <
>> > claudio.martella@gmail.com> wrote:
>> >
>> >> Hi Sebastian,
>> >>
>> >> I do not know what is happening. I am also having problems with jobs
>> >> blocking while waiting to set up the ZooKeeper instance.
>> >> We should look into this.
>> >>
>> >> Best,
>> >> Claudio
>> >>
>> >>
>> >> On Mon, Jan 21, 2013 at 1:59 PM, Sebastian Schelter <ssc@apache.org> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I'm testing a custom PageRank implementation using trunk on Hadoop
>> >>> 1.0.4. I seem to run into a deadlock after the input superstep.
>> >>>
>> >>> The workers report "finishSuperstep: (all workers done) WORKER_ONLY -
>> >>> Attempt=0, Superstep=0" and the master reports that all workers are
>> >>> done with superstep -1.
>> >>>
>> >>> I reconstructed this using a local setup, and it seems to me that the
>> >>> BspServiceMaster hangs indefinitely in the cleanUpZooKeeper method.
>> >>>
>> >>> I'm not using a dedicated ZK instance, I just have Giraph start one.
>> >>> Any ideas what can be done to fix my problem?
>> >>>
>> >>> Best,
>> >>> Sebastian
>> >>>
>> >>>
>> >>> excerpt from jstack
>> >>>
>> >>> "org.apache.giraph.master.MasterThread" prio=10 tid=0x00007f29fc385000 nid=0x29d1 waiting on condition [0x00007f2a09a5f000]
>> >>>    java.lang.Thread.State: TIMED_WAITING (parking)
>> >>>         at sun.misc.Unsafe.park(Native Method)
>> >>>         - parking to wait for  <0x00000000f38967d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> >>>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>> >>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
>> >>>         at org.apache.giraph.zk.PredicateLock.waitMsecs(PredicateLock.java:112)
>> >>>         at org.apache.giraph.zk.PredicateLock.waitForever(PredicateLock.java:138)
>> >>>         at org.apache.giraph.master.BspServiceMaster.cleanUpZooKeeper(BspServiceMaster.java:1602)
>> >>>         at org.apache.giraph.master.BspServiceMaster.cleanup(BspServiceMaster.java:1692)
>> >>>         at org.apache.giraph.master.MasterThread.run(MasterThread.java:144)
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >>    Claudio Martella
>> >>    claudio.martella@gmail.com
>> >>
>> >
>>
>>
>


-- 
   Claudio Martella
   claudio.martella@gmail.com
