giraph-user mailing list archives

From: Hassan Eslami <hsn.esl...@gmail.com>
Subject: Re: Out of core computation fails with KryoException: Buffer underflow
Date: Wed, 09 Nov 2016 17:59:17 GMT
Yes. I think what Sergey meant is that OOC is capable of spilling even 90% of
the graph to disk; he gave that number as an example to show that OOC is not
limited by memory.

In your case, where you have a 1TB graph and 10TB of disk space, OOC would
let the computation finish just fine. Be aware, though, that the more data
goes to disk, the more time is spent reading it back into memory. For
instance, if you have a 1TB graph and 100GB of memory and you are running on
a single machine, 90% of the graph goes to disk. If your computation per
vertex is not too heavy (which is usually the case), the execution time will
be bounded by disk operations. Say you are using a disk with 150MB/s
bandwidth (a good HDD). In the example I mentioned, each superstep would need
900GB to be read from and also written to the disk. That's roughly 3.5 hours
per superstep.
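Spelling that estimate out with the same assumed numbers: 90% of 1TB is about
900GB, which has to be both written to disk and read back each superstep, so
roughly 1.8TB of disk traffic; 1.8TB / 150MB/s is about 12,000 seconds, a bit
over 3 hours, which is where the rough 3.5-hour figure comes from.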

Best,
Hassan

On Wed, Nov 9, 2016 at 11:44 AM, Hai Lan <lanhai1988@gmail.com> wrote:

> Hello Hassan
>
> The 90% comes from Sergey Edunov, who said "speaking of out of core, we
> tried to spill up to 90% of the graph to the disk." So I guessed it might
> mean that OOC is still limited by memory size if the input graph is more
> than 10 times larger than memory. Reading your response, I just want to
> double check: if the disk is larger than the input graph, say a 1TB graph
> and 10TB of disk space, it should be able to run, correct?
>
> Thanks again
>
> Best,
>
> Hai
>
>
> On Wed, Nov 9, 2016 at 12:33 PM, Hassan Eslami <hsn.eslami@gmail.com> wrote:
>
>> Hi Hai,
>>
>> 1. One of the goals of the adaptive mechanism was to make OOC faster than
>> the case where you specify the number of partitions explicitly. In
>> particular, if you don't know exactly what the number of in-memory
>> partitions should be, you may end up setting it to a pessimistic number and
>> not taking advantage of all the available memory. That being said, the
>> adaptive mechanism should always be preferred if you are aiming for higher
>> performance. The adaptive mechanism also avoids OOM failures due to message
>> overflow, so it provides higher robustness as well.
>>
>> 2. I don't understand where the 90% you are mentioning comes from. In my
>> example in the other email, the 90% was the suggested size of the tenured
>> memory region (to reduce GC overhead). The OOC mechanism works independently
>> of how much memory is available. There are two fundamental limits for OOC,
>> though: a) OOC assumes that one partition and its messages fit entirely in
>> memory. So, if partitions are so large that any one of them won't fit in
>> memory, you should increase the number of partitions. b) OOC is limited by
>> the "disk" size on each machine. If the amount of data on each machine
>> exceeds the "disk" size, OOC will fail. In that case, you should use more
>> machines or decrease your graph size somehow.
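>> As an aside, and purely as an illustration with a made-up number: to shrink
>> individual partitions you can raise the partition count on the command line,
>> e.g. -ca giraph.userPartitionCount=2000, so that each partition plus its
>> messages stays well below a worker's available memory.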
>>
>> Best,
>> Hassan
>>
>>
>> On Wed, Nov 9, 2016 at 9:30 AM, Hai Lan <lanhai1988@gmail.com> wrote:
>>
>>> Many thanks Hassan
>>>
>>> I did test a fixed number of partitions without isStaticGraph=true and
>>> it works great.
>>>
>>> I'll follow your instructions and test the adaptive mechanism next. But I
>>> have two small questions:
>>>
>>> 1. Is there any performance difference between the fixed-number setting
>>> and the adaptive setting?
>>>
>>> 2. As I understand it, out-of-core can only spill up to 90% of the input
>>> graph to disk. Does that mean, for example, that a 10 TB graph needs at
>>> least 1 TB of available memory to be processed?
>>>
>>> Thanks again,
>>>
>>> Best,
>>>
>>> Hai
>>>
>>>
>>> On Tue, Nov 8, 2016 at 12:42 PM, Hassan Eslami <hsn.eslami@gmail.com> wrote:
>>>
>>>> Hi Hai,
>>>>
>>>> I notice that you are trying to use the new OOC mechanism too. Here is
>>>> my take on your issue:
>>>>
>>>> As mentioned earlier in the thread, we noticed there is a bug with the
>>>> "isStaticGraph=true" option. This flag is only an optimization. I'll create
>>>> a JIRA and send a fix for it, but for now, please run your job without this
>>>> flag; that should let you get past the first superstep.
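>>>> (Concretely, in the command you posted earlier that just means dropping
>>>> giraph.isStaticGraph=true from your -ca list and keeping the rest of the
>>>> options unchanged.)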
>>>>
>>>> As for the adaptive mechanism vs. a fixed number of partitions, both
>>>> approaches are acceptable in the new OOC design. If you add
>>>> "giraph.maxPartitionsInMemory", the OOC infrastructure assumes that you are
>>>> using a fixed number of partitions in memory and ignores any other
>>>> OOC-related flags in your command. This is done to stay backward compatible
>>>> with existing code that depends on OOC from the previous version. But be
>>>> advised that this type of out-of-core execution WILL NOT prevent your job
>>>> from failing due to spikes in messages. Also, you have to make sure the
>>>> number of in-memory partitions you specify is small enough that those
>>>> partitions and their messages fit in your available memory.
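>>>> (Illustrative numbers only, not a recommendation for your data: the
>>>> fixed-partition style looks like
>>>> -ca giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=10
>>>> where the maxPartitionsInMemory value is chosen so that that many
>>>> partitions, plus their messages, fit in a worker's heap.)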
>>>>
>>>> On the other hand, I encourage you to use the adaptive mechanism, in which
>>>> you do not have to specify the number of partitions in memory and the OOC
>>>> mechanism underneath figures things out automatically. To use the adaptive
>>>> mechanism, you should set the following flags:
>>>> giraph.useOutOfCoreGraph=true
>>>> giraph.waitForRequestsConfirmation=false
>>>> giraph.waitForPerWorkerRequests=true
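>>>>
>>>> For illustration only (the jar path, computation class and I/O options
>>>> below are placeholders for whatever you already use; it is one command,
>>>> wrapped for readability):
>>>> hadoop jar <your-giraph-jar> org.apache.giraph.GiraphRunner
>>>>   -Dgiraph.useOutOfCoreGraph=true
>>>>   -Dgiraph.waitForRequestsConfirmation=false
>>>>   -Dgiraph.waitForPerWorkerRequests=true
>>>>   <YourComputation> -vif <input format> -vip <input path>
>>>>   -vof <output format> -op <output path> -w <number of workers>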
>>>>
>>>> I know the naming of these flags is a bit bizarre, but this sets up the
>>>> infrastructure for message flow control, which is crucial to avoid failures
>>>> due to messages. The default strategy for the adaptive mechanism is
>>>> threshold based, meaning that there are a number of thresholds (their
>>>> default values are defined in the ThresholdBasedOracle class) and the
>>>> system reacts to them. You should follow some (fairly easy) guidelines to
>>>> set the proper thresholds for your system; please refer to the other email
>>>> response in the same thread for guidelines on how to set them properly.
>>>>
>>>> Hope it helps,
>>>> Best,
>>>> Hassan
>>>>
>>>> On Tue, Nov 8, 2016 at 11:01 AM, Hai Lan <lanhai1988@gmail.com> wrote:
>>>>
>>>>> Hello Denis
>>>>>
>>>>> Thanks for your quick response.
>>>>>
>>>>> I just tested setting the timeout to 3600000, and superstep 0 can now
>>>>> finish. However, the job is killed immediately when superstep 1 starts.
>>>>> In the ZooKeeper log:
>>>>>
>>>>> 2016-11-08 11:54:13,569 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: checkWorkers: Only found 198 responses of 199 needed to start superstep 1.  Reporting every 30000 msecs, 511036 more msecs left before giving up.
>>>>> 2016-11-08 11:54:13,570 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 13 (could be master)
>>>>> 2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14e81 zxid:0xc76 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
>>>>> 2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14e82 zxid:0xc77 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
>>>>> 2016-11-08 11:54:21,045 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14f4b zxid:0xc79 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
>>>>> 2016-11-08 11:54:21,046 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14f4c zxid:0xc7a txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
>>>>> 2016-11-08 11:54:21,094 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 0 failed, 0 failures total.
>>>>> 2016-11-08 11:54:21,095 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionBalancer: balancePartitionsAcrossWorkers: Using algorithm static
>>>>> 2016-11-08 11:54:21,097 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: [Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=22, port=30022):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=51, port=30051):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=99, port=30099):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=159, port=30159):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=189, port=30189):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=166, port=30166):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=172, port=30172):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=195, port=30195):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=116, port=30116):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=154, port=30154):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=2, port=30002):(v=58590001, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=123, port=30123):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=52, port=30052):(v=48825001, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=188, port=30188):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=165, port=30165):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=171, port=30171):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=23, port=30023):(v=48825003, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=117, port=30117):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=20, port=30020):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=89, port=30089):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=53, port=30053):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=168, port=30168):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=187, port=30187):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=179, port=30179):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=118, port=30118):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=75, port=30075):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=152, port=30152):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu 
hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=21, port=30021):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=88, port=30088):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=180, port=30180):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=54, port=30054):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=76, port=30076):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=119, port=30119):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=167, port=30167):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=153, port=30153):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=196, port=30196):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=170, port=30170):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=103, port=30103):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=156, port=30156):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=120, port=30120):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=150, port=30150):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=67, port=30067):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=59, port=30059):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=84, port=30084):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=19, port=30019):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=102, port=30102):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=169, port=30169):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=34, port=30034):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=162, port=30162):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=157, port=30157):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=83, port=30083):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=151, port=30151):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=121, port=30121):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=131, port=30131):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=101, port=30101):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=161, port=30161):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=122, port=30122):(v=48824999, 
e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=158, port=30158):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=148, port=30148):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=86, port=30086):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=140, port=30140):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=91, port=30091):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=100, port=30100):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=160, port=30160):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=149, port=30149):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=85, port=30085):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=139, port=30139):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=13, port=30013):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=80, port=30080):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=92, port=30092):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=112, port=30112):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=147, port=30147):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=184, port=30184):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=8, port=30008):(v=48825004, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=45, port=30045):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=58, port=30058):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=32, port=30032):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=106, port=30106):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=63, port=30063):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=142, port=30142):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=71, port=30071):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=40, port=30040):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=130, port=30130):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=39, port=30039):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=111, port=30111):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=93, port=30093):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, 
MRtaskID=14, port=30014):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=7, port=30007):(v=48825004, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=185, port=30185):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=46, port=30046):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=33, port=30033):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=105, port=30105):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=62, port=30062):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=141, port=30141):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=70, port=30070):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=135, port=30135):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=186, port=30186):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=94, port=30094):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=114, port=30114):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=43, port=30043):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=30, port=30030):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=128, port=30128):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=144, port=30144):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=69, port=30069):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=194, port=30194):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010):(v=48825004, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=132, port=30132):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=104, port=30104):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=61, port=30061):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=11, port=30011):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=42, port=30042):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=178, port=30178):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=12, port=30012):(v=48825003, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=134, port=30134):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=113, port=30113):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=44, port=30044):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu 
hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=155, port=30155):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=31, port=30031):(v=48825003, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=68, port=30068):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=9, port=30009):(v=48825004, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=95, port=30095):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=143, port=30143):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=133, port=30133):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=60, port=30060):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=129, port=30129):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=177, port=30177):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=41, port=30041):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=198, port=30198):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=193, port=30193):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=145, port=30145):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=66, port=30066):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=28, port=30028):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=4, port=30004):(v=58590001, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=26, port=30026):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=181, port=30181):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=55, port=30055):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=96, port=30096):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=17, port=30017):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=36, port=30036):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=176, port=30176):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=108, port=30108):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=77, port=30077):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=126, port=30126):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=136, port=30136):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=197, port=30197):(v=48824999, 
e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=90, port=30090):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=29, port=30029):(v=48825003, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=192, port=30192):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=50, port=30050):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=3, port=30003):(v=58590001, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=56, port=30056):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=97, port=30097):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=18, port=30018):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=127, port=30127):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=35, port=30035):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=78, port=30078):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=175, port=30175):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=82, port=30082):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=107, port=30107):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=74, port=30074):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=47, port=30047):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=1, port=30001):(v=58589999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=115, port=30115):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=38, port=30038):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=110, port=30110):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=15, port=30015):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=124, port=30124):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=6, port=30006):(v=48825004, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=191, port=30191):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=182, port=30182):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=174, port=30174):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=98, port=30098):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=164, port=30164):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=65, port=30065):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=24, 
port=30024):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=79, port=30079):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=73, port=30073):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=138, port=30138):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=81, port=30081):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005):(v=58590001, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=190, port=30190):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=27, port=30027):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=37, port=30037):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=199, port=30199):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=146, port=30146):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=57, port=30057):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=48, port=30048):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=183, port=30183):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=173, port=30173):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=25, port=30025):(v=48825003, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=64, port=30064):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=16, port=30016):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=125, port=30125):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=109, port=30109):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=163, port=30163):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=72, port=30072):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=137, port=30137):(v=48824999, e=0),]
>>>>> 2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 49070351, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 48824999, Max: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 58590001
>>>>> 2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 0, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 0, Max: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 0
>>>>> 2016-11-08 11:54:21,104 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 1 on path /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerFinishedDir
>>>>> 2016-11-08 11:54:29,090 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 1
>>>>> 2016-11-08 11:54:29,094 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30044 type:create cxid:0x1b zxid:0xd46 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>> 2016-11-08 11:54:29,094 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1}
>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba3004a type:create cxid:0x1b zxid:0xd47 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba3004f type:create cxid:0x1b zxid:0xd48 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30054 type:create cxid:0x1b zxid:0xd49 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>> 2016-11-08 11:54:29,096 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: failJob: Killing job job_1477020594559_0051
>>>>>
>>>>>
>>>>> Any other ideas?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> BR,
>>>>>
>>>>>
>>>>> Hai
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 8, 2016 at 9:48 AM, Denis Dudinski <denis.dudinski@gmail.com> wrote:
>>>>>
>>>>>> Hi Hai,
>>>>>>
>>>>>> I think we saw something like this in our environment.
>>>>>>
>>>>>> The interesting line is this one:
>>>>>> 2016-10-27 19:04:00,000 INFO [SessionTracker]
>>>>>> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
>>>>>> 0x158084f5b2100b8, timeout of 600000ms exceeded
>>>>>>
>>>>>> I think that one of the workers did not communicate with ZooKeeper for
>>>>>> quite a long time for some reason (it may be heavy network load or high
>>>>>> CPU consumption; check your monitoring infrastructure, it should give
>>>>>> you a hint). The ZooKeeper session expires and all ephemeral nodes for
>>>>>> that worker in the ZooKeeper tree are deleted. The master then thinks
>>>>>> the worker is dead and halts the computation.
>>>>>>
>>>>>> Your ZooKeeper session timeout is 600000 ms, which is 10 minutes. We set
>>>>>> this to a much higher value (1 hour) and were then able to perform our
>>>>>> computations successfully.
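>>>>>> If I remember correctly, the relevant property is
>>>>>> giraph.zkSessionMsecTimeout (please double-check the exact name against
>>>>>> GiraphConstants for your Giraph version), so you would pass something
>>>>>> like -Dgiraph.zkSessionMsecTimeout=3600000 on the command line.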
>>>>>>
>>>>>> I hope it will help in your case too.
>>>>>>
>>>>>> Best Regards,
>>>>>> Denis Dudinski
>>>>>>
>>>>>> 2016-11-08 16:43 GMT+03:00 Hai Lan <lanhai1988@gmail.com>:
>>>>>> > Hi Guys
>>>>>> >
>>>>>> > The OutOfMemoryError might be solved by adding
>>>>>> > "-Dmapreduce.map.memory.mb=14848". But in my tests I found some more
>>>>>> > problems when running the out-of-core graph.
>>>>>> >
>>>>>> > I did two tests on version 1.2 with a 150GB input of 10^10 vertices,
>>>>>> > and it seems it is not necessary to add something like
>>>>>> > "giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1" because
>>>>>> > it is adaptive. However, if I run without setting userPartitionCount
>>>>>> > and maxPartitionsInMemory, the job keeps running in superstep -1
>>>>>> > forever; none of the workers can finish superstep -1. And I can see a
>>>>>> > warning in the ZooKeeper log, not sure if it is the problem:
>>>>>> >
>>>>>> > WARN [netty-client-worker-3] org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address trantor21.umiacs.umd.edu/192.168.74.221:30172
>>>>>> > java.lang.ArrayIndexOutOfBoundsException: 1075052544
>>>>>> >       at org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSignalFlag(NoOpFlowControl.java:52)
>>>>>> >       at org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:796)
>>>>>> >       at org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
>>>>>> >       at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
>>>>>> >       at org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:74)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
>>>>>> >       at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
>>>>>> >       at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
>>>>>> >       at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
>>>>>> >       at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
>>>>>> >       at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
>>>>>> >       at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
>>>>>> >       at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
>>>>>> >       at java.lang.Thread.run(Thread.java:745)
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > If I add giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1,
>>>>>> > the whole command is:
>>>>>> >
>>>>>> > hadoop jar /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/target/giraph-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar
>>>>>> > org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true
>>>>>> > -Ddigraph.block_factory_configurators=org.apache.giraph.conf.FacebookConfiguration
>>>>>> > -Dmapreduce.map.memory.mb=14848 org.apache.giraph.examples.myTask
>>>>>> > -vif org.apache.giraph.examples.LongFloatNullTextInputFormat -vip /user/hlan/cube/tmp/out/
>>>>>> > -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hlan/output -w 199
>>>>>> > -ca mapred.job.tracker=localhost:5431,steps=6,giraph.isStaticGraph=true,giraph.numInputThreads=10,giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1
>>>>>> >
>>>>>> > the job passes superstep -1 very quickly (around 10 minutes), but it
>>>>>> > gets killed near the end of superstep 0.
>>>>>> >
>>>>>> > 2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
>>>>>> Vertices
>>>>>> > - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu
>>>>>> > hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) -
>>>>>> 9771533, Max:
>>>>>> > Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=
>>>>>> trantor02.umiacs.umd.edu,
>>>>>> > MRtaskID=49, port=30049) - 9995724
>>>>>> > 2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.partition.PartitionUtils: analyzePartitionStats:
>>>>>> Edges -
>>>>>> > Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu
>>>>>> > hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0,
>>>>>> Max:
>>>>>> > Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=
>>>>>> trantor02.umiacs.umd.edu,
>>>>>> > MRtaskID=49, port=30049) - 0
>>>>>> > 2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:54:26,638 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:54:56,640 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:55:26,641 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:55:56,642 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:56:26,643 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:56:56,644 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:57:26,645 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:57:56,646 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:58:26,647 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:58:56,675 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:59:26,676 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 18:59:56,677 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:00:26,678 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:00:56,679 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:01:26,680 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>> exception
>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>> sessionid
>>>>>> > 0x158084f5b2100c6, likely client has closed socket
>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>> .java:220)
>>>>>> > at
>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>> erCnxnFactory.java:208)
>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>> > 2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>> connection for
>>>>>> > client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
>>>>>> > 2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>> connection
>>>>>> > from /192.168.74.212:56696
>>>>>> > 2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>> renew
>>>>>> > session 0x158084f5b2100c6 at /192.168.74.212:56696
>>>>>> > 2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>> > 0x158084f5b2100c6 with negotiated timeout 600000 for client
>>>>>> > /192.168.74.212:56696
>>>>>> > 2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>> exception
>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>> sessionid
>>>>>> > 0x158084f5b2100c5, likely client has closed socket
>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>> .java:220)
>>>>>> > at
>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>> erCnxnFactory.java:208)
>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>> > 2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>> connection for
>>>>>> > client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
>>>>>> > 2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>> connection
>>>>>> > from /192.168.74.212:56718
>>>>>> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>> renew
>>>>>> > session 0x158084f5b2100c5 at /192.168.74.212:56718
>>>>>> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>> > 0x158084f5b2100c5 with negotiated timeout 600000 for client
>>>>>> > /192.168.74.212:56718
>>>>>> > 2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.Mast
>>>>>> erThread]
>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>> out of 199
>>>>>> > workers finished on superstep 0 on path
>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>> 0.0.0.0:22181]
>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>> exception
>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>> sessionid
>>>>>> > 0x158084f5b2100b9, likely client has closed socket
>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>> .java:220)
>>>>>> > at
>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>> > 2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
>>>>>> > 2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.203:54676
>>>>>> > 2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100b9 at /192.168.74.203:54676
>>>>>> > 2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100b9 with negotiated timeout 600000 for client /192.168.74.203:54676
>>>>>> > 2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
>>>>>> > EndOfStreamException: Unable to read additional data from client sessionid 0x158084f5b2100be, likely client has closed socket
>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>>>>>> > at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>> > 2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
>>>>>> > 2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /192.168.74.203:54694
>>>>>> > 2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Client attempting to renew session 0x158084f5b2100be at /192.168.74.203:54694
>>>>>> > 2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:22181] org.apache.zookeeper.server.ZooKeeperServer: Established session 0x158084f5b2100be with negotiated timeout 600000 for client /192.168.74.203:54694
>>>>>> > 2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 0 on path /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0/_superstepDir/0/_workerFinishedDir
>>>>>> > 2016-10-27 19:04:00,000 INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100b8, timeout of 600000ms exceeded
>>>>>> > 2016-10-27 19:04:00,001 INFO [SessionTracker] org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x158084f5b2100c2, timeout of 600000ms exceeded
>>>>>> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100b8
>>>>>> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x158084f5b2100c2
>>>>>> > 2016-10-27 19:04:00,004 INFO [SyncThread:0] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
>>>>>> > 2016-10-27 19:04:00,006 INFO [SyncThread:0] org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
>>>>>> > 2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 0
>>>>>> >
>>>>>> > Any idea about this?
>>>>>> >
>>>>>> > Thanks,
>>>>>> >
>>>>>> > Hai
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <denis.dudinski@gmail.com> wrote:
>>>>>> >>
>>>>>> >> Hi Xenia,
>>>>>> >>
>>>>>> >> Thank you! I'll check the thread you mentioned.
>>>>>> >>
>>>>>> >> Best Regards,
>>>>>> >> Denis Dudinski
>>>>>> >>
>>>>>> >> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
>>>>>> >> > Hi Denis,
>>>>>> >> >
>>>>>> >> > For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error,
>>>>>> >> > I hope the conversation at the link below can help you:
>>>>>> >> > www.mail-archive.com/user@giraph.apache.org/msg02938.html
>>>>>> >> >
>>>>>> >> > Regards,
>>>>>> >> > Xenia
>>>>>> >> >
>>>>>> >> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudinski@gmail.com>:
>>>>>> >> >>
>>>>>> >> >> Hi Hassan,
>>>>>> >> >>
>>>>>> >> >> Thank you for really quick response!
>>>>>> >> >>
>>>>>> >> >> I changed "giraph.isStaticGraph" to false and the error disappeared.
>>>>>> >> >> As expected, the iteration became slower and wrote edges to disk once
>>>>>> >> >> again in superstep 1.
>>>>>> >> >>
>>>>>> >> >> However, the computation failed at superstep 2 with the error
>>>>>> >> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
>>>>>> >> >> be unrelated to the "isStaticGraph" issue, but I think it is worth
>>>>>> >> >> mentioning to see the picture as a whole.
>>>>>> >> >>
>>>>>> >> >> Are there any other tests/information I could run or check to help
>>>>>> >> >> pinpoint the "isStaticGraph" problem?
>>>>>> >> >>
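>>>>>> >> >> (For context, the related setting already in our launch command is
>>>>>> >> >>
>>>>>> >> >>   -ca giraph.yarn.task.heap.mb=48000 \
>>>>>> >> >>
>>>>>> >> >> Raising it is one obvious knob to try for the GC overhead error, but
>>>>>> >> >> only if the nodes actually have spare physical memory; any larger
>>>>>> >> >> value would just be a guess on our side.)
>>>>>> >> >>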
>>>>>> >> >> Best Regards,
>>>>>> >> >> Denis Dudinski
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com>:
>>>>>> >> >> > Hi Denis,
>>>>>> >> >> >
>>>>>> >> >> > Thanks for bringing up the issue. In the previous conversation
>>>>>> >> >> > thread, a similar problem was reported even with a simpler example,
>>>>>> >> >> > a connected components calculation. However, back then we were still
>>>>>> >> >> > developing other performance-critical components of OOC.
>>>>>> >> >> >
>>>>>> >> >> > Let's debug this issue together to make the new OOC more stable. I
>>>>>> >> >> > suspect the problem is with "giraph.isStaticGraph=true" (it is only
>>>>>> >> >> > an optimization, and most of our end-to-end testing was on cases
>>>>>> >> >> > where the graph could change). Let's get rid of it for now and see
>>>>>> >> >> > if the problem still exists.
>>>>>> >> >> >
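>>>>>> >> >> > Concretely (a minimal sketch that only touches the one flag from your
>>>>>> >> >> > existing launch command; everything else stays as it is): either drop
>>>>>> >> >> > the "-ca giraph.isStaticGraph=true" line entirely, or set it to false
>>>>>> >> >> > explicitly while keeping OOC enabled:
>>>>>> >> >> >
>>>>>> >> >> >   -ca giraph.useOutOfCoreGraph=true \
>>>>>> >> >> >   -ca giraph.isStaticGraph=false \
>>>>>> >> >> >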
>>>>>> >> >> > Best,
>>>>>> >> >> > Hassan
>>>>>> >> >> >
>>>>>> >> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
>>>>>> >> >> > <denis.dudinski@gmail.com>
>>>>>> >> >> > wrote:
>>>>>> >> >> >>
>>>>>> >> >> >> Hello,
>>>>>> >> >> >>
>>>>>> >> >> >> We are trying to calculate PageRank on a huge graph which does not
>>>>>> >> >> >> fit into memory. For the calculation to succeed we tried to turn on
>>>>>> >> >> >> the OutOfCore feature of Giraph, but every launch we tried resulted
>>>>>> >> >> >> in com.esotericsoftware.kryo.KryoException: Buffer underflow.
>>>>>> >> >> >> Each time it happens on a different server, but always right after
>>>>>> >> >> >> the start of superstep 1.
>>>>>> >> >> >>
>>>>>> >> >> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our production version;
>>>>>> >> >> >> we can't step back to Giraph's officially supported version and had
>>>>>> >> >> >> to patch Giraph a little), deployed on 11 servers plus 3 master
>>>>>> >> >> >> servers (namenodes etc.) with a separate ZooKeeper cluster.
>>>>>> >> >> >>
>>>>>> >> >> >> Our launch command:
>>>>>> >> >> >>
>>>>>> >> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
>>>>>> >> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
>>>>>> >> >> >> -yj pr-job-jar-with-dependencies.jar \
>>>>>> >> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
>>>>>> >> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>>>>>> >> >> >> -op /user/hadoop/output/pr_test \
>>>>>> >> >> >> -w 10 \
>>>>>> >> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
>>>>>> >> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
>>>>>> >> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
>>>>>> >> >> >> -ca giraph.logLevel=info \
>>>>>> >> >> >> -ca hbase.mapreduce.inputtable=di_test \
>>>>>> >> >> >> -ca hbase.mapreduce.scan.columns=di:n \
>>>>>> >> >> >> -ca hbase.defaults.for.version.skip=true \
>>>>>> >> >> >> -ca hbase.table.row.textkey=false \
>>>>>> >> >> >> -ca giraph.yarn.task.heap.mb=48000 \
>>>>>> >> >> >> -ca giraph.isStaticGraph=true \
>>>>>> >> >> >> -ca giraph.SplitMasterWorker=false \
>>>>>> >> >> >> -ca giraph.oneToAllMsgSending=true \
>>>>>> >> >> >> -ca giraph.metrics.enable=true \
>>>>>> >> >> >> -ca giraph.jmap.histo.enable=true \
>>>>>> >> >> >> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
>>>>>> >> >> >> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
>>>>>> >> >> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
>>>>>> >> >> >> -ca giraph.useOutOfCoreGraph=true \
>>>>>> >> >> >> -ca giraph.waitForPerWorkerRequests=true \
>>>>>> >> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
>>>>>> >> >> >> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
>>>>>> >> >> >> -ca giraph.useInputSplitLocality=true \
>>>>>> >> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
>>>>>> >> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
>>>>>> >> >> >> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
>>>>>> >> >> >> -ca giraph.numInputThreads=1 \
>>>>>> >> >> >> -ca giraph.inputSplitSamplePercent=20 \
>>>>>> >> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
>>>>>> >> >> >> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
>>>>>> >> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
>>>>>> >> >> >> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>>>>>> >> >> >>
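>>>>>> >> >> >> (For readability, the out-of-core-related subset of the options
>>>>>> >> >> >> above, exactly as we pass them, is:
>>>>>> >> >> >>
>>>>>> >> >> >>   -ca giraph.useOutOfCoreGraph=true \
>>>>>> >> >> >>   -ca giraph.isStaticGraph=true \
>>>>>> >> >> >>   -ca giraph.waitForPerWorkerRequests=true \
>>>>>> >> >> >>   -ca giraph.maxNumberOfUnsentRequests=1000 \
>>>>>> >> >> >>   -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
>>>>>> >> >> >>
>>>>>> >> >> >> The remaining flags configure input, partitioning and PageRank
>>>>>> >> >> >> itself.)
>>>>>> >> >> >>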
>>>>>> >> >> >> Logs excerpt:
>>>>>> >> >> >>
>>>>>> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
>>>>>> >> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
>>>>>> >> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
>>>>>> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
>>>>>> >> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
>>>>>> >> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>>>>>> >> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>>>>>> >> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
>>>>>> >> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
>>>>>> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
>>>>>> >> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>>>>>> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>>>>>> >> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
>>>>>> >> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
>>>>>> >> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
>>>>>> >> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
>>>>>> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
>>>>>> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
>>>>>> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>>>>>> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>>>>>> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
>>>>>> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>>>>>> >> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
>>>>>> >> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
>>>>>> >> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
>>>>>> >> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
>>>>>> >> >> >> --
>>>>>> >> >> >> -- num     #instances         #bytes  class name
>>>>>> >> >> >> -- ----------------------------------------------
>>>>>> >> >> >> --   1:     224004229    10752202992  java.util.concurrent.ConcurrentHashMap$Node
>>>>>> >> >> >> --   2:      19751666     6645730528  [B
>>>>>> >> >> >> --   3:     222135985     5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
>>>>>> >> >> >> --   4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
>>>>>> >> >> >> --   5:           353     4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
>>>>>> >> >> >> --   6:        486266      204484688  [I
>>>>>> >> >> >> --   7:       6017652      192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
>>>>>> >> >> >> --   8:       3986203      159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
>>>>>> >> >> >> --   9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
>>>>>> >> >> >> --  10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
>>>>>> >> >> >> --  11:       1886875       45285000  java.lang.Integer
>>>>>> >> >> >> --  12:        349409       30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
>>>>>> >> >> >> --  13:        916970       29343040  java.util.Collections$1
>>>>>> >> >> >> --  14:        916971       22007304  java.util.Collections$SingletonSet
>>>>>> >> >> >> --  15:         47270        3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
>>>>>> >> >> >> --  16:         26201        2590912  [C
>>>>>> >> >> >> --  17:         34175        1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
>>>>>> >> >> >> --  18:          6143        1067704  java.lang.Class
>>>>>> >> >> >> --  19:         25953         830496  java.lang.String
>>>>>> >> >> >> --  20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
>>>>>> >> >> >> --  21:          4488         703400  [Ljava.lang.Object;
>>>>>> >> >> >> --  22:            70         395424  [Ljava.nio.channels.SelectionKey;
>>>>>> >> >> >> --  23:          2052         328320  java.lang.reflect.Method
>>>>>> >> >> >> --  24:          6600         316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
>>>>>> >> >> >> --  25:          5781         277488  java.util.HashMap$Node
>>>>>> >> >> >> --  26:          5651         271248  java.util.Hashtable$Entry
>>>>>> >> >> >> --  27:          6604         211328  org.apache.giraph.factories.DefaultMessageValueFactory
>>>>>> >> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
>>>>>> >> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>>>>>> >> >> >> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>>>>>> >> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>> >> >> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> >> >> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> >> >> >> at java.lang.Thread.run(Thread.java:745)
>>>>>> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>>>>>> >> >> >> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>>>>>> >> >> >> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>>>>>> >> >> >> ... 6 more
>>>>>> >> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
>>>>>> >> >> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
>>>>>> >> >> >> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
>>>>>> >> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>> >> >> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> >> >> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> >> >> >> at java.lang.Thread.run(Thread.java:745)
>>>>>> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
>>>>>> >> >> >> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
>>>>>> >> >> >> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
>>>>>> >> >> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
>>>>>> >> >> >> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>>>>>> >> >> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
>>>>>> >> >> >> ... 6 more
>>>>>> >> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
>>>>>> >> >> >>
>>>>>> >> >> >> We looked into one thread,
>>>>>> >> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E,
>>>>>> >> >> >> but it is rather old, and at that time the answer was "do not use it
>>>>>> >> >> >> yet" (see the reply at
>>>>>> >> >> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
>>>>>> >> >> >> Does that still hold today? We would like to use the new advanced
>>>>>> >> >> >> adaptive OOC approach if possible...
>>>>>> >> >> >>
>>>>>> >> >> >> Thank you in advance, any help or hint would be really
>>>>>> appreciated.
>>>>>> >> >> >>
>>>>>> >> >> >> Best Regards,
>>>>>> >> >> >> Denis Dudinski
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
