giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hai Lan <lanhai1...@gmail.com>
Subject Re: Out of core computation fails with KryoException: Buffer underflow
Date Wed, 09 Nov 2016 18:04:47 GMT
Thank you so much!

Now I fully understand. I'll do more tests and see.

BR,

Hai


On Wed, Nov 9, 2016 at 12:59 PM, Hassan Eslami <hsn.eslami@gmail.com> wrote:

> Yes. I think what Sergey meant by that is the OOC is capable of even
> spilling 90% of the graph to disk, just to give an example to show OOC is
> not limited to memory.
>
> In your case, where you have a 1TB graph and a 10TB disk space, OOC would
> let the computation to finish just fine. Although, be aware that the more
> data goes on disk the more time spent in reading them back in memory. So,
> for instance, if you have 1TB graph size and 100GB memory size and you are
> running on a single machine, that means 90% of the graph is going on disk.
> If your computation per vertex is not too heavy (which is usually the
> case), the execution time will be bounded by disk operations. Let's say you
> are using a disk with 150MB/s bandwidth (a good HDD). In the example I
> mentioned, each superstep would need 900GB to be read and also written to
> the disk. That's roughly 3.5 hours per superstep.
>
> Best,
> Hassan
>
> On Wed, Nov 9, 2016 at 11:44 AM, Hai Lan <lanhai1988@gmail.com> wrote:
>
>> Hello Hassan
>>
>> The 90% is mentioned by Sergey Edunov said "speaking of out of core, we
>> tried to spill up to 90% of the graph to the disk. ". So I guess it
>> might means OOC is still limited by memory size if the input graph size is
>> over 10 times than memory size. By reading your response, just double check
>> if the disk size is larger than the input graph, like 1Tb graph and 10Tb
>> disk space, it should be able to run it, correct?
>>
>> Thanks again
>>
>> Best,
>>
>> Hai
>>
>>
>> On Wed, Nov 9, 2016 at 12:33 PM, Hassan Eslami <hsn.eslami@gmail.com>
>> wrote:
>>
>>> Hi Hai,
>>>
>>> 1. One of the goals in having adaptive mechanism was to make OOC faster
>>> than cases where you specify the number of partitions explicitly. In
>>> particular, if you don't know exactly how much should be the number of
>>> partitions, you may end up setting it to a pessimistic number and not
>>> taking advantage of the entire available memory. That being said, the
>>> adaptive mechanism should always be preferred if you are aiming for higher
>>> performance. Also, the adaptive mechanism avoids OOM failures due to
>>> message overflow. That means the adaptive mechanism also provide a higher
>>> robustness.
>>>
>>> 2. I don't understand where the 90% you are mentioning comes from. In my
>>> example in the other email, the 90% was for the suggested size of tenured
>>> memory size (to reduce GC overhead). And OOC mechanism works independently
>>> of how much memory is available. There are two fundamental limits for OOC
>>> though: a) OOC assumes one partition and its messages can fit entirely in
>>> memory. So, if partitions are large and any of them won't fit in memory,
>>> you should increase the number of partitions. b) OOC is limited to the
>>> "disk" size on each machine. If the amount of data on each machine exceeds
>>> the "disk" size, OOC will fail. In that case, you should use more machines
>>> or decrease your graph size somehow.
>>>
>>> Best,
>>> Hassan
>>>
>>>
>>> On Wed, Nov 9, 2016 at 9:30 AM, Hai Lan <lanhai1988@gmail.com> wrote:
>>>
>>>> Many thanks Hassan
>>>>
>>>> I did test a fixed number of partitions without isStaticGraph=true and
>>>> it can work great.
>>>>
>>>> I'll follow your instruction to test adaptive mechanism then. But I
>>>> have two small questions:
>>>>
>>>> 1. Is there any difference in performance aspect of using fixed number
>>>> setting and adaptive setting?
>>>>
>>>> 2. As I known, the out of core can only split up to 90% input graph
>>>> into disk. Does that mean for example, a 10 Tb graph can be processed with
>>>> at least 1 Tb available memory?
>>>>
>>>> Thanks again,
>>>>
>>>> Best,
>>>>
>>>> Hai
>>>>
>>>>
>>>> On Tue, Nov 8, 2016 at 12:42 PM, Hassan Eslami <hsn.eslami@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Hai,
>>>>>
>>>>> I notice that you are trying to use the new OOC mechanism too. Here is
>>>>> my take on your issue:
>>>>>
>>>>> As mentioned earlier in the thread, we noticed there is a bug with
>>>>> "isStaticGraph=true" option. This is a flag only for optimization purposes.
>>>>> I'll create a JIRA and send a fix for it, but for now, please run your job
>>>>> without this flag. This should help you pass the first superstep.
>>>>>
>>>>> As for the adaptive mechanism vs. fixed number of partitions, both
>>>>> approaches are now acceptable in the new OOC design. If you add "
>>>>> giraph.maxPartitionsInMemory" the OOC infrastructure assumes that you
>>>>> are using fixed number of partitions in memory and ignores any other
>>>>> OOC-related flags in your command. This is done to be backward compatible
>>>>> with existing codes depending on OOC in the previous version. But, be
>>>>> advised that using this type of out-of-core execution WILL NOT prevent your
>>>>> job from failures due to spikes in messages. Also, you have to make sure
>>>>> that the number you specify as the number of partitions in memory is set in
>>>>> a way that your specified number of partitions and their messages will fit
>>>>> in your available memory.
>>>>>
>>>>> On the other hand, I encourage you to use the adaptive mechanism in
>>>>> which you do not have to mention the number of partitions in memory, and
>>>>> the OOC mechanism underneath will figure things out automatically. To use
>>>>> the adaptive mechanism, you should use the following flags:
>>>>> giraph.useOutOfCoreGraph=true
>>>>> giraph.waitForRequestsConfirmation=false
>>>>> giraph.waitForPerWorkerRequests=true
>>>>>
>>>>> I know the naming for the flags here is a bit bizarre, but this sets
>>>>> up the infrastructure for message flow control which is crucial to avoid
>>>>> failures due to messages. The default strategy for the adaptive mechanism
>>>>> is threshold based. Meaning that, there are a bunch of thresholds (default
>>>>> values for the threshold are defined in ThresholdBasedOracle class) the
>>>>> system reacts to those. You should follow some (fairly easy) guidelines to
>>>>> set the proper threshold for your system. Please refer to the other email
>>>>> response in the same thread for guidelines on how to set your thresholds
>>>>> properly.
>>>>>
>>>>> Hope it helps,
>>>>> Best,
>>>>> Hassan
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:01 AM, Hai Lan <lanhai1988@gmail.com> wrote:
>>>>>
>>>>>> Hello Denis
>>>>>>
>>>>>> Thanks for your quick response.
>>>>>>
>>>>>> I just tested to set the timeout as 3600000. And it seems like
>>>>>> superstep 0 can be finished now. However, the job is killed
>>>>>> immediately when superstep 1 start. In zookeeper log:
>>>>>>
>>>>>> 2016-11-08 11:54:13,569 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: checkWorkers: Only found 198 responses of 199 needed to start superstep 1.  Reporting every 30000 msecs, 511036 more msecs left before giving up.
>>>>>> 2016-11-08 11:54:13,570 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: logMissingWorkersOnSuperstep: No response from partition 13 (could be master)
>>>>>> 2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14e81 zxid:0xc76 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
>>>>>> 2016-11-08 11:54:13,571 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14e82 zxid:0xc77 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
>>>>>> 2016-11-08 11:54:21,045 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14f4b zxid:0xc79 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir
>>>>>> 2016-11-08 11:54:21,046 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30000 type:create cxid:0x14f4c zxid:0xc7a txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerUnhealthyDir
>>>>>> 2016-11-08 11:54:21,094 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 0 failed, 0 failures total.
>>>>>> 2016-11-08 11:54:21,095 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionBalancer: balancePartitionsAcrossWorkers: Using algorithm static
>>>>>> 2016-11-08 11:54:21,097 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: [Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=22, port=30022):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=51, port=30051):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=99, port=30099):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=159, port=30159):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=189, port=30189):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=166, port=30166):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=172, port=30172):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=195, port=30195):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=116, port=30116):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=154, port=30154):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=2, port=30002):(v=58590001, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=123, port=30123):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=52, port=30052):(v=48825001, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=188, port=30188):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=165, port=30165):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=171, port=30171):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=23, port=30023):(v=48825003, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=117, port=30117):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=20, port=30020):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=89, port=30089):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=53, port=30053):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=168, port=30168):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=187, port=30187):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=179, port=30179):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=118, port=30118):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=75, port=30075):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=152, port=30152):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=21, port=30021):(v=48825003, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=88, port=30088):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=180, port=30180):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=54, port=30054):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=76, port=30076):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=119, port=30119):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=167, port=30167):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=153, port=30153):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=196, port=30196):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=170, port=30170):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=103, port=30103):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=156, port=30156):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=120, port=30120):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=150, port=30150):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=67, port=30067):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=59, port=30059):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=84, port=30084):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=19, port=30019):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=102, port=30102):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=169, port=30169):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=34, port=30034):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=162, port=30162):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=157, port=30157):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=83, port=30083):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=151, port=30151):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=121, port=30121):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=131, port=30131):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=101, port=30101):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=161, port=30161):(v=48824999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=122, port=30122):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=158, port=30158):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=148, port=30148):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=86, port=30086):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=140, port=30140):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=91, port=30091):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=100, port=30100):(v=48824999, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=160, port=30160):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=149, port=30149):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=85, port=30085):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=139, port=30139):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=13, port=30013):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=80, port=30080):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=92, port=30092):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=112, port=30112):(v=48824999, e=0),Worker(hostname=trantor14.umiacs.umd.edu hostOrIp=trantor14.umiacs.umd.edu, MRtaskID=147, port=30147):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=184, port=30184):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=8, port=30008):(v=48825004, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=45, port=30045):(v=48825003, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=58, port=30058):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=32, port=30032):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=106, port=30106):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=63, port=30063):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=142, port=30142):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=71, port=30071):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=40, port=30040):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=130, port=30130):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=39, port=30039):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=111, port=30111):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=93, port=30093):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=14, port=30014):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=7, port=30007):(v=48825004, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=185, port=30185):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=46, port=30046):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=33, port=30033):(v=48825003, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=105, port=30105):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=62, port=30062):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=141, port=30141):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=70, port=30070):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=135, port=30135):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=186, port=30186):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=94, port=30094):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=114, port=30114):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=43, port=30043):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=30, port=30030):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=128, port=30128):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=144, port=30144):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=69, port=30069):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=194, port=30194):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010):(v=48825004, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=132, port=30132):(v=48824999, e=0),Worker(hostname=trantor18.umiacs.umd.edu hostOrIp=trantor18.umiacs.umd.edu, MRtaskID=104, port=30104):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=61, port=30061):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=11, port=30011):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=42, port=30042):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=178, port=30178):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=12, port=30012):(v=48825003, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=134, port=30134):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=113, port=30113):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=44, port=30044):(v=48825003, e=0),Worker(hostname=trantor06.umiacs.umd.edu hostOrIp=trantor06.umiacs.umd.edu, MRtaskID=155, port=30155):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=31, port=30031):(v=48825003, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=68, port=30068):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=9, port=30009):(v=48825004, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=95, port=30095):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=143, port=30143):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=133, port=30133):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=60, port=30060):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=129, port=30129):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=177, port=30177):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=41, port=30041):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=198, port=30198):(v=48824999, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=193, port=30193):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=145, port=30145):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=66, port=30066):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=49, port=30049):(v=48825003, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=28, port=30028):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=4, port=30004):(v=58590001, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=26, port=30026):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=181, port=30181):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=55, port=30055):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=96, port=30096):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=17, port=30017):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=36, port=30036):(v=48825003, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=176, port=30176):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=108, port=30108):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=77, port=30077):(v=48824999, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=126, port=30126):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=136, port=30136):(v=48824999, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=197, port=30197):(v=48824999, e=0),Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=90, port=30090):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=29, port=30029):(v=48825003, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=192, port=30192):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=50, port=30050):(v=48825003, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=3, port=30003):(v=58590001, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=56, port=30056):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=97, port=30097):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=18, port=30018):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=127, port=30127):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=35, port=30035):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=78, port=30078):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=175, port=30175):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=82, port=30082):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=107, port=30107):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=74, port=30074):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=47, port=30047):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=1, port=30001):(v=58589999, e=0),Worker(hostname=trantor23.umiacs.umd.edu hostOrIp=trantor23.umiacs.umd.edu, MRtaskID=115, port=30115):(v=48824999, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=38, port=30038):(v=48825003, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=110, port=30110):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=15, port=30015):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=124, port=30124):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=6, port=30006):(v=48825004, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=191, port=30191):(v=48824999, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=182, port=30182):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=174, port=30174):(v=48824999, e=0),Worker(hostname=trantor24.umiacs.umd.edu hostOrIp=trantor24.umiacs.umd.edu, MRtaskID=98, port=30098):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=164, port=30164):(v=48824999, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=65, port=30065):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=24, port=30024):(v=48825003, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=79, port=30079):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=73, port=30073):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=138, port=30138):(v=48824999, e=0),Worker(hostname=trantor00.umiacs.umd.edu hostOrIp=trantor00.umiacs.umd.edu, MRtaskID=81, port=30081):(v=48824999, e=0),Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005):(v=58590001, e=0),Worker(hostname=trantor12.umiacs.umd.edu hostOrIp=trantor12.umiacs.umd.edu, MRtaskID=190, port=30190):(v=48824999, e=0),Worker(hostname=trantor07.umiacs.umd.edu hostOrIp=trantor07.umiacs.umd.edu, MRtaskID=27, port=30027):(v=48825003, e=0),Worker(hostname=trantor10.umiacs.umd.edu hostOrIp=trantor10.umiacs.umd.edu, MRtaskID=37, port=30037):(v=48825003, e=0),Worker(hostname=trantor04.umiacs.umd.edu hostOrIp=trantor04.umiacs.umd.edu, MRtaskID=199, port=30199):(v=48824999, e=0),Worker(hostname=trantor22.umiacs.umd.edu hostOrIp=trantor22.umiacs.umd.edu, MRtaskID=146, port=30146):(v=48824999, e=0),Worker(hostname=trantor08.umiacs.umd.edu hostOrIp=trantor08.umiacs.umd.edu, MRtaskID=57, port=30057):(v=48824999, e=0),Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=trantor02.umiacs.umd.edu, MRtaskID=48, port=30048):(v=48825003, e=0),Worker(hostname=trantor05.umiacs.umd.edu hostOrIp=trantor05.umiacs.umd.edu, MRtaskID=183, port=30183):(v=48824999, e=0),Worker(hostname=trantor13.umiacs.umd.edu hostOrIp=trantor13.umiacs.umd.edu, MRtaskID=173, port=30173):(v=48824999, e=0),Worker(hostname=trantor20.umiacs.umd.edu hostOrIp=trantor20.umiacs.umd.edu, MRtaskID=25, port=30025):(v=48825003, e=0),Worker(hostname=trantor01.umiacs.umd.edu hostOrIp=trantor01.umiacs.umd.edu, MRtaskID=64, port=30064):(v=48824999, e=0),Worker(hostname=trantor03.umiacs.umd.edu hostOrIp=trantor03.umiacs.umd.edu, MRtaskID=16, port=30016):(v=48825003, e=0),Worker(hostname=trantor15.umiacs.umd.edu hostOrIp=trantor15.umiacs.umd.edu, MRtaskID=125, port=30125):(v=48824999, e=0),Worker(hostname=trantor16.umiacs.umd.edu hostOrIp=trantor16.umiacs.umd.edu, MRtaskID=109, port=30109):(v=48824999, e=0),Worker(hostname=trantor21.umiacs.umd.edu hostOrIp=trantor21.umiacs.umd.edu, MRtaskID=163, port=30163):(v=48824999, e=0),Worker(hostname=trantor09.umiacs.umd.edu hostOrIp=trantor09.umiacs.umd.edu, MRtaskID=72, port=30072):(v=48824999, e=0),Worker(hostname=trantor19.umiacs.umd.edu hostOrIp=trantor19.umiacs.umd.edu, MRtaskID=137, port=30137):(v=48824999, e=0),]
>>>>>> 2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 49070351, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 48824999, Max: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 58590001
>>>>>> 2016-11-08 11:54:21,098 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 0, Min: Worker(hostname=trantor17.umiacs.umd.edu hostOrIp=trantor17.umiacs.umd.edu, MRtaskID=87, port=30087) - 0, Max: Worker(hostname=trantor11.umiacs.umd.edu hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=5, port=30005) - 0
>>>>>> 2016-11-08 11:54:21,104 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 199 workers finished on superstep 1 on path /_hadoopBsp/job_1477020594559_0051/_applicationAttemptsDir/0/_superstepDir/1/_workerFinishedDir
>>>>>> 2016-11-08 11:54:29,090 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1} on superstep 1
>>>>>> 2016-11-08 11:54:29,094 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30044 type:create cxid:0x1b zxid:0xd46 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>>> 2016-11-08 11:54:29,094 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: setJobState: {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1}
>>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba3004a type:create cxid:0x1b zxid:0xd47 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba3004f type:create cxid:0x1b zxid:0xd48 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>>> 2016-11-08 11:54:29,096 INFO [ProcessThread(sid:0 cport:-1):] org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x15844b61ba30054 type:create cxid:0x1b zxid:0xd49 txntype:-1 reqpath:n/a Error Path:/_hadoopBsp/job_1477020594559_0051/_masterJobState Error:KeeperErrorCode = NodeExists for /_hadoopBsp/job_1477020594559_0051/_masterJobState
>>>>>> 2016-11-08 11:54:29,096 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: failJob: Killing job job_1477020594559_0051
>>>>>>
>>>>>>
>>>>>> Any other ideas?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> BR,
>>>>>>
>>>>>>
>>>>>> Hai
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 9:48 AM, Denis Dudinski <
>>>>>> denis.dudinski@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Hai,
>>>>>>>
>>>>>>> I think we saw something like this in our environment.
>>>>>>>
>>>>>>> Interesting row is this one:
>>>>>>> 2016-10-27 19:04:00,000 INFO [SessionTracker]
>>>>>>> org.apache.zookeeper.server.ZooKeeperServer: Expiring session
>>>>>>> 0x158084f5b2100b8, timeout of 600000ms exceeded
>>>>>>>
>>>>>>> I think that one of workers due to some reason did not communicate
>>>>>>> with ZooKeeper for quite a long time (it may be heavy network load or
>>>>>>> high CPU consumption, see your monitoring infrastructure, it should
>>>>>>> give you a hint). ZooKeeper session expires and all ephemeral nodes
>>>>>>> for that worker in ZooKeeper tree are deleted. Master thinks that
>>>>>>> worker is dead and halts computation.
>>>>>>>
>>>>>>> Your ZooKeeper session timeout is 600000 ms which is 10 minutes. We
>>>>>>> set this value to much more higher value equal to 1 hour and were
>>>>>>> able
>>>>>>> to perform computations successfully.
>>>>>>>
>>>>>>> I hope it will help in your case too.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Denis Dudinski
>>>>>>>
>>>>>>> 2016-11-08 16:43 GMT+03:00 Hai Lan <lanhai1988@gmail.com>:
>>>>>>> > Hi Guys
>>>>>>> >
>>>>>>> > The OutOfMemoryError might be solved be adding
>>>>>>> > "-Dmapreduce.map.memory.mb=14848". But in my tests, I found some
>>>>>>> more
>>>>>>> > problems during running out of core graph.
>>>>>>> >
>>>>>>> > I did two tests with 150G 10^10 vertices input in 1.2 version, and
>>>>>>> it seems
>>>>>>> > like it not necessary to add like
>>>>>>> > "giraph.userPartitionCount=1000,giraph.maxPartitionsInMemory=1"
>>>>>>> cause it is
>>>>>>> > adaptive. However, If I run without setting "userPartitionCount and
>>>>>>> > maxPartitionsInMemory", it will it will keep running on superstep
>>>>>>> -1
>>>>>>> > forever. None of worker can finish superstep -1. And I can see a
>>>>>>> warn in
>>>>>>> > zookeeper log, not sure if it is the problem:
>>>>>>> >
>>>>>>> > WARN [netty-client-worker-3]
>>>>>>> > org.apache.giraph.comm.netty.handler.ResponseClientHandler:
>>>>>>> exceptionCaught:
>>>>>>> > Channel failed with remote address
>>>>>>> > trantor21.umiacs.umd.edu/192.168.74.221:30172
>>>>>>> > java.lang.ArrayIndexOutOfBoundsException: 1075052544
>>>>>>> >       at
>>>>>>> > org.apache.giraph.comm.flow_control.NoOpFlowControl.getAckSi
>>>>>>> gnalFlag(NoOpFlowControl.java:52)
>>>>>>> >       at
>>>>>>> > org.apache.giraph.comm.netty.NettyClient.messageReceived(Net
>>>>>>> tyClient.java:796)
>>>>>>> >       at
>>>>>>> > org.apache.giraph.comm.netty.handler.ResponseClientHandler.c
>>>>>>> hannelRead(ResponseClientHandler.java:87)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.invokeChannelR
>>>>>>> ead(DefaultChannelHandlerContext.java:338)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.fireChannelRea
>>>>>>> d(DefaultChannelHandlerContext.java:324)
>>>>>>> >       at
>>>>>>> > io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>>>>>> ToMessageDecoder.java:153)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.invokeChannelR
>>>>>>> ead(DefaultChannelHandlerContext.java:338)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.fireChannelRea
>>>>>>> d(DefaultChannelHandlerContext.java:324)
>>>>>>> >       at
>>>>>>> > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(
>>>>>>> InboundByteCounter.java:74)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.invokeChannelR
>>>>>>> ead(DefaultChannelHandlerContext.java:338)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelHandlerContext.fireChannelRea
>>>>>>> d(DefaultChannelHandlerContext.java:324)
>>>>>>> >       at
>>>>>>> > io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>>>>>> ultChannelPipeline.java:785)
>>>>>>> >       at
>>>>>>> > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.re
>>>>>>> ad(AbstractNioByteChannel.java:126)
>>>>>>> >       at
>>>>>>> > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEven
>>>>>>> tLoop.java:485)
>>>>>>> >       at
>>>>>>> > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimiz
>>>>>>> ed(NioEventLoop.java:452)
>>>>>>> >       at io.netty.channel.nio.NioEventL
>>>>>>> oop.run(NioEventLoop.java:346)
>>>>>>> >       at
>>>>>>> > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sin
>>>>>>> gleThreadEventExecutor.java:101)
>>>>>>> >       at java.lang.Thread.run(Thread.java:745)
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > If I add giraph.userPartitionCount=1000
>>>>>>> ,giraph.maxPartitionsInMemory=1.
>>>>>>> > Whole command is :
>>>>>>> >
>>>>>>> > hadoop jar
>>>>>>> > /home/hlan/giraph-1.2.0-hadoop2/giraph-examples/target/girap
>>>>>>> h-examples-1.2.0-hadoop2-for-hadoop-2.6.0-jar-with-dependencies.jar
>>>>>>> > org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreGraph=true
>>>>>>> > -Ddigraph.block_factory_configurators=org.apache.giraph.conf
>>>>>>> .FacebookConfiguration
>>>>>>> > -Dmapreduce.map.memory.mb=14848 org.apache.giraph.examples.myTask
>>>>>>> -vif
>>>>>>> > org.apache.giraph.examples.LongFloatNullTextInputFormat -vip
>>>>>>> > /user/hlan/cube/tmp/out/ -vof
>>>>>>> > org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>>>>>>> > /user/hlan/output -w 199 -ca
>>>>>>> > mapred.job.tracker=localhost:5431,steps=6,giraph.isStaticGra
>>>>>>> ph=true,giraph.numInputThreads=10,giraph.userPartitionCount=
>>>>>>> 1000,giraph.maxPartitionsInMemory=1
>>>>>>> >
>>>>>>> > the job will be pass superstep -1 very quick (around 10 mins). But
>>>>>>> it will
>>>>>>> > be killed near end of superstep 0.
>>>>>>> >
>>>>>>> > 2016-10-27 18:53:56,607 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.partition.PartitionUtils:
>>>>>>> analyzePartitionStats: Vertices
>>>>>>> > - Mean: 9810049, Min: Worker(hostname=trantor11.umiacs.umd.edu
>>>>>>> > hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) -
>>>>>>> 9771533, Max:
>>>>>>> > Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=
>>>>>>> trantor02.umiacs.umd.edu,
>>>>>>> > MRtaskID=49, port=30049) - 9995724
>>>>>>> > 2016-10-27 18:53:56,608 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.partition.PartitionUtils:
>>>>>>> analyzePartitionStats: Edges -
>>>>>>> > Mean: 0, Min: Worker(hostname=trantor11.umiacs.umd.edu
>>>>>>> > hostOrIp=trantor11.umiacs.umd.edu, MRtaskID=10, port=30010) - 0,
>>>>>>> Max:
>>>>>>> > Worker(hostname=trantor02.umiacs.umd.edu hostOrIp=
>>>>>>> trantor02.umiacs.umd.edu,
>>>>>>> > MRtaskID=49, port=30049) - 0
>>>>>>> > 2016-10-27 18:53:56,634 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:54:26,638 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:54:56,640 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:55:26,641 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:55:56,642 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:56:26,643 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:56:56,644 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:57:26,645 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:57:56,646 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:58:26,647 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:58:56,675 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:59:26,676 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 18:59:56,677 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:00:26,678 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:00:56,679 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:01:26,680 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:01:29,610 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>>> exception
>>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>>> sessionid
>>>>>>> > 0x158084f5b2100c6, likely client has closed socket
>>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>>> .java:220)
>>>>>>> > at
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>>> erCnxnFactory.java:208)
>>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>>> > 2016-10-27 19:01:29,612 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.212:53136 which had sessionid 0x158084f5b2100c6
>>>>>>> > 2016-10-27 19:01:31,702 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>>> connection
>>>>>>> > from /192.168.74.212:56696
>>>>>>> > 2016-10-27 19:01:31,711 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>>> renew
>>>>>>> > session 0x158084f5b2100c6 at /192.168.74.212:56696
>>>>>>> > 2016-10-27 19:01:31,712 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>>> > 0x158084f5b2100c6 with negotiated timeout 600000 for client
>>>>>>> > /192.168.74.212:56696
>>>>>>> > 2016-10-27 19:01:56,681 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:02:20,029 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>>> exception
>>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>>> sessionid
>>>>>>> > 0x158084f5b2100c5, likely client has closed socket
>>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>>> .java:220)
>>>>>>> > at
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>>> erCnxnFactory.java:208)
>>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>>> > 2016-10-27 19:02:20,030 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.212:53134 which had sessionid 0x158084f5b2100c5
>>>>>>> > 2016-10-27 19:02:21,584 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>>> connection
>>>>>>> > from /192.168.74.212:56718
>>>>>>> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>>> renew
>>>>>>> > session 0x158084f5b2100c5 at /192.168.74.212:56718
>>>>>>> > 2016-10-27 19:02:21,608 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>>> > 0x158084f5b2100c5 with negotiated timeout 600000 for client
>>>>>>> > /192.168.74.212:56718
>>>>>>> > 2016-10-27 19:02:26,682 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:02:56,683 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:03:05,743 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>>> exception
>>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>>> sessionid
>>>>>>> > 0x158084f5b2100b9, likely client has closed socket
>>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>>> .java:220)
>>>>>>> > at
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>>> erCnxnFactory.java:208)
>>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>>> > 2016-10-27 19:03:05,744 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.203:51130 which had sessionid 0x158084f5b2100b9
>>>>>>> > 2016-10-27 19:03:07,452 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>>> connection
>>>>>>> > from /192.168.74.203:54676
>>>>>>> > 2016-10-27 19:03:07,493 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>>> renew
>>>>>>> > session 0x158084f5b2100b9 at /192.168.74.203:54676
>>>>>>> > 2016-10-27 19:03:07,494 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>>> > 0x158084f5b2100b9 with negotiated timeout 600000 for client
>>>>>>> > /192.168.74.203:54676
>>>>>>> > 2016-10-27 19:03:26,684 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:03:53,712 WARN [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: caught end of stream
>>>>>>> exception
>>>>>>> > EndOfStreamException: Unable to read additional data from client
>>>>>>> sessionid
>>>>>>> > 0x158084f5b2100be, likely client has closed socket
>>>>>>> > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn
>>>>>>> .java:220)
>>>>>>> > at
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServ
>>>>>>> erCnxnFactory.java:208)
>>>>>>> > at java.lang.Thread.run(Thread.java:745)
>>>>>>> > 2016-10-27 19:03:53,713 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.203:51146 which had sessionid 0x158084f5b2100be
>>>>>>> > 2016-10-27 19:03:55,436 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket
>>>>>>> connection
>>>>>>> > from /192.168.74.203:54694
>>>>>>> > 2016-10-27 19:03:55,482 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Client attempting to
>>>>>>> renew
>>>>>>> > session 0x158084f5b2100be at /192.168.74.203:54694
>>>>>>> > 2016-10-27 19:03:55,483 INFO [NIOServerCxn.Factory:0.0.0.0/
>>>>>>> 0.0.0.0:22181]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Established session
>>>>>>> > 0x158084f5b2100be with negotiated timeout 600000 for client
>>>>>>> > /192.168.74.203:54694
>>>>>>> > 2016-10-27 19:03:56,719 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0
>>>>>>> out of 199
>>>>>>> > workers finished on superstep 0 on path
>>>>>>> > /_hadoopBsp/job_1477020594559_0049/_applicationAttemptsDir/0
>>>>>>> /_superstepDir/0/_workerFinishedDir
>>>>>>> > 2016-10-27 19:04:00,000 INFO [SessionTracker]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Expiring session
>>>>>>> > 0x158084f5b2100b8, timeout of 600000ms exceeded
>>>>>>> > 2016-10-27 19:04:00,001 INFO [SessionTracker]
>>>>>>> > org.apache.zookeeper.server.ZooKeeperServer: Expiring session
>>>>>>> > 0x158084f5b2100c2, timeout of 600000ms exceeded
>>>>>>> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
>>>>>>> > org.apache.zookeeper.server.PrepRequestProcessor: Processed
>>>>>>> session
>>>>>>> > termination for sessionid: 0x158084f5b2100b8
>>>>>>> > 2016-10-27 19:04:00,002 INFO [ProcessThread(sid:0 cport:-1):]
>>>>>>> > org.apache.zookeeper.server.PrepRequestProcessor: Processed
>>>>>>> session
>>>>>>> > termination for sessionid: 0x158084f5b2100c2
>>>>>>> > 2016-10-27 19:04:00,004 INFO [SyncThread:0]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.203:51116 which had sessionid 0x158084f5b2100b8
>>>>>>> > 2016-10-27 19:04:00,006 INFO [SyncThread:0]
>>>>>>> > org.apache.zookeeper.server.NIOServerCnxn: Closed socket
>>>>>>> connection for
>>>>>>> > client /192.168.74.212:53128 which had sessionid 0x158084f5b2100c2
>>>>>>> > 2016-10-27 19:04:00,033 INFO [org.apache.giraph.master.Mast
>>>>>>> erThread]
>>>>>>> > org.apache.giraph.master.BspServiceMaster: setJobState:
>>>>>>> > {"_applicationAttemptKey":-1,"_stateKey":"FAILED","_superstepKey":-1}
>>>>>>> on
>>>>>>> > superstep 0
>>>>>>> >
>>>>>>> > Any Idea about this?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> >
>>>>>>> > Hai
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Nov 8, 2016 at 6:37 AM, Denis Dudinski <
>>>>>>> denis.dudinski@gmail.com>
>>>>>>> > wrote:
>>>>>>> >>
>>>>>>> >> Hi Xenia,
>>>>>>> >>
>>>>>>> >> Thank you! I'll check the thread you mentioned.
>>>>>>> >>
>>>>>>> >> Best Regards,
>>>>>>> >> Denis Dudinski
>>>>>>> >>
>>>>>>> >> 2016-11-08 14:16 GMT+03:00 Xenia Demetriou <xeniad20@gmail.com>:
>>>>>>> >> > Hi Denis,
>>>>>>> >> >
>>>>>>> >> > For the "java.lang.OutOfMemoryError: GC overhead limit
>>>>>>> exceeded" error
>>>>>>> >> > I
>>>>>>> >> > hope that the  conversation in below link can help you.
>>>>>>> >> >  www.mail-archive.com/user@giraph.apache.org/msg02938.html
>>>>>>> >> >
>>>>>>> >> > Regards,
>>>>>>> >> > Xenia
>>>>>>> >> >
>>>>>>> >> > 2016-11-08 12:25 GMT+02:00 Denis Dudinski <
>>>>>>> denis.dudinski@gmail.com>:
>>>>>>> >> >>
>>>>>>> >> >> Hi Hassan,
>>>>>>> >> >>
>>>>>>> >> >> Thank you for really quick response!
>>>>>>> >> >>
>>>>>>> >> >> I changed "giraph.isStaticGraph" to false and the error
>>>>>>> disappeared.
>>>>>>> >> >> As expected iteration became slow and wrote to disk edges once
>>>>>>> again
>>>>>>> >> >> in superstep 1.
>>>>>>> >> >>
>>>>>>> >> >> However, the computation failed at superstep 2 with error
>>>>>>> >> >> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It
>>>>>>> seems to
>>>>>>> >> >> be unrelated to "isStaticGraph" issue, but I think it worth
>>>>>>> mentioning
>>>>>>> >> >> to see the picture as a whole.
>>>>>>> >> >>
>>>>>>> >> >> Are there any other tests/information I am able to
>>>>>>> execute/check to
>>>>>>> >> >> help to pinpoint "isStaticGraph" problem?
>>>>>>> >> >>
>>>>>>> >> >> Best Regards,
>>>>>>> >> >> Denis Dudinski
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.eslami@gmail.com
>>>>>>> >:
>>>>>>> >> >> > Hi Denis,
>>>>>>> >> >> >
>>>>>>> >> >> > Thanks for bringing up the issue. In the previous
>>>>>>> conversation
>>>>>>> >> >> > thread,
>>>>>>> >> >> > the
>>>>>>> >> >> > similar problem is reported even with a simpler example
>>>>>>> connected
>>>>>>> >> >> > component
>>>>>>> >> >> > calculation. Although, back then, we were developing other
>>>>>>> >> >> > performance-critical components of OOC.
>>>>>>> >> >> >
>>>>>>> >> >> > Let's debug this issue together to make the new OOC more
>>>>>>> stable. I
>>>>>>> >> >> > suspect
>>>>>>> >> >> > the problem is with "giraph.isStaticGraph=true" (as this is
>>>>>>> only an
>>>>>>> >> >> > optimization and most of our end-to-end testing was on cases
>>>>>>> where
>>>>>>> >> >> > the
>>>>>>> >> >> > graph
>>>>>>> >> >> > could change). Let's get rid of it for now and see if the
>>>>>>> problem
>>>>>>> >> >> > still
>>>>>>> >> >> > exists.
>>>>>>> >> >> >
>>>>>>> >> >> > Best,
>>>>>>> >> >> > Hassan
>>>>>>> >> >> >
>>>>>>> >> >> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski
>>>>>>> >> >> > <denis.dudinski@gmail.com>
>>>>>>> >> >> > wrote:
>>>>>>> >> >> >>
>>>>>>> >> >> >> Hello,
>>>>>>> >> >> >>
>>>>>>> >> >> >> We are trying to calculate PageRank on huge graph, which
>>>>>>> does not
>>>>>>> >> >> >> fit
>>>>>>> >> >> >> into memory. For calculation to succeed we tried to turn on
>>>>>>> >> >> >> OutOfCore
>>>>>>> >> >> >> feature of Giraph, but every launch we tried resulted in
>>>>>>> >> >> >> com.esotericsoftware.kryo.KryoException: Buffer underflow.
>>>>>>> >> >> >> Each time it happens on different servers but exactly right
>>>>>>> after
>>>>>>> >> >> >> start of superstep 1.
>>>>>>> >> >> >>
>>>>>>> >> >> >> We are using Giraph 1.2.0 on hadoop 2.7.3 (our prod
>>>>>>> version, can't
>>>>>>> >> >> >> back-step to Giraph's officially supported version and had
>>>>>>> to patch
>>>>>>> >> >> >> Giraph a little) placed on 11 servers + 3 master servers
>>>>>>> (namenodes
>>>>>>> >> >> >> etc.) with separate ZooKeeper cluster deployment.
>>>>>>> >> >> >>
>>>>>>> >> >> >> Our launch command:
>>>>>>> >> >> >>
>>>>>>> >> >> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-w
>>>>>>> ith-dependencies.jar
>>>>>>> >> >> >> org.apache.giraph.GiraphRunner
>>>>>>> >> >> >> com.prototype.di.pr.PageRankComputation
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -mc com.prototype.di.pr.PageRankMasterCompute \
>>>>>>> >> >> >> -yj pr-job-jar-with-dependencies.jar \
>>>>>>> >> >> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
>>>>>>> >> >> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
>>>>>>> \
>>>>>>> >> >> >> -op /user/hadoop/output/pr_test \
>>>>>>> >> >> >> -w 10 \
>>>>>>> >> >> >> -c com.prototype.di.pr.PRDoubleCombiner \
>>>>>>> >> >> >> -wc com.prototype.di.pr.PageRankWorkerContext \
>>>>>>> >> >> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
>>>>>>> >> >> >> -ca giraph.logLevel=info \
>>>>>>> >> >> >> -ca hbase.mapreduce.inputtable=di_test \
>>>>>>> >> >> >> -ca hbase.mapreduce.scan.columns=di:n \
>>>>>>> >> >> >> -ca hbase.defaults.for.version.skip=true \
>>>>>>> >> >> >> -ca hbase.table.row.textkey=false \
>>>>>>> >> >> >> -ca giraph.yarn.task.heap.mb=48000 \
>>>>>>> >> >> >> -ca giraph.isStaticGraph=true \
>>>>>>> >> >> >> -ca giraph.SplitMasterWorker=false \
>>>>>>> >> >> >> -ca giraph.oneToAllMsgSending=true \
>>>>>>> >> >> >> -ca giraph.metrics.enable=true \
>>>>>>> >> >> >> -ca giraph.jmap.histo.enable=true \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >> giraph.vertexIdClass=com.prototype.di.pr
>>>>>>> .DomainPartAwareLongWritable
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >> giraph.outgoingMessageValueClass=org.apache.hadoop.io
>>>>>>> .DoubleWritable
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >> giraph.inputOutEdgesClass=org.
>>>>>>> apache.giraph.edge.LongNullArrayEdges
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca giraph.useOutOfCoreGraph=true \
>>>>>>> >> >> >> -ca giraph.waitForPerWorkerRequests=true \
>>>>>>> >> >> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> giraph.vertexInputFilterClass=
>>>>>>> com.prototype.di.pr.input.PagesFromSameDomainLimiter
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca giraph.useInputSplitLocality=true \
>>>>>>> >> >> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
>>>>>>> >> >> >> -ca giraph.minPartitionsPerComputeThread=60 \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> giraph.graphPartitionerFactoryClass=com.prototype.di.pr
>>>>>>> .DomainAwareGraphPartitionerFactory
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca giraph.numInputThreads=1 \
>>>>>>> >> >> >> -ca giraph.inputSplitSamplePercent=20 \
>>>>>>> >> >> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >> giraph.partitionClass=org.apac
>>>>>>> he.giraph.partition.ByteArrayPartition
>>>>>>> >> >> >> \
>>>>>>> >> >> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex
>>>>>>> \
>>>>>>> >> >> >> -ca
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> giraph.partitionsDirectory=/di
>>>>>>> sk1/_bsp/_partitions,/disk2/_bsp/_partitions
>>>>>>> >> >> >>
>>>>>>> >> >> >> Logs excerpt:
>>>>>>> >> >> >>
>>>>>>> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre
>>>>>>> superstep in
>>>>>>> >> >> >> worker
>>>>>>> >> >> >> context
>>>>>>> >> >> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60
>>>>>>> >> >> >> partitions
>>>>>>> >> >> >> to process with 1 compute thread(s), originally 1 thread(s)
>>>>>>> on
>>>>>>> >> >> >> superstep 1
>>>>>>> >> >> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration:
>>>>>>> with 60
>>>>>>> >> >> >> partitions in memory and 1 active threads
>>>>>>> >> >> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre
>>>>>>> superstep1 in PR
>>>>>>> >> >> >> computation
>>>>>>> >> >> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.75
>>>>>>> >> >> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine:
>>>>>>> >> >> >> updateActiveThreadsFraction: updating the number of active
>>>>>>> threads
>>>>>>> >> >> >> to
>>>>>>> >> >> >> 1
>>>>>>> >> >> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> updateRequestsCredit: updating the credit to 20
>>>>>>> >> >> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager:
>>>>>>> installGCMonitoring:
>>>>>>> >> >> >> name = PS Scavenge, action = end of minor GC, cause =
>>>>>>> Allocation
>>>>>>> >> >> >> Failure, duration = 937ms
>>>>>>> >> >> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.72
>>>>>>> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.74
>>>>>>> >> >> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine:
>>>>>>> >> >> >> updateActiveThreadsFraction: updating the number of active
>>>>>>> threads
>>>>>>> >> >> >> to
>>>>>>> >> >> >> 1
>>>>>>> >> >> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> updateRequestsCredit: updating the credit to 20
>>>>>>> >> >> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.76
>>>>>>> >> >> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine:
>>>>>>> doneProcessingPartition:
>>>>>>> >> >> >> processing partition 234 is done!
>>>>>>> >> >> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.79
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine:
>>>>>>> >> >> >> updateActiveThreadsFraction: updating the number of active
>>>>>>> threads
>>>>>>> >> >> >> to
>>>>>>> >> >> >> 1
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> updateRequestsCredit: updating the credit to 18
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode:
>>>>>>> Server window
>>>>>>> >> >> >> metrics MBytes/sec received = 1.0994, MBytesReceived =
>>>>>>> 33.0459, ave
>>>>>>> >> >> >> received req MBytes = 0.0138, secs waited = 30.058
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.82
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call:
>>>>>>> thread 0's
>>>>>>> >> >> >> next
>>>>>>> >> >> >> IO command is: StorePartitionIOCommand: (partitionId = 234)
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call:
>>>>>>> thread 0's
>>>>>>> >> >> >> command StorePartitionIOCommand: (partitionId = 234)
>>>>>>> completed:
>>>>>>> >> >> >> bytes=
>>>>>>> >> >> >> 64419740, duration=351, bandwidth=175.03, bandwidth
>>>>>>> (excluding GC
>>>>>>> >> >> >> time)=175.03
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.83
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call:
>>>>>>> thread 0's
>>>>>>> >> >> >> next
>>>>>>> >> >> >> IO command is: StoreIncomingMessageIOCommand: (partitionId
>>>>>>> = 234)
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call:
>>>>>>> thread 0's
>>>>>>> >> >> >> command StoreIncomingMessageIOCommand: (partitionId = 234)
>>>>>>> >> >> >> completed:
>>>>>>> >> >> >> bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC
>>>>>>> >> >> >> time)=NaN
>>>>>>> >> >> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.83
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager:
>>>>>>> installGCMonitoring:
>>>>>>> >> >> >> name = PS Scavenge, action = end of minor GC, cause =
>>>>>>> Allocation
>>>>>>> >> >> >> Failure, duration = 3107ms
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager:
>>>>>>> installGCMonitoring:
>>>>>>> >> >> >> name = PS MarkSweep, action = end of major GC, cause =
>>>>>>> Ergonomics,
>>>>>>> >> >> >> duration = 15064ms
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine:
>>>>>>> >> >> >> updateActiveThreadsFraction: updating the number of active
>>>>>>> threads
>>>>>>> >> >> >> to
>>>>>>> >> >> >> 1
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> updateRequestsCredit: updating the credit to 20
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> getNextIOActions:
>>>>>>> >> >> >> usedMemoryFraction = 0.71
>>>>>>> >> >> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call:
>>>>>>> thread 0's
>>>>>>> >> >> >> next
>>>>>>> >> >> >> IO command is: LoadPartitionIOCommand: (partitionId = 234,
>>>>>>> superstep
>>>>>>> >> >> >> =
>>>>>>> >> >> >> 2)
>>>>>>> >> >> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
>>>>>>> >> >> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine:
>>>>>>> doneProcessingPartition:
>>>>>>> >> >> >> processing partition 364 is done!
>>>>>>> >> >> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine:
>>>>>>> >> >> >> updateActiveThreadsFraction: updating the number of active
>>>>>>> threads
>>>>>>> >> >> >> to
>>>>>>> >> >> >> 1
>>>>>>> >> >> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle:
>>>>>>> >> >> >> updateRequestsCredit: updating the credit to 20
>>>>>>> >> >> >> --
>>>>>>> >> >> >> -- num     #instances         #bytes  class name
>>>>>>> >> >> >> -- ----------------------------------------------
>>>>>>> >> >> >> --   1:     224004229    10752202992
>>>>>>> >> >> >> java.util.concurrent.ConcurrentHashMap$Node
>>>>>>> >> >> >> --   2:      19751666     6645730528  [B
>>>>>>> >> >> >> --   3:     222135985     5331263640
>>>>>>> >> >> >> com.belprime.di.pr.DomainPartAwareLongWritable
>>>>>>> >> >> >> --   4:     214686483     5152475592
>>>>>>> >> >> >> org.apache.hadoop.io.DoubleWritable
>>>>>>> >> >> >> --   5:           353     4357261784
>>>>>>> >> >> >> [Ljava.util.concurrent.ConcurrentHashMap$Node;
>>>>>>> >> >> >> --   6:        486266      204484688  [I
>>>>>>> >> >> >> --   7:       6017652      192564864
>>>>>>> >> >> >> org.apache.giraph.utils.UnsafeByteArrayOutputStream
>>>>>>> >> >> >> --   8:       3986203      159448120
>>>>>>> >> >> >> org.apache.giraph.utils.UnsafeByteArrayInputStream
>>>>>>> >> >> >> --   9:       2064182      148621104
>>>>>>> >> >> >> org.apache.giraph.graph.ByteValueVertex
>>>>>>> >> >> >> --  10:       2064182       82567280
>>>>>>> >> >> >> org.apache.giraph.edge.ByteArrayEdges
>>>>>>> >> >> >> --  11:       1886875       45285000  java.lang.Integer
>>>>>>> >> >> >> --  12:        349409       30747992
>>>>>>> >> >> >> java.util.concurrent.ConcurrentHashMap$TreeNode
>>>>>>> >> >> >> --  13:        916970       29343040
>>>>>>> java.util.Collections$1
>>>>>>> >> >> >> --  14:        916971       22007304
>>>>>>> >> >> >> java.util.Collections$SingletonSet
>>>>>>> >> >> >> --  15:         47270        3781600
>>>>>>> >> >> >> java.util.concurrent.ConcurrentHashMap$TreeBin
>>>>>>> >> >> >> --  16:         26201        2590912  [C
>>>>>>> >> >> >> --  17:         34175        1367000
>>>>>>> >> >> >> org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
>>>>>>> >> >> >> --  18:          6143        1067704  java.lang.Class
>>>>>>> >> >> >> --  19:         25953         830496  java.lang.String
>>>>>>> >> >> >> --  20:         34175         820200
>>>>>>> >> >> >> org.apache.giraph.edge.EdgeNoValue
>>>>>>> >> >> >> --  21:          4488         703400  [Ljava.lang.Object;
>>>>>>> >> >> >> --  22:            70         395424
>>>>>>> >> >> >> [Ljava.nio.channels.SelectionKey;
>>>>>>> >> >> >> --  23:          2052         328320
>>>>>>> java.lang.reflect.Method
>>>>>>> >> >> >> --  24:          6600         316800
>>>>>>> >> >> >> org.apache.giraph.utils.ByteArrayVertexIdMessages
>>>>>>> >> >> >> --  25:          5781         277488  java.util.HashMap$Node
>>>>>>> >> >> >> --  26:          5651         271248
>>>>>>> java.util.Hashtable$Entry
>>>>>>> >> >> >> --  27:          6604         211328
>>>>>>> >> >> >> org.apache.giraph.factories.DefaultMessageValueFactory
>>>>>>> >> >> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable:
>>>>>>> Execution of
>>>>>>> >> >> >> callable failed
>>>>>>> >> >> >> java.lang.RuntimeException: call: execution of IO command
>>>>>>> >> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>>>>>>> failed!
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:115)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:36)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.utils.LogSta
>>>>>>> cktraceCallable.call(LogStacktraceCallable.java:67)
>>>>>>> >> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> java.util.concurrent.ThreadPoo
>>>>>>> lExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> java.util.concurrent.ThreadPoo
>>>>>>> lExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>> >> >> >> at java.lang.Thread.run(Thread.java:745)
>>>>>>> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer
>>>>>>> >> >> >> underflow.
>>>>>>> >> >> >> at com.esotericsoftware.kryo.io.I
>>>>>>> nput.require(Input.java:199)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >> com.esotericsoftware.kryo.io.U
>>>>>>> nsafeInput.readLong(UnsafeInput.java:112)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> com.esotericsoftware.kryo.io.K
>>>>>>> ryoDataInput.readLong(KryoDataInput.java:91)
>>>>>>> >> >> >> at
>>>>>>> >> >> >> org.apache.hadoop.io.LongWrita
>>>>>>> ble.readFields(LongWritable.java:47)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.
>>>>>>> java:245)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPa
>>>>>>> rtitionStore.java:278)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.
>>>>>>> java:234)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.loadPartitionData(DiskBackedPartitionS
>>>>>>> tore.java:311)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.command.
>>>>>>> LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:99)
>>>>>>> >> >> >> ... 6 more
>>>>>>> >> >> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager:
>>>>>>> uncaughtException:
>>>>>>> >> >> >> OverrideExceptionHandler on thread ooc-io-0, msg = call:
>>>>>>> execution
>>>>>>> >> >> >> of
>>>>>>> >> >> >> IO command LoadPartitionIOCommand: (partitionId = 234,
>>>>>>> superstep =
>>>>>>> >> >> >> 2)
>>>>>>> >> >> >> failed!, exiting...
>>>>>>> >> >> >> java.lang.RuntimeException: call: execution of IO command
>>>>>>> >> >> >> LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
>>>>>>> failed!
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:115)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:36)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.utils.LogSta
>>>>>>> cktraceCallable.call(LogStacktraceCallable.java:67)
>>>>>>> >> >> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> java.util.concurrent.ThreadPoo
>>>>>>> lExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> java.util.concurrent.ThreadPoo
>>>>>>> lExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>> >> >> >> at java.lang.Thread.run(Thread.java:745)
>>>>>>> >> >> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer
>>>>>>> >> >> >> underflow.
>>>>>>> >> >> >> at com.esotericsoftware.kryo.io.I
>>>>>>> nput.require(Input.java:199)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >> com.esotericsoftware.kryo.io.U
>>>>>>> nsafeInput.readLong(UnsafeInput.java:112)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> com.esotericsoftware.kryo.io.K
>>>>>>> ryoDataInput.readLong(KryoDataInput.java:91)
>>>>>>> >> >> >> at
>>>>>>> >> >> >> org.apache.hadoop.io.LongWrita
>>>>>>> ble.readFields(LongWritable.java:47)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.
>>>>>>> java:245)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPa
>>>>>>> rtitionStore.java:278)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.
>>>>>>> java:234)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.data.Dis
>>>>>>> kBackedPartitionStore.loadPartitionData(DiskBackedPartitionS
>>>>>>> tore.java:311)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.command.
>>>>>>> LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
>>>>>>> >> >> >> at
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> org.apache.giraph.ooc.OutOfCor
>>>>>>> eIOCallable.call(OutOfCoreIOCallable.java:99)
>>>>>>> >> >> >> ... 6 more
>>>>>>> >> >> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker:
>>>>>>> unregisterHealth:
>>>>>>> >> >> >> Got
>>>>>>> >> >> >> failure, unregistering health on
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> /_hadoopBsp/giraph_yarn_applic
>>>>>>> ation_1478342673283_0009/_applicationAttemptsDir/0/_superste
>>>>>>> pDir/1/_workerHealthyDir/datanode6.webmeup.com_5
>>>>>>> >> >> >> on superstep 1
>>>>>>> >> >> >>
>>>>>>> >> >> >> We looked into one thread
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> http://mail-archives.apache.or
>>>>>>> g/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhz
>>>>>>> qOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
>>>>>>> >> >> >> but it is rather old and at that time the answer was "do
>>>>>>> not use it
>>>>>>> >> >> >> yet".
>>>>>>> >> >> >> (see reply
>>>>>>> >> >> >>
>>>>>>> >> >> >>
>>>>>>> >> >> >> http://mail-archives.apache.or
>>>>>>> g/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TC
>>>>>>> wOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
>>>>>>> >> >> >> Does it hold today? We would like to use new advanced
>>>>>>> adaptive OOC
>>>>>>> >> >> >> approach if possible...
>>>>>>> >> >> >>
>>>>>>> >> >> >> Thank you in advance, any help or hint would be really
>>>>>>> appreciated.
>>>>>>> >> >> >>
>>>>>>> >> >> >> Best Regards,
>>>>>>> >> >> >> Denis Dudinski
>>>>>>> >> >> >
>>>>>>> >> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message