From: Eli Reisman
Date: Mon, 10 Dec 2018 19:30:27 -0800
Subject: Re: 4-profiles calculus on big graphs
To: user@giraph.apache.org

Hi Cristina,

First of all, the YARN support for Giraph is not well maintained right now, so it's going to be rough around the edges. Thanks for your detailed post, there's lots of good info there. Some ideas off the top of my head:

- I think the buffers you're setting (especially at the netty level, but possibly also the Giraph buffers) are probably a bit big for the cluster you're running.
- You can upload app-level JARs and dependencies to the YARN cache rather than putting them in with the Hadoop lib jars. There's a command line arg to specify the local copies to upload to the cache when you run your job; see the sketch after this list.
- From your logs, it looks like you're losing a node sometime during superstep 2 and Giraph isn't handling the failure properly. My suggestion is to try more YARN nodes, more memory, and fewer resources devoted to buffers in the configs; see if you can identify anywhere you might be creating message amplification without realizing it; and consider trying a run on a non-YARN Hadoop cluster if it's feasible.
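
As a rough sketch of that second point (I'm writing the flag from memory, so please double-check it against the usage output of your Giraph build), the cache upload looks something like:

    # hedged sketch: ship the app jar and its Giraph dependency to the YARN
    # distributed cache instead of copying them into Hadoop's lib dirs
    giraph 4Profiles-0.0.1-SNAPSHOT.jar \
      it.uniroma1.di.fourprofiles.worker.superstep0.gas1.Worker_Superstep0_GAS1 \
      -yj 4Profiles-0.0.1-SNAPSHOT.jar,giraph-core-1.3.0-SNAPSHOT.jar \
      ...   # rest of your arguments unchanged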

Hope that helps, good luck with your thesis work!
Eli

On Sat, Dec 8, 2018 at 11:17 AM Cristina Bovi <cristybo@gmail.com> wrote:
Hi, for my master thesis in computer science I succeeded in implementing the 4-profiles calculation (https://arxiv.org/abs/1510.02215 - http://eelenberg.github.io/Elenberg4profileWWW16.pdf) using giraph-1.3.0-SNAPSHOT (compiled with the -Phadoop_yarn profile) and hadoop-2.8.4.

I configured a cluster on Amazon EC2 composed of 1 namenode and 5 datanodes using t2.2xlarge (32 GB, 8 vCPU) instances, and I obtained the results described in the attached file (also available here: https://we.tl/t-7DuNJSSuN3) with input graphs of small/medium size.

If I feed my Giraph program bigger input graphs (e.g. http://snap.stanford.edu/data/web-NotreDame.html), in some cases I get many errors related to netty and the YARN application FAILS; in other cases the YARN application remains in a RUNNING UNDEFINED state (I then killed it instead of waiting for the default timeout) with apparently no error at all. I also tried m5.4xlarge (64 GB, 16 vCPU) instances but I ran into the same problems. I reported the log errors of the first case here:

- log of errors from the Giraph workers on the datanodes, pasted here: https://pastebin.com/CGHUd0za (same errors on all datanodes)
- log of errors from the Giraph master, pasted here: https://pastebin.com/JXYN6y4L

I'm quite sure that the errors are not related to insufficient memory on the EC2 instances, because in the log I always saw messages like "(free/total/max) = 23038.28M / 27232.00M / 27232.00M". Please help me, because my master thesis is blocked on this problem :-(

This is an example of the command I used to run Giraph; could you please check whether the parameters I used are correct? Any other tuning advice will be appreciated!

giraph 4Profiles-0.0.1-SNAPSHOT.jar it.uniroma1.di.fourprofiles.worker.superstep0.gas1.Worker_Superstep0_GAS1
-ca giraph.numComputeThreads=8 // Since t2.2xlarge has 8 cores, is it correct to set these parameters to 8?
-ca giraph.numInputThreads=8
-ca giraph.numOutputThreads=8

-w 8 // I set 8 workers since:
     //   - there are 5 datanodes on EC2
     //   - every datanode is configured for max 2 containers, in order to reduce messages between datanodes
     //   - 2 containers are reserved for the application master and the Giraph master
     //   - (5 datanodes * 2 max containers) - 2 reserved = 8 workers
     // Is this reasoning correct?

-yh 15360 // I set 15360 since it corresponds to
          //   - the yarn.scheduler.minimum-allocation-mb property in yarn-site.xml
          //   - the mapreduce.map.memory.mb property in mapred-site.xml
          // Is this reasoning correct?

-ca giraph.pure.yarn.job=true
-mc it.uniroma1.di.fourprofiles.master.Master_FourProfiles
-ca io.edge.reverse.duplicator=true
-eif it.uniroma1.di.fourprofiles.io.format.IntEdgeData_TextEdgeInputFormat_ReverseEdgeDuplicator
-eip INPUT_GRAPHS/HU_edges.txt-processed
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat
-op output
-ca giraph.SplitMasterWorker=true
-ca giraph.messageCombinerClass=it.uniroma1.di.fourprofiles.worker.msgcombiner.Worker_MsgCombiner
-ca giraph.master.observers=it.uniroma1.di.fourprofiles.master.observer.Observer_FourProfiles
-ca giraph.metrics.enable=true
-ca giraph.useInputSplitLocality=false
-ca giraph.useBigDataIOForMessages=true
-ca giraph.useMessageSizeEncoding=true
-ca giraph.oneToAllMsgSending=true
-ca giraph.isStaticGraph=true
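
For what it's worth, this is roughly how my launch script derives the worker count (just a sketch restating the arithmetic in the comments above; the variable names are mine):

    # sketch: derive -w from the cluster layout described above
    datanodes=5            # EC2 datanodes
    containers_per_node=2  # max containers per datanode, to limit cross-node traffic
    reserved=2             # application master + Giraph master
    workers=$(( datanodes * containers_per_node - reserved ))
    echo "-w $workers"     # prints: -w 8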

Furthermore, I tried the following netty parameters, but they didn't resolve the problems. Could you please tell me if I missed some important parameter, or maybe used one in the wrong way? I generalized the value passed to each netty parameter with a trivial formula, nettyFactor * defaultValue, where nettyFactor can be 1, 2, 3, ... (passed as a shell parameter):

-ca giraph.nettyAutoRead=true
-ca giraph.channelsPerServer=$((nettyFactor*1))
-ca giraph.nettyClientThreads=$((nettyFactor*4))
-ca giraph.nettyClientExecutionThreads=$((nettyFactor*8))
-ca giraph.nettyServerThreads=$((nettyFactor*16))
-ca giraph.nettyServerExecutionThreads=$((nettyFactor*8))
-ca giraph.clientSendBufferSize=$((nettyFactor*524288))
-ca giraph.clientReceiveBufferSize=$((nettyFactor*32768))
-ca giraph.serverSendBufferSize=$((nettyFactor*32768))
-ca giraph.serverReceiveBufferSize=$((nettyFactor*524288))
-ca giraph.vertexRequestSize=$((nettyFactor*524288))
-ca giraph.edgeRequestSize=$((nettyFactor*524288))
-ca giraph.msgRequestSize=$((nettyFactor*524288))
-ca giraph.nettyRequestEncoderBufferSize=$((nettyFactor*32768))
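
Concretely, my launch script wires this up roughly as follows (a sketch showing only a few of the flags; the rest follow the same pattern):

    #!/usr/bin/env bash
    # sketch: nettyFactor arrives as the first argument of the launch script
    nettyFactor=${1:-1}

    giraph 4Profiles-0.0.1-SNAPSHOT.jar \
      it.uniroma1.di.fourprofiles.worker.superstep0.gas1.Worker_Superstep0_GAS1 \
      -ca giraph.nettyServerThreads=$((nettyFactor*16)) \
      -ca giraph.clientSendBufferSize=$((nettyFactor*524288)) \
      -ca giraph.serverReceiveBufferSize=$((nettyFactor*524288))
      # ...remaining -ca flags exactly as listed above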

... I have other questions:
1)
This is my hadoop configuration: https://we.tl/t-t1ItNYFe7H Please check it, but I'm quite sure it is correct. I have only one question about it: since Giraph does not use "reduce", is it correct to assign 0 MB to mapreduce.reduce.memory.mb in mapred-site.xml?

2)
In order to avoid a ClassNotFoundException, I copied the jar of my Giraph application and all the Giraph jars from $GIRAPH_HOME and $GIRAPH_HOME/lib to $HADOOP_HOME/share/hadoop/yarn/lib. Is there a better solution?
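
Concretely, the workaround I run on every node looks like this (a sketch of exactly what I described above; the paths are the ones from my question):

    # copy the app jar plus all Giraph jars into Hadoop's YARN classpath
    cp 4Profiles-0.0.1-SNAPSHOT.jar \
       "$GIRAPH_HOME"/*.jar "$GIRAPH_HOME"/lib/*.jar \
       "$HADOOP_HOME"/share/hadoop/yarn/lib/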

3)
Last but not least: here https://we.tl/t-tdhuZFsVJW you can find the complete hadoop/yarn log of my Giraph program with the graph http://snap.stanford.edu/data/web-NotreDame.html as input. In this case the YARN application remains in the RUNNING UNDEFINED state.

Thanks
--
Cristina Bovi