hadoop-common-user mailing list archives

From Varun Kapoor <rez...@hortonworks.com>
Subject Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?
Date Thu, 09 Feb 2012 03:45:12 GMT
Your general procedure sounds correct (i.e. dropping your newly built .jar
into $HD_HOME/lib/), but to make sure it's getting picked up, you should
explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment
variable; here's mine, as an example:

export HADOOP_CLASSPATH=".:./build/*.jar"
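
A minimal sketch of what adding $HD_HOME/lib/ could look like, assuming a
POSIX shell and that HD_HOME points at the Hadoop install root as elsewhere
in this thread. The helper name jar_classpath is hypothetical and not part
of Hadoop. It lists each jar explicitly, because a quoted *.jar pattern is
not expanded by the shell:

```shell
# Hypothetical helper (not part of Hadoop): print every .jar in a directory
# as a colon-separated list suitable for a classpath variable.
jar_classpath() {
  jar_dir=$1
  jar_list=""
  for jar in "$jar_dir"/*.jar; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$jar" ] || continue
    jar_list="${jar_list:+$jar_list:}$jar"
  done
  printf '%s\n' "$jar_list"
}

# Only export when HD_HOME is set; this replaces any earlier HADOOP_CLASSPATH.
if [ -n "${HD_HOME:-}" ]; then
  HADOOP_CLASSPATH=$(jar_classpath "$HD_HOME/lib")
  export HADOOP_CLASSPATH
fi
```

Printing the variable afterwards (echo "$HADOOP_CLASSPATH") is a quick way to
confirm that the patched jar is actually listed.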

About your second point, you certainly need to copy this newly patched .jar
to every node in your cluster, because my patch changes the value of a
couple of metrics emitted TO gmetad (FROM all the nodes in the cluster), so
without copying it over to every node in the cluster, gmetad will still
likely receive some bad metrics.
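
As a rough sketch of that copy step, assuming passwordless scp to each node
and a plain text file with one hostname per line (conf/slaves is the usual
candidate in Hadoop installs of this era). The function name push_jar and the
DRY_RUN switch are hypothetical conveniences, not Hadoop tooling:

```shell
# Hypothetical helper: copy one jar to the same directory on every host
# named in a host list file.
# Usage: push_jar <jar> <hosts-file> <remote-dir>
# Set DRY_RUN=1 to print the scp commands instead of running them.
push_jar() {
  jar=$1
  hosts_file=$2
  remote_dir=$3
  # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines and comment lines in the host list.
    case $host in
      ''|'#'*) continue ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "scp $jar $host:$remote_dir/"
    else
      # Redirect stdin so scp cannot consume the rest of the host list.
      scp "$jar" "$host:$remote_dir/" </dev/null ||
        echo "copy to $host failed" >&2
    fi
  done < "$hosts_file"
}

# Example (paths follow the thread's naming, adjust to your layout):
#   DRY_RUN=1 push_jar "$HD_HOME/build/patched-core-hadoop.jar" \
#       "$HD_HOME/conf/slaves" "$HD_HOME/lib"
```

The Hadoop daemons on each node would still need a restart after the copy
before they load the new jar.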

Varun

On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek <masmertoz@gmail.com> wrote:

> I will need your help. Please confirm if the following procedure is right.
> I have a dev environment where I pimp my scheduler (no hadoop running) and
> a small cluster environment where the changes (jars) are deployed with some
> scripts; however, I have never compiled the whole hadoop from source, so I
> do not know if I am doing it right. I've done it as follows:
>
> a) apply a patch
> b) cd $HD_HOME; ant
> c) copy $HD_HOME/*build*/patched-core-hadoop.jar -> cluster:/$HD_HOME/*lib*
> d) run $HD_HOME/bin/start-all.sh
>
> Is this enough? When I tried to test "hadoop dfs -ls /" I could see that the
> new jar was not loaded and instead a jar from
> $HD_HOME/*share*/hadoop-20.205.0.jar
> was taken.
> Should I copy the entire hadoop folder to all nodes and reconfigure the
> entire cluster for the new build, or is it enough if I configure it just on
> the node where gmetad will run?
>
> On 8 February 2012 06:33, Varun Kapoor <reznor@hortonworks.com> wrote:
>
> > I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my
> > reply, and of course the mailing list did not accept the attachment.
> >
> > I plan on opening JIRAs for this tomorrow, but till then, here are links to
> > the 2 patches (from my Dropbox account):
> >
> >   - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch
> >   - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch
> >
> > Here's hoping this works for you,
> >
> > Varun
> > On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek <masmertoz@gmail.com> wrote:
> >
> > > Varun, have I missed your link to the patches? I have tried to search for
> > > them on JIRA but I did not find them. Can you repost the links for these
> > > two patches?
> > >
> > > Thank you..
> > >
> > > On 7 February 2012 20:36, Varun Kapoor <reznor@hortonworks.com> wrote:
> > >
> > > > I'm sorry to hear that gmetad cores continuously for you guys. Since I'm
> > > > not seeing that behavior, I'm going to just put out the 2 possible patches
> > > > you could apply and wait to hear back from you. :)
> > > >
> > > > Option 1
> > > >
> > > > * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file
> > > >   (http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markup
> > > >   in my setup) in your Hadoop sources and rebuild Hadoop.
> > > >
> > > > Option 2
> > > >
> > > > * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c and
> > > > rebuild gmetad.
> > > >
> > > > Only 1 of these 2 fixes is required, and it would help me if you could
> > > > first try Option 1 and let me know if that fixes things for you.
> > > >
> > > > Varun
> > > >
> > > > On Mon, Feb 6, 2012 at 10:36 PM, mete <efkarr@gmail.com> wrote:
> > > >
> > > >> Same as Merto's situation here: it always overflows a short time after
> > > >> the restart. Without the hadoop metrics enabled everything is smooth.
> > > >> Regards
> > > >>
> > > >> Mete
> > > >>
> > > >> On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek <masmertoz@gmail.com> wrote:
> > > >>
> > > >> > I have tried to run it but it keeps crashing.
> > > >> >
> > > >> > >   - When you start gmetad and Hadoop is not emitting metrics, everything
> > > >> > >   is peachy.
> > > >> > >
> > > >> >
> > > >> > Right, running just ganglia without running hadoop jobs seems stable for
> > > >> > at least a day..
> > > >> >
> > > >> >
> > > >> > >   - When you start Hadoop (and it thus starts emitting metrics), gmetad
> > > >> > >   cores.
> > > >> > >
> > > >> >
> > > >> > True, with the following error: *** stack smashing detected ***: gmetad
> > > >> > terminated \n Segmentation fault
> > > >> >
> > > >> > >     - On my MacBookPro, it's a SIGABRT due to a buffer overflow.
> > > >> > >
> > > >> > > I believe this is happening for everyone. What I would like for you to
> > > >> > > try out are the following 2 scenarios:
> > > >> > >
> > > >> > >   - Once gmetad cores, if you start it up again, does it core again? Does
> > > >> > >   this process repeat ad infinitum?
> > > >> > >
> > > >> > >     - On my MBP, the core is a one-time thing, and restarting gmetad
> > > >> > >      after the first core makes things run perfectly smoothly.
> > > >> > >         - I know others are saying this core occurs continuously, but
> > > >> > >         they were all using ganglia-3.1.x, and I'm interested in how
> > > >> > >         ganglia-3.2.0 behaves for you.
> > > >> > >
> > > >> >
> > > >> > It cores every time I run it. The difference is just that sometimes a
> > > >> > segmentation fault appears instantly, and sometimes it appears after a
> > > >> > random time, let's say after a minute of running gmetad and collecting
> > > >> > data.
> > > >> >
> > > >> >
> > > >> > >   - If you start Hadoop first (so gmetad is not running when the first
> > > >> > >   batch of Hadoop metrics are emitted) and THEN start gmetad after a
> > > >> > >   few seconds, do you still see gmetad coring?
> > > >> > >
> > > >> >
> > > >> > Yes
> > > >> >
> > > >> >
> > > >> > >      - On my MBP, this sequence works perfectly fine, and there are no
> > > >> > >      gmetad cores whatsoever.
> > > >> > >
> > > >> >
> > > >> > I have tested this scenario with 2 working nodes, so two gmond plus the
> > > >> > head gmond on the server where gmetad is located. I have checked and all
> > > >> > of them are version 3.2.0.
> > > >> >
> > > >> > Hope it helps..
> > > >> >
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Bear in mind that this only addresses the gmetad coring issue - the
> > > >> > > warnings emitted about '4.9E-324' being out of range will continue, but
> > > >> > > I know what's causing that as well (and hope that my patch fixes it for
> > > >> > > free).
> > > >> > >
> > > >> > > Varun
> > > >> > > On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek <masmertoz@gmail.com> wrote:
> > > >> > >
> > > >> > > > Yes, I am encountering the same problems and, like Mete said, a few
> > > >> > > > seconds after restarting a segmentation fault appears. Here is my conf:
> > > >> > > > <http://pastebin.com/VgBjp08d>
> > > >> > > >
> > > >> > > > And here is some info from /var/log/messages (Ubuntu Server 10.10):
> > > >> > > >
> > > >> > > > > kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
> > > >> > > > > sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
> > > >> > > > >
> > > >> > > >
> > > >> > > > When I compiled gmetad I used the following command:
> > > >> > > >
> > > >> > > > ./configure --with-gmetad --sysconfdir=/etc/ganglia
> > > >> > > > > CPPFLAGS="-I/usr/local/rrdtool-1.4.7/include"
> > > >> > > > > CFLAGS="-I/usr/local/rrdtool-1.4.7/include"
> > > >> > > > > LDFLAGS="-L/usr/local/rrdtool-1.4.7/lib"
> > > >> > > > >
> > > >> > > >
> > > >> > > > The same was tried with rrdtool 1.4.5. My current ganglia version is
> > > >> > > > 3.2.0 and, like Mete, I tried it with version 3.1.7 but without success.
> > > >> > > >
> > > >> > > > Hope we will find a solution soon.
> > > >> > > > Thank you
> > > >> > > >
> > > >> > > >
> > > >> > > > On 6 February 2012 20:09, mete <efkarr@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > Hello,
> > > >> > > > > I also face this issue when using GangliaContext31 and hadoop-1.0.0,
> > > >> > > > > and ganglia 3.1.7 (also tried 3.1.2). I continuously get buffer
> > > >> > > > > overflows as soon as I restart gmetad.
> > > >> > > > > Regards
> > > >> > > > > Mete
> > > >> > > > >
> > > >> > > > > On Mon, Feb 6, 2012 at 7:42 PM, Vitthal "Suhas" Gogate <gogate@hortonworks.com> wrote:
> > > >> > > > >
> > > >> > > > > > I assume you have seen the following information on Hadoop twiki,
> > > >> > > > > > http://wiki.apache.org/hadoop/GangliaMetrics
> > > >> > > > > >
> > > >> > > > > > So do you use GangliaContext31 in hadoop-metrics2.properties?
> > > >> > > > > >
> > > >> > > > > > We use Ganglia 3.2 with Hadoop 20.205 and it works fine. (I remember
> > > >> > > > > > seeing gmetad sometimes go down due to a buffer overflow problem when
> > > >> > > > > > hadoop starts pumping in the metrics, but restarting works.) Let me
> > > >> > > > > > know if you face the same problem.
> > > >> > > > > >
> > > >> > > > > > --Suhas
> > > >> > > > > >
> > > >> > > > > > Additionally, the Ganglia protocol changed significantly between
> > > >> > > > > > Ganglia 3.0 and Ganglia 3.1 (i.e., Ganglia 3.1 is not compatible with
> > > >> > > > > > Ganglia 3.0 clients). This caused Hadoop to not work with Ganglia 3.1;
> > > >> > > > > > there is a patch available for this, HADOOP-4675. As of November 2010,
> > > >> > > > > > this patch has been rolled into the mainline for 0.20.2 and later. To
> > > >> > > > > > use the Ganglia 3.1 protocol in place of the 3.0, substitute
> > > >> > > > > > org.apache.hadoop.metrics.ganglia.GangliaContext31 for
> > > >> > > > > > org.apache.hadoop.metrics.ganglia.GangliaContext in the
> > > >> > > > > > hadoop-metrics.properties lines above.
> > > >> > > > > >
> > > >> > > > > > On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek <masmertoz@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > > I spent a lot of time trying to figure it out, however I did not find
> > > >> > > > > > > a solution. Problems from the logs pointed me to some bugs in the
> > > >> > > > > > > rrdupdate tool; however, I tried to solve it with different versions
> > > >> > > > > > > of ganglia and rrdtool but the error is the same. A segmentation
> > > >> > > > > > > fault appears after the following lines if I run gmetad in debug
> > > >> > > > > > > mode...
> > > >> > > > > > >
> > > >> > > > > > > "Created rrd
> > > >> > > > > > > /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd"
> > > >> > > > > > > "Created rrd
> > > >> > > > > > > /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd"
> > > >> > > > > > >
> > > >> > > > > > > which I suppose are generated from MetricsSystemImpl.java (Is there
> > > >> > > > > > > any way just to disable these two metrics?)
> > > >> > > > > > >
> > > >> > > > > > > From the /var/log/messages there are a lot of errors:
> > > >> > > > > > >
> > > >> > > > > > > "xxx gmetad[15217]: RRD_update
> > > >> > > > > > > (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
> > > >> > > > > > > converting  '4.9E-324' to float: Numerical result out of range"
> > > >> > > > > > > "xxx gmetad[15217]: RRD_update
> > > >> > > > > > > (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
> > > >> > > > > > > converting  '4.9E-324' to float: Numerical result out of range"
> > > >> > > > > > >
> > > >> > > > > > > so probably there are some conversion issues? Where should I look for
> > > >> > > > > > > the solution? Would you rather suggest using ganglia 3.0.x with the
> > > >> > > > > > > old protocol and leaving versions >3.1 for later releases?
> > > >> > > > > > >
> > > >> > > > > > > Any help is really appreciated...
> > > >> > > > > > >
> > > >> > > > > > > On 1 February 2012 04:04, Merto Mertek <masmertoz@gmail.com> wrote:
> > > >> > > > > > >
> > > >> > > > > > > > I would be glad to hear that too.. I've set up the following:
> > > >> > > > > > > >
> > > >> > > > > > > > Hadoop 0.20.205
> > > >> > > > > > > > Ganglia Front  3.1.7
> > > >> > > > > > > > Ganglia Back *(gmetad)* 3.1.7
> > > >> > > > > > > > RRDTool <http://www.rrdtool.org/> 1.4.5. -> I had some troubles
> > > >> > > > > > > > installing 1.4.4
> > > >> > > > > > > >
> > > >> > > > > > > > Ganglia works only when hadoop is not running, so metrics are not
> > > >> > > > > > > > published to the gmetad node (conf with the new
> > > >> > > > > > > > hadoop-metrics2.properties). When hadoop is started, a segmentation
> > > >> > > > > > > > fault appears in the gmetad daemon:
> > > >> > > > > > > >
> > > >> > > > > > > > sudo gmetad -d 2
> > > >> > > > > > > > .......
> > > >> > > > > > > > Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
> > > >> > > > > > > > Updating host xxx, metric bytes_in
> > > >> > > > > > > > Updating host xxx, metric bytes_out
> > > >> > > > > > > > Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
> > > >> > > > > > > > Created rrd
> > > >> > > > > > > > /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
> > > >> > > > > > > > Segmentation fault
> > > >> > > > > > > >
> > > >> > > > > > > > And some info from the apache log <http://pastebin.com/nrqKRtKJ>..
> > > >> > > > > > > >
> > > >> > > > > > > > Can someone suggest a ganglia version that is tested with hadoop
> > > >> > > > > > > > 0.20.205? I will try to sort it out, however it does not seem to be
> > > >> > > > > > > > a trivial problem.
> > > >> > > > > > > >
> > > >> > > > > > > > Thank you
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On 2 December 2011 12:32, praveenesh kumar <praveenesh@gmail.com> wrote:
> > > >> > > > > > > >
> > > >> > > > > > > >> Or do I have to apply some hadoop patch for this?
> > > >> > > > > > > >>
> > > >> > > > > > > >> Thanks,
> > > >> > > > > > > >> Praveenesh
> > > >> > > > > > > >>



-- 


http://www.hadoopsummit.org/
