hadoop-common-user mailing list archives

From Varun Kapoor <rez...@hortonworks.com>
Subject Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Date Fri, 10 Feb 2012 16:10:45 GMT
Hey Merto,

Any luck getting the patch running on your cluster?

In case you're interested, there's now a JIRA for this:
https://issues.apache.org/jira/browse/HADOOP-8052.

Varun

On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor <reznor@hortonworks.com> wrote:

> Your general procedure sounds correct (i.e. dropping your newly built .jar
> into $HD_HOME/lib/), but to make sure it's getting picked up, you should
> explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment
> variable; here's mine, as an example:
>
> export HADOOP_CLASSPATH=".:./build/*.jar"
>
> Regarding your second point: you certainly need to copy this newly patched
> .jar to every node in your cluster, because my patch changes the values of a
> couple of metrics emitted TO gmetad (FROM all the nodes in the cluster), so
> without copying it to every node in the cluster, gmetad will still likely
> receive some bad metrics.
>
> Varun
>
>
> On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek <masmertoz@gmail.com> wrote:
>
>> I will need your help. Please confirm that the following procedure is right.
>> I have a dev environment where I pimp my scheduler (no hadoop running) and
>> a small cluster environment where the changes (jars) are deployed with some
>> scripts; however, I have never compiled the whole hadoop from source, so I
>> do not know if I am doing it right. I've done it as follows:
>>
>> a) apply a patch
>> b) cd $HD_HOME; ant
>> c) copy $HD_HOME/*build*/patched-core-hadoop.jar ->
>> cluster:/$HD_HOME/*lib*
>> d) run $HD_HOME/bin/start-all.sh
>>
>> Is this enough? When I tried to test "hadoop dfs -ls /" I could see that the
>> new jar was not loaded and instead a jar from
>> $HD_HOME/*share*/hadoop-20.205.0.jar was taken..
>> Should I copy the entire hadoop folder to all nodes and reconfigure the
>> entire cluster for the new build, or is it enough if I configure it just on
>> the node where gmetad will run?
>>
>>
>>
>>
>>
>>
>> On 8 February 2012 06:33, Varun Kapoor <reznor@hortonworks.com> wrote:
>>
>> > I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my
>> > reply, and of course the mailing list did not accept the attachment.
>> >
>> > I plan on opening JIRAs for this tomorrow, but till then, here are links
>> > to the 2 patches (from my Dropbox account):
>> >
>> >   - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch
>> >   - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch
>> >
>> > Here's hoping this works for you,
>> >
>> > Varun
>> > On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek <masmertoz@gmail.com> wrote:
>> >
>> > > Varun, have I missed your link to the patches? I have tried to search
>> > > them on jira but I did not find them.. Can you repost the link for these
>> > > two patches?
>> > >
>> > > Thank you..
>> > >
>> > > On 7 February 2012 20:36, Varun Kapoor <reznor@hortonworks.com> wrote:
>> > >
>> > > > I'm sorry to hear that gmetad cores continuously for you guys. Since
>> > > > I'm not seeing that behavior, I'm going to just put out the 2 possible
>> > > > patches you could apply and wait to hear back from you. :)
>> > > >
>> > > > Option 1
>> > > >
>> > > > * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file
>> > > > (http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markup
>> > > > in my setup) in your Hadoop sources and rebuild Hadoop.
>> > > >
>> > > > Option 2
>> > > >
>> > > > * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c and
>> > > > rebuild gmetad.
>> > > >
>> > > > Only 1 of these 2 fixes is required, and it would help me if you could
>> > > > first try Option 1 and let me know if that fixes things for you.
>> > > >
>> > > > Varun
>> > > >
>> > > > On Mon, Feb 6, 2012 at 10:36 PM, mete <efkarr@gmail.com> wrote:
>> > > >
>> > > >> Same with Merto's situation here, it always overflows a short time after
>> > > >> the restart. Without the hadoop metrics enabled everything is smooth.
>> > > >> Regards
>> > > >>
>> > > >> Mete
>> > > >>
>> > > >> On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek <masmertoz@gmail.com> wrote:
>> > > >>
>> > > >> > I have tried to run it but it keeps crashing..
>> > > >> >
>> > > >> > >  - When you start gmetad and Hadoop is not emitting metrics,
>> > > >> > >    everything is peachy.
>> > > >> >
>> > > >> > Right, running just ganglia without running hadoop jobs seems stable
>> > > >> > for at least a day..
>> > > >> >
>> > > >> >
>> > > >> > >  - When you start Hadoop (and it thus starts emitting metrics),
>> > > >> > >    gmetad cores.
>> > > >> >
>> > > >> > True, with the following error: *** stack smashing detected ***:
>> > > >> > gmetad terminated \n Segmentation fault
>> > > >> >
>> > > >> > >     - On my MacBookPro, it's a SIGABRT due to a buffer overflow.
>> > > >> > >
>> > > >> > > I believe this is happening for everyone. What I would like for you
>> > > >> > > to try out are the following 2 scenarios:
>> > > >> > >
>> > > >> > >   - Once gmetad cores, if you start it up again, does it core
>> > > >> > >     again? Does this process repeat ad infinitum?
>> > > >> > >
>> > > >> > >     - On my MBP, the core is a one-time thing, and restarting
>> > > >> > >       gmetad after the first core makes things run perfectly
>> > > >> > >       smoothly.
>> > > >> > >         - I know others are saying this core occurs continuously,
>> > > >> > >           but they were all using ganglia-3.1.x, and I'm interested
>> > > >> > >           in how ganglia-3.2.0 behaves for you.
>> > > >> > >
>> > > >> >
>> > > >> > It cores every time I run it. The difference is just that sometimes
>> > > >> > the segmentation fault appears instantly, and sometimes it appears
>> > > >> > after a random time... let's say after a minute of running gmetad and
>> > > >> > collecting data.
>> > > >> >
>> > > >> >
>> > > >> > >   - If you start Hadoop first (so gmetad is not running when the
>> > > >> > >     first batch of Hadoop metrics are emitted) and THEN start
>> > > >> > >     gmetad after a few seconds, do you still see gmetad coring?
>> > > >> > >
>> > > >> >
>> > > >> > Yes
>> > > >> >
>> > > >> >
>> > > >> > >      - On my MBP, this sequence works perfectly fine, and there
>> > > >> > >        are no gmetad cores whatsoever.
>> > > >> > >
>> > > >> >
>> > > >> > I have tested this scenario with 2 worker nodes, so two gmonds plus
>> > > >> > the head gmond on the server where gmetad is located. I have checked,
>> > > >> > and all of them are versioned 3.2.0.
>> > > >> >
>> > > >> > Hope it helps..
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > >
>> > > >> > > Bear in mind that this only addresses the gmetad coring issue - the
>> > > >> > > warnings emitted about '4.9E-324' being out of range will continue,
>> > > >> > > but I know what's causing that as well (and hope that my patch fixes
>> > > >> > > it for free).
>> > > >> > >
>> > > >> > > Varun
>> > > >> > > On Mon, Feb 6, 2012 at 2:39 PM, Merto Mertek <masmertoz@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > Yes, I am encountering the same problems, and like Mete said, a
>> > > >> > > > few seconds after restarting a segmentation fault appears.. here
>> > > >> > > > is my conf.. <http://pastebin.com/VgBjp08d>
>> > > >> > > >
>> > > >> > > > And here is some info from /var/log/messages (ubuntu server 10.10):
>> > > >> > > >
>> > > >> > > > > kernel: [424447.140641] gmetad[26115] general protection ip:7f7762428fdb
>> > > >> > > > > sp:7f776362d370 error:0 in libgcc_s.so.1[7f776241a000+15000]
>> > > >> > > > >
>> > > >> > > >
>> > > >> > > > When I compiled gmetad I used the following command:
>> > > >> > > >
>> > > >> > > > > ./configure --with-gmetad --sysconfdir=/etc/ganglia
>> > > >> > > > > CPPFLAGS="-I/usr/local/rrdtool-1.4.7/include"
>> > > >> > > > > CFLAGS="-I/usr/local/rrdtool-1.4.7/include"
>> > > >> > > > > LDFLAGS="-L/usr/local/rrdtool-1.4.7/lib"
>> > > >> > > > >
>> > > >> > > >
>> > > >> > > > The same was tried with rrdtool 1.4.5. My current ganglia version
>> > > >> > > > is 3.2.0, and like Mete I tried it with version 3.1.7, but without
>> > > >> > > > success..
>> > > >> > > >
>> > > >> > > > Hope we will soon sort out a solution..
>> > > >> > > > thank you
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On 6 February 2012 20:09, mete <efkarr@gmail.com> wrote:
>> > > >> > > >
>> > > >> > > > > Hello,
>> > > >> > > > > I also face this issue when using GangliaContext31 and
>> > > >> > > > > hadoop-1.0.0, and ganglia 3.1.7 (also tried 3.1.2). I
>> > > >> > > > > continuously get buffer overflows as soon as I restart the
>> > > >> > > > > gmetad.
>> > > >> > > > > Regards
>> > > >> > > > > Mete
>> > > >> > > > >
>> > > >> > > > > On Mon, Feb 6, 2012 at 7:42 PM, Vitthal "Suhas" Gogate <gogate@hortonworks.com> wrote:
>> > > >> > > > >
>> > > >> > > > > > I assume you have seen the following information on the
>> > > >> > > > > > Hadoop twiki: http://wiki.apache.org/hadoop/GangliaMetrics
>> > > >> > > > > >
>> > > >> > > > > > So do you use GangliaContext31 in hadoop-metrics2.properties?
>> > > >> > > > > >
>> > > >> > > > > > We use Ganglia 3.2 with Hadoop 20.205 and it works fine (I
>> > > >> > > > > > remember seeing gmetad sometimes go down due to a buffer
>> > > >> > > > > > overflow problem when hadoop starts pumping in the metrics..
>> > > >> > > > > > but restarting works..). Let me know if you face the same
>> > > >> > > > > > problem?
>> > > >> > > > > >
>> > > >> > > > > > --Suhas
>> > > >> > > > > >
>> > > >> > > > > > Additionally, the Ganglia protocol changed significantly
>> > > >> > > > > > between Ganglia 3.0 and Ganglia 3.1 (i.e., Ganglia 3.1 is not
>> > > >> > > > > > compatible with Ganglia 3.0 clients). This caused Hadoop to
>> > > >> > > > > > not work with Ganglia 3.1; there is a patch available for
>> > > >> > > > > > this, HADOOP-4675. As of November 2010, this patch has been
>> > > >> > > > > > rolled into the mainline for 0.20.2 and later. To use the
>> > > >> > > > > > Ganglia 3.1 protocol in place of the 3.0 one, substitute
>> > > >> > > > > > org.apache.hadoop.metrics.ganglia.GangliaContext31 for
>> > > >> > > > > > org.apache.hadoop.metrics.ganglia.GangliaContext in the
>> > > >> > > > > > hadoop-metrics.properties lines above.
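[Editor's note: for concreteness, the substitution described above amounts to something like the following in hadoop-metrics.properties; the dfs context is shown as an example, and the gmond host/port are placeholders for your setup.]

```properties
# hadoop-metrics.properties -- emit dfs metrics using the Ganglia 3.1+
# wire protocol (replace localhost:8649 with your gmond address).
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=localhost:8649
```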
>> > > >> > > > > >
>> > > >> > > > > > On Fri, Feb 3, 2012 at 1:07 PM, Merto Mertek <masmertoz@gmail.com> wrote:
>> > > >> > > > > >
>> > > >> > > > > > > I spent a lot of time trying to figure it out; however, I
>> > > >> > > > > > > did not find a solution. Problems from the logs pointed me
>> > > >> > > > > > > to some bugs in the rrdupdate tool; however, I tried to
>> > > >> > > > > > > solve it with different versions of ganglia and rrdtool,
>> > > >> > > > > > > but the error is the same. The segmentation fault appears
>> > > >> > > > > > > after the following lines, if I run gmetad in debug mode...
>> > > >> > > > > > >
>> > > >> > > > > > > "Created rrd
>> > > >> > > > > > > /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.publish_max_time.rrd"
>> > > >> > > > > > > "Created rrd
>> > > >> > > > > > > /var/lib/ganglia/rrds/hdcluster/xxx/metricssystem.MetricsSystem.snapshot_max_time.rrd"
>> > > >> > > > > > >
>> > > >> > > > > > > which I suppose are generated from MetricsSystemImpl.java
>> > > >> > > > > > > (Is there any way just to disable these two metrics?)
>> > > >> > > > > > >
>> > > >> > > > > > > From the /var/log/messages there are a lot of errors:
>> > > >> > > > > > >
>> > > >> > > > > > > "xxx gmetad[15217]: RRD_update
>> > > >> > > > > > > (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.publish_imax_time.rrd):
>> > > >> > > > > > > converting '4.9E-324' to float: Numerical result out of range"
>> > > >> > > > > > > "xxx gmetad[15217]: RRD_update
>> > > >> > > > > > > (/var/lib/ganglia/rrds/hdc/xxx/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
>> > > >> > > > > > > converting '4.9E-324' to float: Numerical result out of range"
>> > > >> > > > > > >
>> > > >> > > > > > > so probably there are some conversion issues? Where should
>> > > >> > > > > > > I look for the solution? Would you rather suggest using
>> > > >> > > > > > > ganglia 3.0.x with the old protocol and leaving version
>> > > >> > > > > > > >3.1 for further releases?
>> > > >> > > > > > >
>> > > >> > > > > > > any help is really appreciated...
>> > > >> > > > > > >
>> > > >> > > > > > > On 1 February 2012 04:04, Merto Mertek <masmertoz@gmail.com> wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > > I would be glad to hear that too.. I've set up the following:
>> > > >> > > > > > > >
>> > > >> > > > > > > > Hadoop 0.20.205
>> > > >> > > > > > > > Ganglia Front  3.1.7
>> > > >> > > > > > > > Ganglia Back *(gmetad)* 3.1.7
>> > > >> > > > > > > > RRDTool <http://www.rrdtool.org/> 1.4.5 -> I had some
>> > > >> > > > > > > > troubles installing 1.4.4
>> > > >> > > > > > > >
>> > > >> > > > > > > > Ganglia works just in the case where hadoop is not
>> > > >> > > > > > > > running, so metrics are not published to the gmetad node
>> > > >> > > > > > > > (conf with the new hadoop-metrics2.properties). When
>> > > >> > > > > > > > hadoop is started, a segmentation fault appears in the
>> > > >> > > > > > > > gmetad daemon:
>> > > >> > > > > > > >
>> > > >> > > > > > > > sudo gmetad -d 2
>> > > >> > > > > > > > .......
>> > > >> > > > > > > > Updating host xxx, metric dfs.FSNamesystem.BlocksTotal
>> > > >> > > > > > > > Updating host xxx, metric bytes_in
>> > > >> > > > > > > > Updating host xxx, metric bytes_out
>> > > >> > > > > > > > Updating host xxx, metric metricssystem.MetricsSystem.publish_max_time
>> > > >> > > > > > > > Created rrd
>> > > >> > > > > > > > /var/lib/ganglia/rrds/hdcluster/hadoopmaster/metricssystem.MetricsSystem.publish_max_time.rrd
>> > > >> > > > > > > > Segmentation fault
>> > > >> > > > > > > >
>> > > >> > > > > > > > And some info from the apache log <http://pastebin.com/nrqKRtKJ>..
>> > > >> > > > > > > >
>> > > >> > > > > > > > Can someone suggest a ganglia version that is tested with
>> > > >> > > > > > > > hadoop 0.20.205? I will try to sort it out; however, it
>> > > >> > > > > > > > seems a not so trivial problem..
>> > > >> > > > > > > >
>> > > >> > > > > > > > Thank you
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > > On 2 December 2011 12:32, praveenesh kumar <praveenesh@gmail.com> wrote:
>> > > >> > > > > > > >
>> > > >> > > > > > > >> Or do I have to apply some hadoop patch for this?
>> > > >> > > > > > > >>
>> > > >> > > > > > > >> Thanks,
>> > > >> > > > > > > >> Praveenesh
>> > > >> > > > > > > >>
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>> >
>> >
>> >
>>
>
>
>
>
>


-- 


http://www.hadoopsummit.org/
