chukwa-dev mailing list archives

From Jiaqi Tan <tanji...@gmail.com>
Subject Re: units in MDL and HICC
Date Fri, 22 May 2009 12:49:47 GMT
Hi,

To my mind, part of the peculiarity is also that Java is supposed to
be platform-neutral, but /proc clearly is not: Windows has no procfs,
and even different versions of Linux can implement procfs
differently.  On licensing, sar is also GPL'ed, so I guess we can't
just bundle one particular version of sar with Chukwa and force
everyone to use it. Even if we could, that wouldn't work on non-Unix
systems. As a side note, I think the Demux processors are
"independent" of semantics only because nobody looks at the output
from Demux: it isn't meant for human presentation/consumption until
past the MDL.

In fact, I did have another thought: the Exec adaptor currently also
picks up the column names from sar, and the processor does see them.
Maybe we could pass the column names along as some form of metadata
and have the MDL look at them to determine whether any unit
conversions are necessary?
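
To make that concrete, here is a rough sketch of what I mean. It is
not existing Chukwa code: the class ColumnUnitNormalizer, its methods,
and the canonical field name "rx_bytes_per_sec" are all made up for
illustration, and the two sar headers (rxbyt/s vs. rxkB/s) are just one
example of a column whose unit differs between sysstat versions. I am
also assuming kB means 1024 bytes here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Chukwa: maps a source column header
// to a canonical field name and a multiplier that converts its unit.
class ColumnUnitNormalizer {

    // One rule per known header: where the value goes and how to scale it.
    private static final class Rule {
        final String canonicalName;
        final double factor;

        Rule(String canonicalName, double factor) {
            this.canonicalName = canonicalName;
            this.factor = factor;
        }
    }

    private final Map<String, Rule> rules = new LinkedHashMap<>();

    // Register "if you see this header, emit canonicalName = value * factor".
    ColumnUnitNormalizer addRule(String header, String canonicalName, double factor) {
        rules.put(header, new Rule(canonicalName, factor));
        return this;
    }

    // Apply the rules to one row; headers without a rule pass through unchanged.
    Map<String, Double> normalize(String[] headers, String[] values) {
        if (headers.length != values.length) {
            throw new IllegalArgumentException("header/value count mismatch");
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < headers.length; i++) {
            double value = Double.parseDouble(values[i]);
            Rule rule = rules.get(headers[i]);
            if (rule == null) {
                out.put(headers[i], value);
            } else {
                out.put(rule.canonicalName, value * rule.factor);
            }
        }
        return out;
    }
}
```

With rules for both "rxbyt/s" (factor 1) and "rxkB/s" (factor 1024)
pointing at the same canonical field, either sar variant would end up
in the same database column in the same unit. The rules themselves
could presumably live in configuration next to mdl.xml rather than in
Java code.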

Jiaqi

On Thu, May 21, 2009 at 11:01 PM, Ariel Rabkin <asrabkin@gmail.com> wrote:
> Howdy.
>
> I agree with your diagnosis -- this is the peril of external
> dependencies.  There was discussion, back in the day, about doing
> something better.  Poking at /proc is certainly one option. Another
> would be finding some apache-licensed library that does this. Sigar
> would fit the bill, but it's GPLed and so we can't link against it.
> Though there was discussion under HADOOP-4959 about a license
> exemption. That might solve our problem.
>
> There's a Java standard approach that does some subset of what we want
> -- http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
>
> What's peculiar about this issue is that right now, the actual Demux
> processors are largely independent of the versions -- those processors
> make assumptions about the syntax of the input, but almost none about
> the semantics. If the data comes in columns with headers, they do
> basically the right thing.  However, when it comes time to do the
> database insert, the column names don't match the ones in mdl.xml, and
> so things start to fail.
>
> It seems a pity to dirty up the currently clean Java code with lots of
> special cases for canonicalizing data formats.  I'm okay with doing some
> sort of parameterization, but I think in a lot of cases we can do
> something very simpleminded and still be okay.  Perhaps as simple as
> "if you see field x in a SystemMetrics record, output field y as
> follows."
>
> On Thu, May 21, 2009 at 10:27 PM, Jiaqi Tan <tanjiaqi@gmail.com> wrote:
>> Hi Ari,
>>
>> I think the real problem here is that sar metrics are being picked up
>> by an Exec adaptor which calls sar and there's no control over which
>> sar gets called (or at least not right now), and sar is ultimately an
>> external dependency which currently is just assumed to be sitting
>> there.
>>
>> Also, sar just directly emits unstructured plain text, so there's no
>> self-describing data format a la some XML which says what the units
>> are, so if sar is changing output units and stuff, then the parser in
>> the Demux needs to take care of that too. Even more generally, even
>> any change at all to sar's output would require an update of the
>> Demux.
>>
>> I think the fundamental problem is that having an Exec adaptor which
>> pulls the unstructured output of an external program and having a
>> Demux processor that makes assumptions about what that output looks
>> like and what it means, makes the whole workflow dependent on
>> something not under the control of Chukwa.
>>
>> I can imagine one way of working around that would be to not use sar
>> and write custom parsers for /proc so that Chukwa is itself aware of
>> what the proc data actually means without having to make assumptions
>> about the output of an external parser; it's reinventing the wheel
>> somewhat but it gives an end-to-end cleaner solution.
>>
>> The other answer would perhaps be the "web services" answer of having
>> a whole standardized way of passing data around in a structured way
>> but then that starts to look like a generalized pub/sub system.
>>
>> But in the meantime maybe the sar version on the system being
>> monitored could be picked up in some way (metadata in the Chunk?) and
>> the various Demux processors dependent on such external programs e.g.
>> IoStat, Df, etc. could be parameterized to handle output from
>> different versions/variants of the source program. Or to be even more
>> general, the Exec adaptor could send along an MD5 hash of the program
>> it's calling, and then you'd have a whole bunch of processors for
>> every possible variant of the program you want to support; that sounds
>> terribly hackish to me but I think that way at least the identity of
>> the external dependency can be identified.
>>
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department
>
