From Jiaqi Tan <>
Subject Re: units in MDL and HICC
Date Fri, 22 May 2009 18:01:05 GMT
I think it's fine for the semantics to remain in the Demux. In that
case, perhaps the Demux processors could look at the sar-generated column
labels to determine what the units are, and standardize the units
of the output?
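
To make the idea concrete, here is a rough sketch of a label-driven
normalizer. It is not existing Chukwa code: the class name, the rule
table, and the canonical field names are all made up for illustration,
the sar labels are just examples, and the kB = 1024 bytes factor is an
assumption that should be checked against the sar version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Chukwa: maps sar column labels to
// canonical field names and units.
class SarUnitNormalizer {

    // A rule says: emit this canonical field, scaled by this multiplier.
    static final class Rule {
        final String canonicalName;
        final double multiplier;

        Rule(String canonicalName, double multiplier) {
            this.canonicalName = canonicalName;
            this.multiplier = multiplier;
        }
    }

    // Illustrative labels only; the kB = 1024 bytes factor is an assumption.
    private static final Map<String, Rule> RULES = new LinkedHashMap<>();
    static {
        RULES.put("rxbyt/s", new Rule("rx_bytes_per_sec", 1.0));
        RULES.put("rxkB/s", new Rule("rx_bytes_per_sec", 1024.0));
        RULES.put("txbyt/s", new Rule("tx_bytes_per_sec", 1.0));
        RULES.put("txkB/s", new Rule("tx_bytes_per_sec", 1024.0));
    }

    // Pairs each header label with its value and applies the matching rule.
    // Labels without a rule pass through unchanged; non-numeric columns
    // (such as an interface name) are skipped in this sketch.
    static Map<String, Double> normalize(String[] labels, String[] values) {
        Map<String, Double> out = new LinkedHashMap<>();
        int n = Math.min(labels.length, values.length);
        for (int i = 0; i < n; i++) {
            double value;
            try {
                value = Double.parseDouble(values[i]);
            } catch (NumberFormatException e) {
                continue;
            }
            Rule rule = RULES.get(labels[i]);
            if (rule == null) {
                out.put(labels[i], value);
            } else {
                out.put(rule.canonicalName, value * rule.multiplier);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"IFACE", "rxpck/s", "rxkB/s", "txbyt/s"};
        String[] values = {"eth0", "12.5", "2.0", "100.0"};
        System.out.println(normalize(labels, values));
    }
}
```

With something like this, two sar variants that report the same metric
under different labels and units would end up in the same output field,
so the database column names would not depend on the sar version.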


On Fri, May 22, 2009 at 10:56 AM, Eric Yang <> wrote:
> Many solutions have been suggested in the past year, but no single one
> fits all cases.  Most of the promising libraries are in the GPL camp.
> Unfortunately, we can't use those.  The closest thing in the apache camp
> is the ganglia metrics library.  There are 2 bugs that they need to fix in
> the metrics library.  First, it uses float to store all values, hence the
> accuracy becomes questionable for large values.  Second, one of the metrics
> only includes the value from the first device.  I forget whether it's the
> network device or the disk.  I dropped integration of the ganglia metrics
> library after discovering those bugs.  However, we might want to revisit
> this if it has been improved.  For the windows camp, we may need a
> completely different solution for measuring system metrics.
> I believe all parsing logic and data schematics should happen in the demux
> parser rather than MDL.  Personally, I believe MDL should have zero
> configuration.  MDL's purpose is to load data into the database by knowing
> RecordType=Table, Key=Column, Value=Value.  This will definitely reduce the
> number of places where we maintain data transformation.  The data
> schematics should live in the demux parser and database_create_table.sql
> only.  What do you guys think?
> Regards,
> Eric
> On 5/21/09 11:01 PM, "Ariel Rabkin" <> wrote:
>> Howdy.
>> I agree with your diagnosis -- this is the peril of external
>> dependencies.  There was discussion, back in the day, about doing
>> something better.  Poking at /proc is certainly one option. Another
>> would be finding some apache-licensed library that does this. Sigar
>> would fit the bill, but it's GPLed and so we can't link against it.
>> Though there was discussion under HADOOP-4959 about a license
>> exemption. That might solve our problem.
>> There's a Java standard approach that does some subset of what we want
>> --
>> ent/UnixOperatingSystemMXBean.html
>> What's peculiar about this issue is that right now, the actual Demux
>> processors are largely independent of the versions -- those processors
>> make assumptions about the syntax of the input, but almost none about
>> the semantics. If the data comes in columns with headers, they do
>> basically the right thing.  However, when it comes time to do the
>> database insert, the column names don't match the ones in mdl.xml, and
>> so things start to fail.
>> It seems a pity to dirty up the currently clean Java code with lots of
>> special cases for canonicalizing data formats.  I'm okay with doing some
>> sort of parameterization, but I think in a lot of cases we can do
>> something very simpleminded and still be okay.  Perhaps as simple as
>> "if you see field x in a SystemMetrics record, output field y as
>> follows."
>> On Thu, May 21, 2009 at 10:27 PM, Jiaqi Tan <> wrote:
>>> Hi Ari,
>>> I think the real problem here is that sar metrics are being picked up
>>> by an Exec adaptor which calls sar and there's no control over which
>>> sar gets called (or at least not right now), and sar is ultimately an
>>> external dependency which currently is just assumed to be sitting
>>> there.
>>> Also, sar just directly emits unstructured plain text, so there's no
>>> self-describing data format a la some XML which says what the units
>>> are. So if sar changes its output units, the parser in the Demux
>>> needs to take care of that too. More generally, any change at all
>>> to sar's output would require an update of the Demux.
>>> I think the fundamental problem is that having an Exec adaptor which
>>> pulls the unstructured output of an external program and having a
>>> Demux processor that makes assumptions about what that output looks
>>> like and what it means, makes the whole workflow dependent on
>>> something not under the control of Chukwa.
>>> I can imagine one way of working around that would be to not use sar
>>> and write custom parsers for /proc so that Chukwa is itself aware of
>>> what the proc data actually means without having to make assumptions
>>> about the output of an external parser; it's reinventing the wheel
>>> somewhat but it gives an end-to-end cleaner solution.
>>> The other answer would perhaps be the "web services" answer of having
>>> a whole standardized way of passing data around in a structured way
>>> but then that starts to look like a generalized pub/sub system.
>>> But in the meantime maybe the sar version on the system being
>>> monitored could be picked up in some way (metadata in the Chunk?) and
>>> the various Demux processors dependent on such external programs e.g.
>>> IoStat, Df, etc. could be parameterized to handle output from
>>> different versions/variants of the source program. Or to be even more
>>> general, the Exec adaptor could send along an MD5 hash of the program
>>> it's calling, and then you'd have a whole bunch of processors for
>>> every possible variant of the program you want to support; that sounds
>>> terribly hackish to me, but I think that way at least the external
>>> dependency can be identified.
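
For reference, the MD5 idea at the end of the quoted message could look
roughly like the sketch below. Everything here is hypothetical: the
registry class, the variant names, and the in-memory byte arrays that
stand in for the contents of the external program's binary. Only
java.security.MessageDigest is a real API.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry, not part of Chukwa: maps the MD5 hash of an
// external program to the name of the parser variant that understands it.
class ProgramVariantRegistry {

    private final Map<String, String> variantByHash = new HashMap<>();

    // Returns the MD5 digest of the given bytes as 32 lowercase hex digits.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return String.format("%032x", new BigInteger(1, md.digest(data)));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Records which parser variant handles a program with these contents.
    void register(byte[] programBytes, String variantName) {
        variantByHash.put(md5Hex(programBytes), variantName);
    }

    // Looks up the variant for a hash sent along with the data.
    String lookup(String hash) {
        return variantByHash.getOrDefault(hash, "unknown");
    }

    public static void main(String[] args) {
        // Stand-ins for two different sar binaries; a real adaptor would
        // hash the bytes of the executable it actually invokes.
        byte[] oldSar = "sar-variant-a".getBytes(StandardCharsets.UTF_8);
        byte[] newSar = "sar-variant-b".getBytes(StandardCharsets.UTF_8);

        ProgramVariantRegistry registry = new ProgramVariantRegistry();
        registry.register(oldSar, "SarParserA");
        registry.register(newSar, "SarParserB");

        System.out.println(registry.lookup(md5Hex(newSar)));
        System.out.println(registry.lookup(md5Hex(new byte[0])));
    }
}
```

The hash only identifies the program; someone still has to register a
parser for every variant, which is the maintenance cost noted above.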

