hadoop-common-user mailing list archives

From Ariel Rabkin <asrab...@gmail.com>
Subject Re: HDFS as a logfile ??
Date Tue, 14 Apr 2009 19:53:30 GMT
Everything gets dumped into the same files.

We don't assume anything at all about the format of the input data; it
gets dumped into Hadoop sequence files, tagged with some metadata to
say what machine and app it came from, and where it was in the
original stream.
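To make the tagging idea concrete, here is a minimal sketch (hypothetical, not Chukwa's actual API or on-disk format): each entry carries metadata naming the source machine, the application, and its offset in the original stream, so logs of different formats can share one file and still be demultiplexed later.

```python
from dataclasses import dataclass

@dataclass
class TaggedRecord:
    machine: str    # host the entry came from
    app: str        # application that produced it
    offset: int     # position in the original log stream
    payload: bytes  # raw log line; the format is left opaque

def tag(machine, app, entries):
    """Wrap raw log lines from one (machine, app) stream with metadata."""
    return [TaggedRecord(machine, app, off, line)
            for off, line in enumerate(entries)]

# Entries from different apps and machines are merged into one file,
# then filtered back out by their metadata when needed.
merged = tag("X", "appA", [b"a1", b"a2"]) + tag("Z", "appB", [b"b1"])
only_appA = [r for r in merged if r.app == "appA"]
```

In the real system the merged records land in Hadoop sequence files; the point of the sketch is only that the payload format never needs to be interpreted by the collection path.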

There is a slight penalty from logging to local disk first. In practice, you
often want a local copy anyway, both for redundancy and because it's
much more convenient for human inspection. Having a separate
collector process is indeed inelegant. However, HDFS copes badly with
many small files, so that pushes you to merge entries across either
hosts or time partitions. And since HDFS doesn't have a flush(),
having one log per source means that log entries don't become visible
quickly enough. Hence, collectors.
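As a toy illustration of that trade-off (hypothetical code, not Chukwa's): many hosts each produce a trickle of entries, and a collector merges them into one output per time window, rather than leaving one tiny HDFS file per host.

```python
from collections import defaultdict

def collect(entries, window=60):
    """Group (timestamp, host, line) entries into per-window batches.

    Each batch would become a single merged file, instead of one
    small file per source host.
    """
    batches = defaultdict(list)
    for ts, host, line in entries:
        batches[ts // window].append((host, line))
    return dict(batches)

entries = [(5, "X", "a1"), (61, "Y", "b1"), (10, "Z", "c1")]
batches = collect(entries)
# Window 0 holds the entries from hosts X and Z; window 1 holds Y's.
```

The window size trades file count against visibility latency: larger windows mean fewer files but a longer wait before entries appear in HDFS.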

It isn't gorgeous, but it seems to work fine in practice.

On Mon, Apr 13, 2009 at 8:01 AM, Ricky Ho <rho@adobe.com> wrote:
> Ari, thanks for your note.
> I'd like to understand more about how Chukwa groups log entries. If I have appA running on machines X and Y, and appB running on machines Y and Z, each of them calling the Chukwa log API:
> Do all entries go into the same HDFS file? Or four separate HDFS files based on the app/machine combination?
> If the answer to the first question is "yes", what happens if appA and appB have different log-entry formats?
> If the answer to the second question is "yes", are all these HDFS files cut at the same time boundary?
> It looks like in Chukwa the application first logs to a daemon, which buffer-writes the log entries into a local file, and a separate process ships the data to a remote collector daemon, which issues the actual HDFS write. I see the following overhead:
> 1) The extra write to local disk and shipping the data over to the collector. If HDFS supported append, the application could go directly to HDFS.
> 2) The centralized collector creates a bottleneck in the otherwise perfectly parallel HDFS architecture.
> Am I missing something here?

Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department
