chukwa-user mailing list archives

From Ariel Rabkin <>
Subject Re: Trying to determine if Chukwa is what I need
Date Sat, 28 Jun 2014 16:22:43 GMT
 Chukwa should be a great fit for this use case. If you use the
DirTailingAdaptor, it should send every file. If none of the files
will get very big, I would use the FileAdaptor -- this sends each file
as one Chunk, so it'll be stored as consecutive bytes in HDFS, which
is the most convenient form for analysis. This relies on being able to
buffer the full contents in memory, so they all get written as a

If your files might be big -- many megabytes -- you might hit an
internal buffer-size limit. Those limits are configurable, so you can
raise them to fit your needs.

If your files are big enough that you worry about RAM consumption
during the copy, you want to use an adaptor that breaks the files into
smaller chunks before sending. You probably should subclass
LWFTAdaptor to specify the semantics you want.
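
As a rough illustration of the chunking idea above, here is a minimal
Python sketch that reads a file in bounded pieces so that memory use
stays at roughly one piece at a time. This is not Chukwa code and not
the LWFTAdaptor API; the names iter_chunks and chunk_size are
hypothetical and exist only for this example.

```python
from typing import Iterator, Tuple


def iter_chunks(path: str, chunk_size: int = 64 * 1024) -> Iterator[Tuple[int, bytes]]:
    """Yield (end_offset, data) pieces of at most chunk_size bytes.

    Hypothetical helper for illustration only; it is not part of Chukwa.
    end_offset is the byte position just past the piece, so a receiver
    could check ordering and reassemble the file.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)  # never holds more than one piece
            if not data:               # empty read means end of file
                break
            offset += len(data)
            yield offset, data
```

The trade-off is the one described above: a file sent this way arrives
as several pieces rather than as one Chunk, so downstream code has to
be prepared to stitch them back together.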

There are a couple of ugly cases to think about, regardless of which
adaptor you use.

- What if a file is updated after being placed?
- What if a file is deleted and then a new file with the same name is created?
- What if a file is placed and then immediately moved or deleted,
before Chukwa finishes grabbing it?

I'm not sure which of those cases will come up for you or what you
want to have happen. You might want to do some testing to check that
you get the behavior you want.
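
One way to run that kind of test is to snapshot a file's state before
and after an event and label what changed. The Python sketch below
does this independently of Chukwa. FileState, snapshot, and classify
are hypothetical names for this example, and the inode comparison is
only a heuristic: some filesystems reuse inode numbers, so a delete
followed by a re-create can look like an in-place update.

```python
import os
from typing import NamedTuple, Optional


class FileState(NamedTuple):
    inode: int
    size: int
    mtime_ns: int


def snapshot(path: str) -> Optional[FileState]:
    """Return the file's current state, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return FileState(st.st_ino, st.st_size, st.st_mtime_ns)


def classify(before: Optional[FileState], after: Optional[FileState]) -> str:
    """Label the change between two snapshots of the same path."""
    if before is None and after is None:
        return "absent"
    if before is None:
        return "created"
    if after is None:
        return "removed"      # moved or deleted after being placed
    if before.inode != after.inode:
        return "replaced"     # same name, different file (heuristic)
    if before.size != after.size or before.mtime_ns != after.mtime_ns:
        return "updated"      # modified in place
    return "unchanged"
```

Taking a snapshot when a file is first seen and another when the copy
finishes shows which of the three cases occurred during the transfer,
which you can then compare with what actually landed in HDFS.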

I believe Eric's comments here are being overly cautious. Depending on
your file sizes and on what kind of downstream processing you need,
you might not get optimal performance from MapReduce jobs run against the
output in HDFS, but if you're just interested in storage, it should be

On Sat, Jun 28, 2014 at 11:52 AM, Eric Yang <> wrote:
> Hi Jon,
> Chukwa can take files from a directory and ship them to HDFS, with some
> limitations. First, the data needs to be the same type within a
> directory. Second, Chukwa does not ship identical files to HDFS. It
> extracts files into records before the data is shipped to HDFS or HBase.
> The files written to HDFS are optimized for MapReduce jobs because the
> files are closed at fixed intervals. The assumption is that the
> collector creates files of similar size, so that MapReduce tasks take
> roughly the same amount of time and parallelize well. Chukwa is designed
> to ship records from log files. It may not perform well when shipping
> Word documents or images. Flume is designed to ship original files.
> Therefore, if you have a requirement to ship original files and not
> records, Flume may be the better choice for that problem.
> For testing purposes, tailing files in a directory can be achieved by
> sending this command to the Chukwa agent on port 9093:
> add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0
> This will spawn multiple CharFileTailingAdaptorUTF8 instances to ship
> all log files within the directory. If a log file is removed, its
> adaptor is automatically shut down.
> Hope this helps.
> regards,
> Eric
> On Fri, Jun 27, 2014 at 1:38 PM, Jonathan Mervine <> wrote:
>> Hey, I came across Chukwa from a blog post, and it looks like there is
>> a real effort in collecting data from multiple sources and pumping it
>> into HDFS.
>> I was looking at this pdf from the wiki
>> And the chart in the middle seems to imply that two of the agents you
>> can have are one that takes in streaming data and one that is
>> associated with Log4J and works with log files in particular.
>> I’m pretty new to Hadoop so I’m trying to learn a lot about it in a
>> short time, but what I’m looking for is some kind of system that will
>> monitor a directory somewhere for files being placed there. I don’t
>> know what kind of files they could be: csv’s, psv’s, doc’s, txt’s, and
>> many others. A later stage would be formatting, parsing and analyzing,
>> but for now I just want to be able to detect when a file is placed
>> there. After a file has been detected, it should then be sent on its
>> way to be placed into HDFS. This should be a completely autonomous and
>> automatic process (or as much as possible).
>> Is this something Chukwa can help me with? If not, do you know of any
>> system that might do what I want? I’ve read a little about Oozie,
>> Falcon, Flume, Scribe, and a couple of other projects, but I don’t
>> think I’ve found what I’m looking for. Also, any information you could
>> provide to help me on my way or clear up any misunderstanding I may
>> have would be great!
>> Thanks

Ari Rabkin
Princeton Computer Science Department
