flume-user mailing list archives

From Sagar Mehta <sagarme...@gmail.com>
Subject Re: Question about gzip compression when using Flume Ng
Date Tue, 15 Jan 2013 02:36:18 GMT
Bhaskar,

Your suggestion worked like magic!! I don't believe my eyes!!

hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget
/ngpipes-raw-logs/2013-01-15/0200/collector102.ngpipes.sac.ngmoco.com.1358216630511.gz
.

hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip
collector102.ngpipes.sac.ngmoco.com.1358216630511.gz
hadoop@jobtracker301:/home/hadoop/sagar/debug$ ls -ltrh
total 34M
-rw-r--r-- 1 hadoop hadoop 34M 2013-01-15 02:29
collector102.ngpipes.sac.ngmoco.com.1358216630511

The file decompresses fine!!

This is what I did:

   - Downloaded the latest Cloudera packages from
   https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
   - It installed Hadoop under /usr/lib, so I pointed HADOOP_HOME at
   /usr/lib/hadoop and restarted Flume!! (A minimal sketch of the change is
   below.)
   - That's it!! - time to party :)
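
For anyone landing here later, a minimal sketch of what that environment change amounts to, assuming the default CDH package layout and the stock flume-ng launcher script; the agent name and config paths are placeholders, not the actual ones from this setup:

# point Flume at the full Hadoop install so it can pull in the HDFS and compression jars
export HADOOP_HOME=/usr/lib/hadoop

# restart the agent so the new classpath takes effect
# (placeholder agent name and config locations)
flume-ng agent --conf /etc/flume-ng/conf \
    --conf-file /etc/flume-ng/conf/flume.conf \
    --name collector102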

Thank you so much guys for your prompt replies!!

Sagar


On Mon, Jan 14, 2013 at 5:25 PM, Bhaskar V. Karambelkar <bhaskarvk@gmail.com> wrote:

> Sagar,
> You're better off downloading and unzipping CDH3u5 or CDH4 somewhere and
> pointing the HADOOP_HOME env. variable at the base directory.
> That way you won't have to worry about which jar files are needed and
> which are not.
> Flume will automatically add all the JARs it needs from the Hadoop
> installation.
>
> regards
> Bhaskar
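
As a quick sanity check on the approach Bhaskar describes: the flume-ng launcher generally builds its classpath from HADOOP_HOME (or a hadoop binary on the PATH), so verifying what that install exposes is usually enough. A small hedged sketch, with illustrative paths:

# HADOOP_HOME should point at a full Hadoop install, not a lone jar
echo $HADOOP_HOME
ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar | head

# the classpath reported by that install is roughly what Flume will pick up
$HADOOP_HOME/bin/hadoop classpath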
>
>
> On Mon, Jan 14, 2013 at 7:43 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>
>> OK, so I dropped the new hadoop-core jar into /opt/flume/lib [I got some
>> errors about the Guava dependency, so I put in that jar too]
>>
>> smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e
>> "guava"
>> -rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
>> -rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50
>> hadoop-core-0.20.2-cdh3u5.jar
>>
>> Now I don't even see the file being created in HDFS, and the Flume log is
>> happily talking about housekeeping for the file channel checkpoints,
>> updating pointers, et al.
>>
>> Below is a tail of the Flume log:
>>
>> hadoop@collector102:/data/flume_log$ tail -10 flume.log
>> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO
>>  org.apache.flume.channel.file.Log - Updated checkpoint for file:
>> /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID:
>> 1358209947324
>> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO
>>  org.apache.flume.channel.file.LogFile - Closing RandomReader
>> /data/flume_data/channel2/data/log-34
>> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO
>>  org.apache.flume.channel.file.Log - Updated checkpoint for file:
>> /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID:
>> 1358209947323
>> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO
>>  org.apache.flume.channel.file.LogFile - Closing RandomReader
>> /data/flume_data/channel1/data/log-34
>> 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO
>>  org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta
>> currentPosition = 18577138, logWriteOrderID = 1358209947324
>> 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO
>>  org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta
>> currentPosition = 18577138, logWriteOrderID = 1358209947323
>> 2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO
>>  org.apache.flume.channel.file.LogFile - Closing RandomReader
>> /data/flume_data/channel1/data/log-35
>> 2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO
>>  org.apache.flume.channel.file.LogFile - Closing RandomReader
>> /data/flume_data/channel2/data/log-35
>> 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO
>>  org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta
>> currentPosition = 217919486, logWriteOrderID = 1358209947323
>> 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO
>>  org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta
>> currentPosition = 217919486, logWriteOrderID = 1358209947324
>>
>> Sagar
>>
>>
>> On Mon, Jan 14, 2013 at 3:38 PM, Brock Noland <brock@cloudera.com> wrote:
>>
>>> Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
>>> I would upgrade to CDH3u5 or CDH 4.1.2.
>>>
>>> > On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>>> > About the bz2 suggestion, we have a ton of downstream jobs that assume
>>> > gzip-compressed files - so it is better to stick with gzip.
>>> >
>>> > Plan B for us is to have an Oozie step gzip-compress the logs before
>>> > proceeding with the downstream Hadoop jobs - but that looks like a hack
>>> > to me!!
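
For completeness, a hedged sketch of what that fallback step would boil down to (an Oozie shell action would just wrap something like this); every path here is made up, and it assumes the collectors would then land uncompressed text:

# stage one 5-minute window locally, compress it, and push it back to a new directory
hadoop fs -get /ngpipes-raw-logs/2013-01-14/2200 /tmp/recompress-2200
gzip /tmp/recompress-2200/*
hadoop fs -put /tmp/recompress-2200 /ngpipes-raw-logs/2013-01-14/2200-gz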
>>> >
>>> > Sagar
>>> >
>>> >
>>> > On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <sagarmehta@gmail.com> wrote:
>>> >>
>>> >> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat
>>> >> collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>>> >>
>>> >> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz:
>>> >> decompression OK, trailing garbage ignored
>>> >> 100
>>> >>
>>> >> This should be about 50,000 events for the 5 min window!!
>>> >>
>>> >> Sagar
>>> >>
>>> >> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <brock@cloudera.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> Can you try:  zcat file > output
>>> >>>
>>> >>> I think what is happening is that, because of the flush, the output
>>> >>> file is actually several concatenated gz files.
>>> >>>
>>> >>> Brock
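
Brock's explanation is easy to reproduce outside Flume: concatenated gzip members are legal, and zcat walks all of them, so redirecting to a file recovers everything that is really in the archive. A standalone illustration (nothing here comes from the actual collector output):

printf 'first\n'  | gzip >  multi.gz   # member 1
printf 'second\n' | gzip >> multi.gz   # member 2, appended as a second gzip stream
zcat multi.gz                          # prints both lines
zcat multi.gz > output                 # same data, flattened into one plain file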
>>> >>>
>>> >>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <sagarmehta@gmail.com>
>>> >>> wrote:
>>> >>> > Yeah, I have tried the text write format in vain before, but
>>> >>> > nevertheless gave it another try!! Below is the latest file - still
>>> >>> > the same thing.
>>> >>> >
>>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>>> >>> > Mon Jan 14 23:02:07 UTC 2013
>>> >>> >
>>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls
>>> >>> >
>>> >>> >
>>> >>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >>> > Found 1 items
>>> >>> > -rw-r--r--   3 hadoop supergroup    4798117 2013-01-14 22:55
>>> >>> >
>>> >>> >
>>> >>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >>> >
>>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget
>>> >>> >
>>> >>> >
>>> >>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >>> > .
>>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip
>>> >>> > collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >>> >
>>> >>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz:
>>> >>> > decompression
>>> >>> > OK, trailing garbage ignored
>>> >>> >
>>> >>> > Interestingly enough, the gzip page says it is a harmless warning -
>>> >>> > http://www.gzip.org/#faq8
>>> >>> >
>>> >>> > However, I'm losing events on decompression, so I cannot afford to
>>> >>> > ignore this warning. The gzip page gives an example about magnetic
>>> >>> > tape - there is an analogy to an HDFS block here, since the file is
>>> >>> > initially stored in HDFS before I pull it out onto the local
>>> >>> > filesystem.
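
One rough way to gauge how much of such a file consists of valid gzip members is to count candidate member headers (the magic bytes 1f 8b 08). This is only a heuristic - the byte sequence can occur by chance inside compressed data - but it gives a feel for how many concatenated streams are present:

# count byte offsets that look like the start of a gzip member (heuristic, may overcount)
grep -abo $'\x1f\x8b\x08' collector102.ngpipes.sac.ngmoco.com.1358204141600.gz | wc -l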
>>> >>> >
>>> >>> > Sagar
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson
>>> >>> > <cwoodson.dev@gmail.com>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>>> >>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
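
For context, those writeFormat lines sit next to the compression settings on the same HDFS sink. A minimal sketch of the relevant block, mirroring the sink names from the thread (the rollInterval value is an assumed example, and the Flume documentation lists the value as "Text"):

collector102.sinks.sink1.type = hdfs
collector102.sinks.sink1.hdfs.fileType = CompressedStream
collector102.sinks.sink1.hdfs.codeC = gzip
collector102.sinks.sink1.hdfs.writeFormat = Text
# assumed example: roll a new file every 5 minutes, matching the windows discussed above
collector102.sinks.sink1.hdfs.rollInterval = 300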
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Apache MRUnit - Unit testing MapReduce -
>>> >>> http://incubator.apache.org/mrunit/
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce -
>>> http://incubator.apache.org/mrunit/
>>>
>>
>>
>
