flume-user mailing list archives

From Jimmy <jimmyj...@gmail.com>
Subject Re: hdfs.fileType = CompressedStream
Date Thu, 30 Jan 2014 22:30:10 GMT
Snappy is not splittable either; combining it with sequence files gives an
identical result - the sink bulk-dumps the whole file into HDFS.
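
For reference, a minimal sketch of that SequenceFile-plus-Snappy sink config;
the agent, sink, and channel names and the path are invented for illustration,
only the hdfs.* keys below the path are the actual settings under discussion:

    # Hypothetical agent/sink/channel names; swap in your own.
    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = memCh
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%Y%m%d
    # Needed for the %Y%m%d escapes unless events carry a timestamp header.
    agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
    agent.sinks.hdfsSink.hdfs.codeC = snappy
    agent.sinks.hdfsSink.hdfs.writeFormat = Text
    agent.sinks.hdfsSink.hdfs.batchSize = 100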

I feel a bit uneasy keeping a 120MB (almost 1GB uncompressed) file open for
one hour...
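
One way to take the edge off that, assuming the stock roll settings of the
HDFS sink, is to roll on size as well as time, so an open file is bounded by
whichever limit hits first. A sketch (same invented sink name as above; note
that rollSize appears to count incoming event bytes, not compressed bytes on
disk):

    # Roll at the latest after one hour...
    agent.sinks.hdfsSink.hdfs.rollInterval = 3600
    # ...or after ~128MB of event data, whichever comes first.
    agent.sinks.hdfsSink.hdfs.rollSize = 134217728
    # Disable event-count-based rolling.
    agent.sinks.hdfsSink.hdfs.rollCount = 0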



On Thu, Jan 30, 2014 at 1:59 PM, Jeff Lord <jlord@cloudera.com> wrote:

> You are using gzip, so the files won't be splittable.
> You may be better off using Snappy and sequence files.
>
>
> On Thu, Jan 30, 2014 at 10:51 AM, Jimmy <jimmyjack@gmail.com> wrote:
>
>> I am running a few tests and would like to confirm whether this is true...
>>
>> hdfs.codeC = gzip
>> hdfs.fileType = CompressedStream
>> hdfs.writeFormat = Text
>> hdfs.batchSize = 100
>>
>>
>> Now let's assume I have a large number of transactions and I roll the file
>> every 10 minutes.
>>
>> It seems the .tmp file stays at 0 bytes and flushes all at once after 10
>> minutes, whereas if I don't use compression the file grows as data is
>> written to HDFS.
>>
>> Is this correct?
>>
>> Do you see any drawback in using CompressedStream with very large files? In
>> my case a 120MB compressed file (roughly the HDFS block size) is 10x smaller
>> than the uncompressed data.
>>
>>
>
