hadoop-common-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Design question
Date Thu, 26 Apr 2012 14:43:30 GMT
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> I just wanted to check how people design their storage directories for
> data that is sent to the system continuously. For example, for a given
> functionality we get a data feed continuously written to a SequenceFile, which
> is then converted to a more structured format using MapReduce and stored in
> tab-separated files. For such a continuous feed, what's the best way to
> organize the directories and their names? Should it be based just on the
> timestamp, or on something better that helps in organizing the data?
> Second part of the question: is it better to store the output in sequence
> files so that we can take advantage of per-record compression? This seems to
> be required, since gzip/snappy compression of an entire file would launch
> only one map task.
> And the last question: when compressing a flat file, should it first be
> split into multiple files so that we get multiple mappers if we need to run
> another job on this file? LZO is another alternative, but it requires
> additional configuration. Is it preferred?
> Any articles or suggestions would be very helpful.
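
To make the timestamp-based option in the first question concrete, here is a minimal sketch of one possible layout: one directory per feed, per processing stage, per hour. Everything in it is an assumption for illustration only (the `partition_dir` helper, the `/data` base path, the `clicks` feed name, and the `raw`/`tsv` stage names); none of it comes from Hadoop or from the original message.

```python
from datetime import datetime, timezone


def partition_dir(base, feed, stage, ts):
    """Return an hourly, time-partitioned directory for one feed and stage.

    Hypothetical helper: the base/feed/stage/yyyy/MM/dd/HH layout is an
    assumed convention, not something Hadoop requires.
    """
    # Normalize to UTC so that every writer agrees on which hour a record is in.
    ts = ts.astimezone(timezone.utc)
    return "{}/{}/{}/{}".format(base, feed, stage, ts.strftime("%Y/%m/%d/%H"))


# Example timestamp (assumed); both stages share the same time partition.
ts = datetime(2012, 4, 23, 15, 27, tzinfo=timezone.utc)
raw_dir = partition_dir("/data", "clicks", "raw", ts)  # incoming sequence files
tsv_dir = partition_dir("/data", "clicks", "tsv", ts)  # MapReduce output
print(raw_dir)
print(tsv_dir)
```

With a layout like this, a job that reprocesses a single hour or day can take one directory (or a glob over a few) as its input path, and the raw and converted data for the same period stay easy to match up.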
