hadoop-common-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Merging small files
Date Sun, 20 Jul 2014 16:32:11 GMT
As for why it isn't appropriate to discuss vendor-specific topics at length
on a vendor-neutral Apache mailing list, check out this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E

You can always discuss vendor-specific issues on the vendors' respective
mailing lists.

As for merging files: yes, one can use HBase, but keep in mind that you are
adding the development and maintenance overhead of another store (i.e.
HBase). If your use case can be satisfied with HDFS alone, why not keep it
simple? Given the requirements that the OP provided, I think the SequenceFile
format should work, as I suggested initially. Of course, if the requirements
get too complicated, one might try out HBase.
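
To make the merging idea concrete, here is a minimal sketch of packing many
small invoice files into one container file as (file name, contents) records.
It is an illustration only: it uses plain Java I/O, not the Hadoop API, and
the class name SmallFilePacker and its record layout are hypothetical. On a
real cluster the same idea would be expressed with
org.apache.hadoop.io.SequenceFile, for example with the file name as the key
and the file bytes as the value.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: packs many small files into one key/value container file. */
public class SmallFilePacker {

    /** Writes each (name, bytes) pair as a length-prefixed record into a single file. */
    public static void pack(Map<String, byte[]> smallFiles, Path container) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(container)))) {
            for (Map.Entry<String, byte[]> entry : smallFiles.entrySet()) {
                byte[] key = entry.getKey().getBytes(StandardCharsets.UTF_8);
                byte[] value = entry.getValue();
                out.writeInt(key.length);   // key length, then key bytes
                out.write(key);
                out.writeInt(value.length); // value length, then value bytes
                out.write(value);
            }
        }
    }

    /** Reads the records back in the order they were written. */
    public static Map<String, byte[]> unpack(Path container) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(container)))) {
            while (true) {
                int keyLength;
                try {
                    keyLength = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no further records
                }
                byte[] key = new byte[keyLength];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                result.put(new String(key, StandardCharsets.UTF_8), value);
            }
        }
        return result;
    }
}
```

The point of the layout is that thousands of 4 KB invoices become one large
file, so the file system tracks a single file instead of one entry per
invoice, while each invoice stays individually addressable by its key.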

Regards,
Shahab


On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   It isn’t? I don’t wanna hijack the thread or anything, but it seems to
> me that MapR is an implementation of Hadoop and this is a great place to
> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>
> A little bit more on topic: Every single thing I read or watch about
> Hadoop says that many small files is a bad idea and that you should merge
> them into larger files. I’ll take this a step further. If your invoice data
> is so small, perhaps Hadoop isn’t the proper solution to whatever it is you
> are trying to do and a more traditional RDBMS approach would be more
> appropriate. Someone suggested HBase and I was going to suggest maybe one
> of the other NoSQL databases, however, I remember that Eddie Satterly of
> Splunk says that financial data is the ONE use case where a traditional
> approach is more appropriate. You can watch his talk here:
>
> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
>  *From:* Kilaru, Sambaiah <Sambaiah_Kilaru@intuit.com>
> *Sent:* Sunday, July 20, 2014 3:47 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Merging small files
>
>  This is not the place to discuss the merits or demerits of MapR. Small
> files screw things up very badly with MapR.
> Small files go into one container (to fill up 256 MB or whatever the
> container size is) and, with locality, most
> of the mappers go to three datanodes.
>
> You should be looking into sequence file format.
>
> Thanks,
> Sam
>
> From: "M. C. Srivas" <mcsrivas@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Sunday, July 20, 2014 at 8:01 AM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Merging small files
>
>  You should look at MapR .... a few hundred billion small files is
> absolutely no problem. (Disclosure: I work for MapR)
>
>
> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
> raoshashidhar123@gmail.com> wrote:
>
>>   Hi ,
>>
>> Has anybody worked on a retail use case? My production Hadoop cluster
>> block size is 256 MB, but the retail invoice data we have to process is
>> small: each invoice is merely, let's say, 4 KB. Do we merge the invoice
>> data to make one large file of, say, 1 GB? What is the best practice in
>> this scenario?
>>
>>
>> Regards
>> Shashi
>>
>
>
