hadoop-common-user mailing list archives

From Shashidhar Rao <raoshashidhar...@gmail.com>
Subject Re: Merging small files
Date Sun, 20 Jul 2014 17:47:43 GMT
Spring Batch is used to process the files, which arrive in EDI, CSV, and XML
format, and store them in Oracle after processing, but this is for a very
small division. Imagine invoices generated by roughly 5 million customers
every week from all stores plus online purchases. The time to process
such massive data would not be acceptable, even though Oracle would be a
good choice, as Adaryl Bob has suggested. Each invoice is not even 10 KB, and
we have no choice but to use Hadoop, but the input files need further
processing just to make Hadoop happy.
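
To illustrate the kind of pre-processing meant here, below is a minimal sketch of packing many small invoice files into one container of (file name, contents) records. It only mirrors the key/value idea behind the SequenceFile suggestion made earlier in the thread; the length-prefixed layout, the function names `pack_records` and `unpack_records`, and the sample invoices are all hypothetical, and this is not Hadoop's actual SequenceFile format or API.

```python
import io
import struct

# Hypothetical record layout (NOT Hadoop's SequenceFile format):
# 4-byte big-endian key length, key bytes, 4-byte value length, value bytes.
_LEN = struct.Struct(">I")


def pack_records(records, out):
    """Append (name, payload) pairs to a binary stream; return how many were written."""
    count = 0
    for name, payload in records:
        key = name.encode("utf-8")
        out.write(_LEN.pack(len(key)))
        out.write(key)
        out.write(_LEN.pack(len(payload)))
        out.write(payload)
        count += 1
    return count


def unpack_records(src):
    """Yield (name, payload) pairs from a stream written by pack_records."""
    while True:
        header = src.read(_LEN.size)
        if not header:
            return  # clean end of stream
        if len(header) < _LEN.size:
            raise ValueError("truncated record header")
        (key_len,) = _LEN.unpack(header)
        key = src.read(key_len).decode("utf-8")
        (value_len,) = _LEN.unpack(src.read(_LEN.size))
        yield key, src.read(value_len)


# Made-up sample invoices standing in for the small EDI/CSV/XML input files.
invoices = [
    ("invoice-0001.xml", b"<invoice id='1'/>"),
    ("invoice-0002.csv", b"2,19.99,USD"),
]

buf = io.BytesIO()  # an in-memory stand-in for the merged file
written = pack_records(invoices, buf)
buf.seek(0)
restored = list(unpack_records(buf))
```

Keeping the original file name as the key means each invoice can still be identified after merging, which is the same reason a key/value container is usually suggested for this problem.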

On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   “Even if we kept the discussion to the mailing list's technical Hadoop
> usage focus, any company/organization looking to use a distro is going to
> have to consider the costs, support, platform, partner ecosystem, market
> share, company strategy, etc.”
> Yeah good point.
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>  *From:* Shahab Yunus <shahab.yunus@gmail.com>
> *Sent:* Sunday, July 20, 2014 11:32 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Merging small files
> Why isn't it appropriate to discuss vendor-specific topics at length on a
> vendor-neutral Apache mailing list? Check out this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E
> You can always discuss vendor specific issues in their respective mailing
> lists.
> As for merging files: yes, one can use HBase, but then you have to keep in
> mind that you are adding the overhead of developing and maintaining
> another store (i.e. HBase). If your use case can be satisfied with HDFS
> alone, then why not keep it simple? Given what the OP has told us about the
> requirements, I think the SequenceFile format should work,
> as I suggested initially. Of course, if things get too complicated from a
> requirements perspective, then one might try out HBase.
> Regards,
> Shahab
> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>> It isn’t? I don’t wanna hijack the thread or anything, but it seems to
>> me that MapR is an implementation of Hadoop and this is a great place to
>> discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.
>> A little bit more on topic: Every single thing I read or watch about
>> Hadoop says that many small files is a bad idea and that you should merge
>> them into larger files. I’ll take this a step further. If your invoice data
>> is so small, perhaps Hadoop isn’t the proper solution to whatever it is you
>> are trying to do and a more traditional RDBMS approach would be more
>> appropriate. Someone suggested HBase and I was going to suggest maybe one
>> of the other NoSQL databases, however, I remember that Eddie Satterly of
>> Splunk says that financial data is the ONE use case where a traditional
>> approach is more appropriate. You can watch his talk here:
>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>  *From:* Kilaru, Sambaiah <Sambaiah_Kilaru@intuit.com>
>> *Sent:* Sunday, July 20, 2014 3:47 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Merging small files
>> This is not the place to discuss the merits or demerits of MapR. Small files
>> also cause serious problems with MapR:
>> small files go into one container (to fill up 256 MB or whatever the
>> container size is), and with locality most
>> of the mappers go to three datanodes.
>> You should be looking into the SequenceFile format.
>> Thanks,
>> Sam
>> From: "M. C. Srivas" <mcsrivas@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Sunday, July 20, 2014 at 8:01 AM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Re: Merging small files
>> You should look at MapR ... a few hundred billion small files is
>> absolutely no problem. (Disclosure: I work for MapR.)
>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>> Hi,
>>> Has anybody worked on a retail use case? My production Hadoop cluster's
>>> block size is 256 MB, but when we process retail invoice
>>> data, each invoice is merely, let's say, 4 KB. Do we merge the invoice
>>> data into one large file of, say, 1 GB? What is the best practice in this
>>> scenario?
>>> Regards
>>> Shashi
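
For scale, the figures in this thread can be worked through as a rough calculation. The 4 KB invoice size, 256 MB block size, and 1 GB merged file size come from the question above. The weekly volume is an assumption: it treats the roughly 5 million customers mentioned in the latest reply as exactly one invoice each per week, which is only a guess.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Figures taken from the original question.
invoice_size = 4 * KB
block_size = 256 * MB
merged_file_size = 1 * GB

invoices_per_merged_file = merged_file_size // invoice_size  # 262144
blocks_per_merged_file = merged_file_size // block_size      # 4
invoices_per_block = block_size // invoice_size              # 65536

# ASSUMPTION: one invoice per customer per week, 5 million customers.
weekly_invoices = 5_000_000
weekly_bytes = weekly_invoices * invoice_size

# Ceiling division: number of 1 GB merged files and 256 MB blocks per week.
merged_files_per_week = -(-weekly_bytes // merged_file_size)  # 20
merged_blocks_per_week = -(-weekly_bytes // block_size)       # 77

# Unmerged, every invoice is its own file and needs at least one block entry.
unmerged_files_per_week = weekly_invoices

print(f"{unmerged_files_per_week} small files vs "
      f"{merged_files_per_week} merged files ({merged_blocks_per_week} blocks)")
```

Under these assumptions, a week of invoices is about 19 GiB, so merging turns 5 million separate files into about 20 files of 1 GB each.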
