hadoop-common-user mailing list archives

From "Kilaru, Sambaiah" <Sambaiah_Kil...@intuit.com>
Subject Re: Merging small files
Date Sun, 20 Jul 2014 17:56:37 GMT
I have had experience with MapR where small files are much worse. Agreed, MapR can store small
files better, but storing is not the answer:
what happens when you want to run a job?
A container stores the files, and the container gets replicated, meaning one container (of 256 MB
or 128 MB or whatever size it is configured to) is
replicated. The moment you start an M/R job (and don't use CombineFileInputFormat) you are
actually launching map tasks on only those three nodes due
to data locality.

Small files are bad with Hadoop, and worse with MapR when you want to run a job.
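To see why CombineFileInputFormat matters here, consider a toy sketch in Python (this is an illustration of the combining idea, not Hadoop code): without combining, each small file becomes its own input split and hence its own map task; with combining, many small files are packed into a few large splits.

```python
# Toy illustration of split packing, not Hadoop's actual implementation:
# greedily group small files into splits no larger than max_split_bytes,
# the way CombineFileInputFormat avoids one map task per tiny file.
def pack_into_splits(file_sizes, max_split_bytes):
    """file_sizes: list of (name, size_in_bytes). Returns lists of names."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 10,000 invoices of ~4 KB each against a 256 MB split size:
invoices = [(f"invoice-{i}.xml", 4 * 1024) for i in range(10_000)]
splits = pack_into_splits(invoices, 256 * 1024 * 1024)
print(len(invoices), "files ->", len(splits), "split(s)")
```

Without combining you would launch 10,000 map tasks; packed, the same 40 MB of data fits in a single split.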


From: MBA <adaryl.wakefield@hotmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Sunday, July 20, 2014 at 9:54 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Merging small files

It isn’t? I don’t wanna hijack the thread or anything, but it seems to me that MapR is
an implementation of Hadoop, and this is a great place to discuss its merits vis-à-vis the
Hortonworks or Cloudera offerings.

A little bit more on topic: every single thing I read or watch about Hadoop says that many
small files are a bad idea and that you should merge them into larger files. I’ll take this
a step further: if your invoice data is so small, perhaps Hadoop isn’t the proper solution
for whatever it is you are trying to do, and a more traditional RDBMS approach would be more
appropriate. Someone suggested HBase, and I was going to suggest maybe one of the other NoSQL
databases; however, I remember that Eddie Satterly of Splunk says that financial data is the
ONE use case where a traditional approach is more appropriate. You can watch his talk here:


Adaryl "Bob" Wakefield, MBA
Mass Street Analytics

From: "Kilaru, Sambaiah" <Sambaiah_Kilaru@intuit.com>
Sent: Sunday, July 20, 2014 3:47 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

This is not the place to discuss the merits or demerits of MapR. Small files behave very badly
with MapR too: small files go into one container (filling up 256 MB, or whatever the container
size is), and with data locality most
of the mappers go to just three datanodes.

You should be looking into the sequence file format.
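The idea behind a sequence file is simply to pack many small files into one large file of (key, value) records, so HDFS stores a handful of big blocks instead of thousands of tiny ones. A minimal sketch of that idea in Python (an assumed toy format for illustration, not Hadoop's actual SequenceFile binary layout):

```python
# Toy packed-record format: each record is a length-prefixed
# (filename, contents) pair appended to one large file.
import struct

def write_packed(records, path):
    """records: iterable of (key: bytes, value: bytes) pairs."""
    with open(path, "wb") as out:
        for key, value in records:
            # 4-byte big-endian lengths, then the raw key and value bytes.
            out.write(struct.pack(">II", len(key), len(value)))
            out.write(key)
            out.write(value)

def read_packed(path):
    """Yield (key, value) pairs back out of a packed file."""
    with open(path, "rb") as f:
        while header := f.read(8):
            klen, vlen = struct.unpack(">II", header)
            yield f.read(klen), f.read(vlen)

# Merge three tiny "invoices" into one file and read them back:
invoices = [(b"inv-001", b"total=42"),
            (b"inv-002", b"total=7"),
            (b"inv-003", b"total=13")]
write_packed(invoices, "merged.dat")
assert list(read_packed("merged.dat")) == invoices
```

In real Hadoop you would use org.apache.hadoop.io.SequenceFile with the filename as the key and the file contents as the value; this sketch just shows why one merged file replaces thousands of namenode entries.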


From: "M. C. Srivas" <mcsrivas@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Sunday, July 20, 2014 at 8:01 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Merging small files

You should look at MapR .... a few hundred billion small files is absolutely no problem.
(disclosure: I work for MapR)

On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Hi,

Has anybody worked on a retail use case? My production Hadoop cluster's block size is 256 MB,
but retail invoice data is tiny: each invoice is merely, let's
say, 4 KB. Do we merge the invoice data to make one large file, say 1 GB? What is the best
practice in this scenario?

