hadoop-common-user mailing list archives

From "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefi...@hotmail.com>
Subject Re: Merging small files
Date Sun, 20 Jul 2014 18:31:48 GMT
Yeah, I’m sorry, I’m not talking about processing the files in Oracle. I mean collect/store
the invoices in Oracle, then flush them in a batch to Hadoop. This is not real time, right? So you
take your EDI, CSV, and XML from their sources and store them in Oracle. Once you have a decent
amount, flush them to Hadoop as one big file, process them, then store the results of the processing
in Oracle.

Source file -> Oracle -> Hadoop -> Oracle
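
To make that concrete, here is a rough sketch of what the flush step might look like, assuming a
JDBC-reachable staging table in Oracle and the Hadoop 2.x SequenceFile API. The table name, column
names, connection string, and output path are all invented for illustration; treat this as an
outline of the idea, not a working ETL job.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Flush staged invoices from Oracle into a single SequenceFile on HDFS.
// Key = invoice id, value = raw invoice payload (XML/EDI/CSV as text).
public class InvoiceFlush {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/data/invoices/batch-" + System.currentTimeMillis() + ".seq");

    try (Connection db = DriverManager.getConnection(
             "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "etl_password");
         Statement stmt = db.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT invoice_id, invoice_payload FROM staged_invoices");
         SequenceFile.Writer writer = SequenceFile.createWriter(conf,
             SequenceFile.Writer.file(out),
             SequenceFile.Writer.keyClass(Text.class),
             SequenceFile.Writer.valueClass(Text.class))) {

      Text key = new Text();
      Text value = new Text();
      while (rs.next()) {
        key.set(rs.getString("invoice_id"));
        value.set(rs.getString("invoice_payload"));
        writer.append(key, value);
      }
    }
    // After a successful flush, mark or purge the staged rows in Oracle.
  }
}

Each flush produces one large file instead of millions of 4 KB ones, which is the whole point of
the staging step.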

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shashidhar Rao 
Sent: Sunday, July 20, 2014 12:47 PM
To: user@hadoop.apache.org 
Subject: Re: Merging small files

Spring Batch is used to process the files, which come in EDI, CSV, and XML formats, and store
them in Oracle after processing, but that is for a very small division. Imagine invoices generated
roughly by 5 million customers every week from all stores plus online purchases. The time
to process such massive data would not be acceptable, even though Oracle would be a good choice
as Adaryl Bob has suggested. Each invoice is not even 10 KB, and we have no choice but to use
Hadoop, but we need further processing of the input files just to make Hadoop happy.




On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
wrote:

  “Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any
company/organization looking to use a distro is going to have to consider the costs, support,
platform, partner ecosystem, market share, company strategy, etc.”

  Yeah good point.

  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba

  From: Shahab Yunus 
  Sent: Sunday, July 20, 2014 11:32 AM
  To: user@hadoop.apache.org 
  Subject: Re: Merging small files

  As for why it isn't appropriate to discuss too many vendor-specific topics on a vendor-neutral
Apache mailing list, check out this thread:
  http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E


  You can always discuss vendor specific issues in their respective mailing lists.

  As for merging files: yes, one can use HBase, but then you have to keep in mind that you are
adding the overhead of developing and maintaining another store (i.e., HBase). If your use
case can be satisfied with HDFS alone, then why not keep it simple? Given the requirements
the OP provided, I think the SequenceFile format should work, as I suggested
initially. Of course, if things get too complicated from a requirements perspective, then one
might try out HBase.
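
  A minimal sketch of that packing step, assuming the Hadoop 2.x SequenceFile.Writer API (the
class name and paths are invented): key = original file name, value = raw file bytes, so the
individual invoices can still be pulled back out downstream.

  import java.io.InputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Pack a directory of small files into one SequenceFile on HDFS.
  public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path inputDir = new Path(args[0]);  // e.g. /incoming/invoices
      Path outFile = new Path(args[1]);   // e.g. /merged/invoices.seq

      try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(outFile),
              SequenceFile.Writer.keyClass(Text.class),
              SequenceFile.Writer.valueClass(BytesWritable.class))) {

        for (FileStatus status : fs.listStatus(inputDir)) {
          if (status.isDirectory()) {
            continue;  // only plain files in this sketch
          }
          byte[] buf = new byte[(int) status.getLen()];
          try (InputStream in = fs.open(status.getPath())) {
            IOUtils.readFully(in, buf, 0, buf.length);
          }
          writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(buf));
        }
      }
    }
  }

  Each run turns a directory of tiny invoices into one large, splittable file that MapReduce can
process efficiently, instead of millions of small files hammering the NameNode.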

  Regards,
  Shahab



  On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com>
wrote:

    It isn’t? I don’t want to hijack the thread or anything, but it seems to me that MapR
is an implementation of Hadoop and this is a great place to discuss its merits vis-à-vis
the Hortonworks or Cloudera offerings.

    A little bit more on topic: every single thing I read or watch about Hadoop says that
having many small files is a bad idea and that you should merge them into larger files. I’ll take
this a step further: if your invoice data is that small, perhaps Hadoop isn’t the proper solution
to whatever it is you are trying to do, and a more traditional RDBMS approach would be more
appropriate. Someone suggested HBase, and I was going to suggest maybe one of the other NoSQL
databases; however, I remember that Eddie Satterly of Splunk says that financial data is the
ONE use case where a traditional approach is more appropriate. You can watch his talk here:

    https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

    Adaryl "Bob" Wakefield, MBA
    Principal
    Mass Street Analytics
    913.938.6685
    www.linkedin.com/in/bobwakefieldmba

    From: Kilaru, Sambaiah 
    Sent: Sunday, July 20, 2014 3:47 AM
    To: user@hadoop.apache.org 
    Subject: Re: Merging small files

    This is not the place to discuss the merits or demerits of MapR. Small files screw things up
very badly with MapR: small files go into one container (to fill up 256 MB, or whatever the
container size is), and with locality, most of the mappers go to three datanodes.

    You should be looking into the SequenceFile format.

    Thanks,
    Sam

    From: "M. C. Srivas" <mcsrivas@gmail.com>
    Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
    Date: Sunday, July 20, 2014 at 8:01 AM
    To: "user@hadoop.apache.org" <user@hadoop.apache.org>
    Subject: Re: Merging small files


    You should look at MapR... a few hundred billion small files is absolutely no problem.
(Disclosure: I work for MapR.)



    On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:

      Hi,


      Has anybody worked on a retail use case? My production Hadoop cluster block size is
256 MB, but we have to process retail invoice data where each invoice is merely, let's say,
4 KB. Do we merge the invoice data to make one large file of, say, 1 GB? What is the
best practice in this scenario?



      Regards

      Shashi



