From: "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefield@hotmail.com>
To: user@hadoop.apache.org
Subject: Re: Merging small files
Date: Sun, 20 Jul 2014 11:37:54 -0500
"Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc."

Yeah, good point.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus
Sent: Sunday, July 20, 2014 11:32 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

Why isn't it appropriate to discuss vendor-specific topics on a vendor-neutral Apache mailing list? Check out this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E

You can always discuss vendor-specific issues on their respective mailing lists.

As for merging files: yes, one can use HBase, but then you have to keep in mind that you are adding the development and maintenance overhead of another store (i.e., HBase). If your use case can be satisfied with HDFS alone, why not keep it simple? Given the requirements the OP provided, I think the SequenceFile format should work, as I suggested initially. Of course, if the requirements get too complicated, one might try out HBase.

Regards,
Shahab

On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com> wrote:

It isn't?
I don't want to hijack the thread or anything, but it seems to me that MapR is an implementation of Hadoop, and this is a great place to discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.

A little more on topic: everything I read or watch about Hadoop says that many small files are a bad idea and that you should merge them into larger files. I'll take this a step further: if your invoice data is so small, perhaps Hadoop isn't the proper solution for whatever you are trying to do, and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase, and I was going to suggest one of the other NoSQL databases; however, I remember that Eddie Satterly of Splunk says financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:
https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Kilaru, Sambaiah
Sent: Sunday, July 20, 2014 3:47 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

This is not the place to discuss the merits or demerits of MapR. Small files behave very badly on MapR: small files go into one container (filling up 256 MB, or whatever the container size is), and with locality most of the mappers go to three datanodes.

You should be looking into the SequenceFile format.

Thanks,
Sam

From: "M. C. Srivas" <mcsrivas@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Sunday, July 20, 2014 at 8:01 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

You should look at MapR... a few hundred billion small files are absolutely no problem. (Disclosure: I work for MapR.)

On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:

Hi,

Has anybody worked on a retail use case?
My production Hadoop cluster block size is 256 MB, but we have to process retail invoice data, and each invoice is merely, let's say, 4 KB. Do we merge the invoice data to make one large file of, say, 1 GB? What is the best practice in this scenario?

Regards,
Shashi
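To put rough numbers on the small-files problem the thread keeps circling back to: with a 256 MB block size and 4 KB invoices, a quick back-of-the-envelope calculation shows why merging matters. (The ~150 bytes of NameNode heap per HDFS object used below is a commonly cited rough estimate, not an exact constant, and the 100 million invoice count is an illustrative assumption.)

```python
# Back-of-the-envelope arithmetic for the HDFS small-files problem.
# The ~150 bytes/object NameNode figure is a rough community estimate.

BLOCK_SIZE = 256 * 1024 * 1024      # 256 MB HDFS block
INVOICE_SIZE = 4 * 1024             # 4 KB per invoice
NAMENODE_BYTES_PER_OBJECT = 150     # rough heap cost per file/block entry

# How many 4 KB invoices fit in a single 256 MB block?
invoices_per_block = BLOCK_SIZE // INVOICE_SIZE
print(invoices_per_block)           # 65536

# Stored as individual files, 100 million invoices cost the NameNode
# at least one file entry and one block entry each (~2 objects apiece).
n_invoices = 100_000_000
heap_unmerged = n_invoices * 2 * NAMENODE_BYTES_PER_OBJECT
print(heap_unmerged / 1024**3)      # ~28 GiB of NameNode heap

# Merged into 1 GiB files, the same data is ~382 files of 4 blocks each.
merged_file_size = 1024**3
total_bytes = n_invoices * INVOICE_SIZE
n_merged_files = -(-total_bytes // merged_file_size)   # ceiling division
blocks_per_file = -(-merged_file_size // BLOCK_SIZE)
heap_merged = n_merged_files * (1 + blocks_per_file) * NAMENODE_BYTES_PER_OBJECT
print(heap_merged / 1024)           # a few hundred KiB instead of ~28 GiB
```

Whatever the exact per-object constant, the ratio is what matters: merging shrinks the NameNode's metadata load by roughly the file-count reduction factor, which is why every answer in the thread converges on "merge them".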
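The SequenceFile recommended above is a Hadoop-specific binary key/value container, normally written through the Java `SequenceFile.Writer` API, with a common pattern being the small file's name as key and its bytes as value. As a minimal, language-neutral sketch of the underlying idea only — packing many small records into one large file as length-prefixed key/value pairs — here is a hypothetical packer; this is NOT the real SequenceFile wire format, just an illustration of the concept:

```python
import io
import struct

def pack_records(records):
    """Pack (key, value) byte pairs into one buffer as
    length-prefixed records: [key_len][key][val_len][value]."""
    buf = io.BytesIO()
    for key, value in records:
        buf.write(struct.pack(">I", len(key)))   # 4-byte big-endian length
        buf.write(key)
        buf.write(struct.pack(">I", len(value)))
        buf.write(value)
    return buf.getvalue()

def unpack_records(data):
    """Inverse of pack_records: yield (key, value) pairs in order."""
    view, offset = memoryview(data), 0
    while offset < len(data):
        (klen,) = struct.unpack_from(">I", view, offset)
        offset += 4
        key = bytes(view[offset:offset + klen])
        offset += klen
        (vlen,) = struct.unpack_from(">I", view, offset)
        offset += 4
        value = bytes(view[offset:offset + vlen])
        offset += vlen
        yield key, value

# Merge many hypothetical small "invoice files" (name -> bytes) into one blob.
invoices = {f"invoice_{i}.xml".encode(): b"<total>4KB of data</total>"
            for i in range(1000)}
blob = pack_records(sorted(invoices.items()))
assert dict(unpack_records(blob)) == invoices   # lossless round trip
```

The real SequenceFile adds sync markers (so MapReduce can split the file at block boundaries), optional record/block compression, and serialized Writable types, which is exactly why it is the stock answer for the many-small-files case rather than a hand-rolled format like this one.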