From: "Adaryl \"Bob\" Wakefield, MBA" <adaryl.wakefield@hotmail.com>
To: user@hadoop.apache.org
Subject: Re: Merging small files
Date: Sun, 20 Jul 2014 11:37:54 -0500
"Even if we kept the discussion to the mailing list's technical Hadoop usage focus, any company/organization looking to use a distro is going to have to consider the costs, support, platform, partner ecosystem, market share, company strategy, etc."

Yeah, good point.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Shahab Yunus
Sent: Sunday, July 20, 2014 11:32 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

Why isn't it appropriate to discuss vendor-specific topics on a vendor-neutral Apache mailing list? Check out this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E

You can always discuss vendor-specific issues on their respective mailing lists.

As for merging files: yes, one can use HBase, but then you have to keep in mind that you are adding the development and maintenance overhead of another store (i.e., HBase). If your use case can be satisfied with HDFS alone, why not keep it simple? Given the requirements the OP provided, I think the SequenceFile format should work, as I suggested initially. Of course, if the requirements get too complicated, one might try out HBase.

Regards,
Shahab

On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefield@hotmail.com> wrote:

It isn't?
I don't want to hijack the thread or anything, but it seems to me that MapR is an implementation of Hadoop, and this is a great place to discuss its merits vis-à-vis the Hortonworks or Cloudera offerings.

A little more on topic: everything I read or watch about Hadoop says that many small files are a bad idea and that you should merge them into larger files. I'll take this a step further: if your invoice data is so small, perhaps Hadoop isn't the proper solution for whatever you are trying to do, and a more traditional RDBMS approach would be more appropriate. Someone suggested HBase, and I was going to suggest one of the other NoSQL databases; however, I remember that Eddie Satterly of Splunk says financial data is the ONE use case where a traditional approach is more appropriate. You can watch his talk here:
https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba

From: Kilaru, Sambaiah
Sent: Sunday, July 20, 2014 3:47 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

This is not the place to discuss the merits or demerits of MapR. Small files behave very badly on MapR: small files go into one container (filling up 256 MB, or whatever the container size is), and with locality most of the mappers go to three datanodes.

You should be looking into the SequenceFile format.

Thanks,
Sam

From: "M. C. Srivas" <mcsrivas@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Sunday, July 20, 2014 at 8:01 AM
To: user@hadoop.apache.org
Subject: Re: Merging small files

You should look at MapR... a few hundred billion small files are absolutely no problem. (Disclosure: I work for MapR.)

On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:

Hi,

Has anybody worked on a retail use case?
My production Hadoop cluster block size is 256 MB, but we have to process retail invoice data, and each invoice is merely, let's say, 4 KB. Do we merge the invoice data to make one large file of, say, 1 GB? What is the best practice in this scenario?

Regards,
Shashi
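To put rough numbers on the small-files problem the thread keeps circling back to: with a 256 MB block size and 4 KB invoices, a quick back-of-the-envelope calculation shows why merging matters. (The ~150 bytes of NameNode heap per HDFS object used below is a commonly cited rough estimate, not an exact constant, and the 100 million invoice count is an illustrative assumption.)

```python
# Back-of-the-envelope arithmetic for the HDFS small-files problem.
# The ~150 bytes/object NameNode figure is a rough community estimate.

BLOCK_SIZE = 256 * 1024 * 1024      # 256 MB HDFS block
INVOICE_SIZE = 4 * 1024             # 4 KB per invoice
NAMENODE_BYTES_PER_OBJECT = 150     # rough heap cost per file/block entry

# How many 4 KB invoices fit in a single 256 MB block?
invoices_per_block = BLOCK_SIZE // INVOICE_SIZE
print(invoices_per_block)           # 65536

# Stored as individual files, 100 million invoices cost the NameNode
# at least one file entry and one block entry each (~2 objects apiece).
n_invoices = 100_000_000
heap_unmerged = n_invoices * 2 * NAMENODE_BYTES_PER_OBJECT
print(heap_unmerged / 1024**3)      # ~28 GiB of NameNode heap

# Merged into 1 GiB files, the same data is ~382 files of 4 blocks each.
merged_file_size = 1024**3
total_bytes = n_invoices * INVOICE_SIZE
n_merged_files = -(-total_bytes // merged_file_size)   # ceiling division
blocks_per_file = -(-merged_file_size // BLOCK_SIZE)
heap_merged = n_merged_files * (1 + blocks_per_file) * NAMENODE_BYTES_PER_OBJECT
print(heap_merged / 1024)           # a few hundred KiB instead of ~28 GiB
```

Whatever the exact per-object constant, the ratio is what matters: merging shrinks the NameNode's metadata load by roughly the file-count reduction factor, which is why every answer in the thread converges on "merge them".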
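The SequenceFile recommended above is a Hadoop-specific binary key/value container, normally written through the Java `SequenceFile.Writer` API, with a common pattern being the small file's name as key and its bytes as value. As a minimal, language-neutral sketch of the underlying idea only — packing many small records into one large file as length-prefixed key/value pairs — here is a hypothetical packer; this is NOT the real SequenceFile wire format, just an illustration of the concept:

```python
import io
import struct

def pack_records(records):
    """Pack (key, value) byte pairs into one buffer as
    length-prefixed records: [key_len][key][val_len][value]."""
    buf = io.BytesIO()
    for key, value in records:
        buf.write(struct.pack(">I", len(key)))   # 4-byte big-endian length
        buf.write(key)
        buf.write(struct.pack(">I", len(value)))
        buf.write(value)
    return buf.getvalue()

def unpack_records(data):
    """Inverse of pack_records: yield (key, value) pairs in order."""
    view, offset = memoryview(data), 0
    while offset < len(data):
        (klen,) = struct.unpack_from(">I", view, offset)
        offset += 4
        key = bytes(view[offset:offset + klen])
        offset += klen
        (vlen,) = struct.unpack_from(">I", view, offset)
        offset += 4
        value = bytes(view[offset:offset + vlen])
        offset += vlen
        yield key, value

# Merge many hypothetical small "invoice files" (name -> bytes) into one blob.
invoices = {f"invoice_{i}.xml".encode(): b"<total>4KB of data</total>"
            for i in range(1000)}
blob = pack_records(sorted(invoices.items()))
assert dict(unpack_records(blob)) == invoices   # lossless round trip
```

The real SequenceFile adds sync markers (so MapReduce can split the file at block boundaries), optional record/block compression, and serialized Writable types, which is exactly why it is the stock answer for the many-small-files case rather than a hand-rolled format like this one.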