Subject: Re: Merging small files
From: Mark Kerzner
To: Hadoop User <user@hadoop.apache.org>
Date: Sun, 20 Jul 2014 14:08:45 -0500

Bob,

you don't have to wait for batch. Here is my project (under development)
where I am using Storm for continuous file processing:
https://github.com/markkerzner/3VEed

Mark

On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

> Yeah, I'm sorry, I'm not talking about processing the files in Oracle. I
> mean collect/store invoices in Oracle, then flush them in a batch to
> Hadoop. This is not real time, right? So you take your EDI, CSV and XML
> from their sources. Store them in Oracle. Once you have a decent size,
> flush them to Hadoop in one big file, process them, then store the
> results of the processing in Oracle.
>
> Source file -> Oracle -> Hadoop -> Oracle
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> From: Shashidhar Rao
> Sent: Sunday, July 20, 2014 12:47 PM
> To: user@hadoop.apache.org
> Subject: Re: Merging small files
>
> Spring Batch is used to process the files, which come in EDI, CSV and
> XML format, and to store them in Oracle after processing, but this is
> for a very small division. Imagine invoices generated roughly by 5
> million customers every week from all stores plus online purchases. The
> time to process such massive data would not be acceptable, even though
> Oracle would be a good choice, as Adaryl Bob has suggested. Each invoice
> is not even 10 KB, and we have no choice but to use Hadoop, but we need
> further processing of the input files just to make Hadoop happy.
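(A back-of-the-envelope check, not stated in the thread: 5,000,000
invoices a week at roughly 10 KB each is only about 50 GB of data, some
200 blocks' worth at a 256 MB block size, yet it is also roughly
5,000,000 filesystem objects per week. The file count, not the data
volume, is what strains HDFS, which is why merging comes up at all.)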
>
> On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>> "Even if we kept the discussion to the mailing list's technical Hadoop
>> usage focus, any company/organization looking to use a distro is going
>> to have to consider the costs, support, platform, partner ecosystem,
>> market share, company strategy, etc."
>>
>> Yeah, good point.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> From: Shahab Yunus
>> Sent: Sunday, July 20, 2014 11:32 AM
>> To: user@hadoop.apache.org
>> Subject: Re: Merging small files
>>
>> As for why it isn't appropriate to discuss too many vendor-specific
>> topics on a vendor-neutral Apache mailing list, check out this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E
>>
>> You can always discuss vendor-specific issues on their respective
>> mailing lists.
>>
>> As for merging files: yes, one can use HBase, but then you have to keep
>> in mind that you are adding the overhead of developing and maintaining
>> another store (i.e. HBase). If your use case can be satisfied with HDFS
>> alone, then why not keep it simple? And given the knowledge of the
>> requirements that the OP provided, I think the SequenceFile format
>> should work, as I suggested initially. Of course, if things get too
>> complicated from a requirements perspective, then one might try out
>> HBase.
>>
>> Regards,
>> Shahab
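A minimal sketch of the SequenceFile packing suggested above, for
reference: every small invoice becomes one record in a single large
file, keyed by its original file name. The paths, key/value types, and
compression choice here are illustrative assumptions, not details from
the thread.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class InvoicePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // directory of small invoice files
        Path out = new Path(args[1]);  // one large SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            SequenceFile.Writer.compression(
                SequenceFile.CompressionType.BLOCK));
        try {
          for (FileStatus stat : fs.listStatus(in)) {
            if (!stat.isFile()) {
              continue;                // skip subdirectories
            }
            byte[] body = new byte[(int) stat.getLen()];
            FSDataInputStream is = fs.open(stat.getPath());
            try {
              is.readFully(body);      // each invoice is only a few KB
            } finally {
              is.close();
            }
            // key = original file name, value = raw invoice bytes
            writer.append(new Text(stat.getPath().getName()),
                          new BytesWritable(body));
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

Block compression groups many small records together before compressing,
which suits records this small, and keeping the original file name in
the key leaves each invoice addressable after the merge.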
>>
>> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>> It isn't? I don't wanna hijack the thread or anything, but it seems to
>>> me that MapR is an implementation of Hadoop, and this is a great place
>>> to discuss its merits vis-a-vis the Hortonworks or Cloudera offerings.
>>>
>>> A little bit more on topic: every single thing I read or watch about
>>> Hadoop says that many small files are a bad idea and that you should
>>> merge them into larger files. I'll take this a step further: if your
>>> invoice data is so small, perhaps Hadoop isn't the proper solution to
>>> whatever it is you are trying to do, and a more traditional RDBMS
>>> approach would be more appropriate. Someone suggested HBase, and I was
>>> going to suggest maybe one of the other NoSQL databases; however, I
>>> remember that Eddie Satterly of Splunk says that financial data is the
>>> ONE use case where a traditional approach is more appropriate. You can
>>> watch his talk here:
>>>
>>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>> From: Kilaru, Sambaiah
>>> Sent: Sunday, July 20, 2014 3:47 AM
>>> To: user@hadoop.apache.org
>>> Subject: Re: Merging small files
>>>
>>> This is not the place to discuss the merits or demerits of MapR, but
>>> small files screw things up very badly with MapR too: small files go
>>> into one container (to fill up 256 MB, or whatever the container size
>>> is), and with locality most of the mappers go to three datanodes.
>>>
>>> You should be looking into the SequenceFile format.
>>>
>>> Thanks,
>>> Sam
>>>
>>> From: "M. C. Srivas" <mcsrivas@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Subject: Re: Merging small files
>>>
>>> You should look at MapR ... a few hundred billion small files are
>>> absolutely no problem. (Disclosure: I work for MapR.)
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has anybody worked on a retail use case? My production Hadoop cluster
>>>> block size is 256 MB, but the retail invoice data we have to process
>>>> is tiny: each invoice is merely, let's say, 4 KB. Do we merge the
>>>> invoice data to make one large file, say 1 GB? What is the best
>>>> practice in this scenario?
>>>>
>>>> Regards
>>>> Shashi
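Once the invoices are packed this way, the downstream job reads whole
invoices as records instead of opening millions of tiny files. A sketch
of that consuming side, again under assumed names and simplified to a
map-only job:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InvoiceJob {

      // One map() call per invoice record from the SequenceFile.
      public static class InvoiceMapper
          extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text fileName, BytesWritable invoice, Context ctx)
            throws IOException, InterruptedException {
          // A real job would parse the EDI/CSV/XML payload here; this
          // placeholder just reports the record size.
          ctx.write(fileName,
                    new Text("parsed " + invoice.getLength() + " bytes"));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "invoice-processing");
        job.setJarByClass(InvoiceJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(InvoiceMapper.class);
        job.setNumReduceTasks(0);      // map-only for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }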