Date: Tue, 10 Sep 2013 12:21:26 -0400
Subject: Re: Concatenate multiple sequence files into 1 big sequence file
From: Jerry Lam <chilinglam@gmail.com>
To: user@hadoop.apache.org

Hi guys,

Thank you for all the advice here. I really appreciate it.

I read through the code in filecrush and found that it does exactly what
I'm currently doing. The logic resides in CrushReducer.java, in the
following lines that do the concatenation:

    while (reader.next(key, value)) {
        sink.write(key, value);
        reporter.incrCounter(ReducerCounter.RECORDS_CRUSHED, 1);
    }

I wonder if there are faster ways to do this? Preferably a solution that
only streams a set of sequence files into the final sequence file.

Best Regards,

Jerry

On Tue, Sep 10, 2013 at 11:20 AM, Adam Muise <amuise@hortonworks.com> wrote:
> Jerry,
>
> It might not help with this particular file, but you might consider the
> approach used at BlackBerry when dealing with your data. They
> block-compressed data into small Avro files and then concatenated them
> into large Avro files without decompressing. Check out the Boom file
> format here:
>
> https://github.com/blackberry/hadoop-logdriver
>
> For now, use filecrush:
> https://github.com/edwardcapriolo/filecrush
>
> Cheers,
>
> On Tue, Sep 10, 2013 at 11:07 AM, Jerry Lam <chilinglam@gmail.com> wrote:
>> Hi Hadoop users,
>>
>> I have been trying to concatenate multiple sequence files into one.
>> Since the total size of the sequence files is quite big (1 TB), I won't
>> use MapReduce, because it would require 1 TB on the reducer host to hold
>> the temporary data.
>> I ended up doing what was suggested in this thread:
>> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E
>>
>> It works very well. I wonder if there is a faster way to append to a
>> sequence file.
>>
>> Currently, the code looks like this (omitting opening and closing the
>> sequence files, exception handling, etc.):
>>
>>     // each seq is a sequence file
>>     // writer is a sequence file writer
>>     for (val seq : seqs) {
>>         reader = new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
>>         while (reader.next(readerKey, readerValue)) {
>>             writer.append(readerKey, readerValue);
>>         }
>>     }
>>
>> Is there a better way to do this? Note that I think it is wasteful to
>> deserialize and serialize the key and value in the while loop, because
>> the program simply appends to the sequence file. Also, I don't seem to
>> be able to read and write fast enough (about 6 MB/sec).
>>
>> Any advice is appreciated,
>>
>> Jerry

> --
> Adam Muise
> Solution Engineer
> Hortonworks
> amuise@hortonworks.com
> 416-417-4037
>
> Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop.
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
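[Editor's note] On the "streaming without (de)serializing" question above: a sequence file cannot be built by naive byte concatenation, since each file carries its own header and sync markers, but Hadoop's SequenceFile API does expose raw-record methods (Reader.nextRawKey/nextRawValue and Writer.appendRaw in Hadoop 1.x/2.x) that copy records without deserializing keys and values, provided source and destination use matching compression settings. The underlying byte-streaming pattern itself can be illustrated without any Hadoop dependency using zero-copy FileChannel.transferTo; the class and method names below are hypothetical, a sketch rather than a drop-in replacement:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class StreamConcat {
    /**
     * Byte-level concatenation of plain files via FileChannel.transferTo,
     * which lets the OS stream bytes without copying them through user space.
     * (A real sequence-file merge would copy raw records, not whole files.)
     */
    public static void concat(List<Path> inputs, Path output) throws IOException {
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path in : inputs) {
                try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ)) {
                    long pos = 0;
                    long size = src.size();
                    // transferTo may move fewer bytes than requested, so loop.
                    while (pos < size) {
                        pos += src.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }
}
```

Whether this helps with the observed 6 MB/sec depends on where the bottleneck is; if it is per-record serialization, the raw-record API is the closer analogue, and if it is I/O, no record-level change will help.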