Subject: Re: Getting custom input splits from files that are not byte-aligned or line-aligned
From: Public Network Services <publicnetworkservices@gmail.com>
To: user@hadoop.apache.org
Date: Sat, 23 Feb 2013 11:40:02 -0800 (PST)

This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading (option 2 in my
original post).
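Roughly, what I have in mind for option 2 is the following untested sketch,
in which LibraryReader is a hypothetical stand-in for the library's "reader"
class and every other name is made up as well:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Option 2 sketch: getSplits() pulls records out of each input file with
// the library reader, rewrites them into ~64 MB chunk files on HDFS, and
// returns one FileSplit per chunk file. Assumes the extracted record
// strings contain no embedded newlines; otherwise another delimiter (or a
// SequenceFile) would be needed.
public class PreChunkingInputFormat extends FileInputFormat<LongWritable, Text> {

  private static final long CHUNK_BYTES = 64L * 1024 * 1024;

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    Configuration conf = job.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : listStatus(job)) {
      LibraryReader reader = new LibraryReader(fs.open(file.getPath())); // hypothetical
      int part = 0;
      long written = 0;
      Path chunk = chunkPath(conf, file.getPath(), part);
      FSDataOutputStream out = fs.create(chunk);
      for (String record = reader.next(); record != null; record = reader.next()) {
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        if (written > 0 && written + bytes.length > CHUNK_BYTES) {
          // Current chunk is full: seal it into a split and start a new one.
          out.close();
          splits.add(new FileSplit(chunk, 0, written, new String[0]));
          chunk = chunkPath(conf, file.getPath(), ++part);
          out = fs.create(chunk);
          written = 0;
        }
        out.write(bytes);
        written += bytes.length;
      }
      out.close();
      if (written > 0) {
        splits.add(new FileSplit(chunk, 0, written, new String[0]));
      } else {
        fs.delete(chunk, false); // drop an empty trailing chunk
      }
      reader.close();
    }
    return splits;
  }

  // Hypothetical naming scheme for the intermediate chunk files.
  private static Path chunkPath(Configuration conf, Path source, int part) {
    return new Path(conf.get("hadoop.tmp.dir", "/tmp"),
        source.getName() + ".chunk-" + part);
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // The chunk files are newline-delimited, so plain line reading works.
    return new LineRecordReader();
  }
}

The obvious downside is that getSplits() runs single-threaded on the client
and copies the whole data set once before the job starts, which is exactly
why I am wondering whether there is a better approach.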
On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
wellington.chevreuil@gmail.com> wrote:

> Hi,
>
> I think you'll have to implement your own custom FileInputFormat, using
> the library you mentioned to properly read your file records and split
> them across map tasks.
>
> Regards,
> Wellington.
>
> On 23/02/2013 14:14, "Public Network Services" <
> publicnetworkservices@gmail.com> wrote:
>
>> Hi...
>>
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>>
>> The application implementation includes a Java library with a "reader"
>> object that can extract records one by one in a "pull" fashion, as
>> strings, i.e., for each such "reader" object the client code can call
>>
>>     reader.next()
>>
>> and get an entire record as a String. Proceeding in this fashion, the
>> client code can consume a file of arbitrary length from start to end,
>> whereupon a null value is returned.
>>
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trimmed whitespace), so exact byte
>> alignment of the records to the underlying data is not possible.
>>
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes, or even terabytes)?
>>
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line- or byte-aligned, so it is not applicable to
>> the above case.
>>
>> It would actually be very useful if there were a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
>>
>> Two alternatives that should work are:
>>
>>    1. Split the files outside Hadoop (e.g., to sizes less than 64 MB)
>>    and supply them to HDFS afterwards, returning false in the
>>    isSplitable() method of the custom InputFormat.
>>    2. Read and write records into HDFS files in the getSplits() method
>>    of the custom InputFormat and create one FileSplit reference for
>>    each of these HDFS files, once they are filled to the desired size.
>>
>> Is there any better approach and/or any example code relevant to the
>> above?
>>
>> Thanks!
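P.S. For reference, the non-splittable variant (option 1 above, which is
also what Wellington suggested) would look roughly like the following,
again as an untested sketch with LibraryReader standing in for the actual
library class:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Option 1 sketch: the files are pre-split outside Hadoop to below the
// block size, so each file becomes exactly one split and the record
// reader just pulls whole records until the library returns null.
public class LibraryRecordInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one map task per pre-split file
  }

  @Override
  public RecordReader<NullWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, Text>() {
      private FSDataInputStream in;
      private LibraryReader reader; // hypothetical library "reader" object
      private final Text value = new Text();

      @Override
      public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
          throws IOException {
        Path path = ((FileSplit) genericSplit).getPath();
        FileSystem fs = path.getFileSystem(ctx.getConfiguration());
        in = fs.open(path);
        reader = new LibraryReader(in);
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        String record = reader.next(); // null once the file is consumed
        if (record == null) {
          return false;
        }
        value.set(record);
        return true;
      }

      @Override
      public NullWritable getCurrentKey() {
        return NullWritable.get();
      }

      @Override
      public Text getCurrentValue() {
        return value;
      }

      @Override
      public float getProgress() {
        return 0.0f; // records are not byte-aligned, so progress is unknown
      }

      @Override
      public void close() throws IOException {
        in.close();
      }
    };
  }
}

A job would then just set this class as its input format, e.g.
job.setInputFormatClass(LibraryRecordInputFormat.class), and each map()
call would receive one whole record as its value.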