hadoop-hdfs-user mailing list archives

From Public Network Services <publicnetworkservi...@gmail.com>
Subject Re: Getting custom input splits from files that are not byte-aligned or line-aligned
Date Sat, 23 Feb 2013 19:40:02 GMT
This appears to be the case.

My main issue is not reading the records (the library offers that
functionality) but putting them into splits after reading (option 2 in my
original post).

On Sat, Feb 23, 2013 at 11:05 AM, Wellington Chevreuil <
wellington.chevreuil@gmail.com> wrote:

> Hi,
> I think you'll have to implement your own custom FileInputFormat, using
> this lib you mentioned to properly read your file records and split them
> through map tasks.
> Regards,
> Wellington.
> On 23/02/2013 14:14, "Public Network Services" <
> publicnetworkservices@gmail.com> wrote:
>> Hi...
>> I use an application that processes text files containing data records
>> which are of variable size and not line-aligned.
>> The application implementation includes a Java library with a "reader"
>> object that can extract records one-by-one in a "pull" fashion, as strings,
>> i.e. for each such "reader" object the client code can call
>> reader.next()
>> and get an entire record as a String. So, proceeding in this fashion, the
>> client code can consume a file of arbitrary length, from start to
>> end, whereupon a null value is returned.
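
The pull-style "reader" described above can be sketched minimally as follows. All the names here (RecordPullReader, MockPullReader) are hypothetical stand-ins, since the thread never names the library's actual API; the point is only the next()-until-null contract:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the library's pull-style "reader" object:
// next() returns one whole record as a String, or null at end of input.
interface RecordPullReader {
    String next();
}

// Minimal mock backed by a fixed list of records, for illustration only.
class MockPullReader implements RecordPullReader {
    private final Iterator<String> it;
    MockPullReader(List<String> records) { this.it = records.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

public class PullLoop {
    // Drain the reader from start to end, as the client code described would.
    static List<String> readAll(RecordPullReader reader) {
        List<String> out = new ArrayList<>();
        for (String rec = reader.next(); rec != null; rec = reader.next()) {
            out.add(rec);
        }
        return out;
    }

    public static void main(String[] args) {
        RecordPullReader r = new MockPullReader(List.of("rec1", "rec2", "rec3"));
        System.out.println(readAll(r));  // prints [rec1, rec2, rec3]
    }
}
```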
>> Another peculiarity is that the extracted record strings may lose some
>> secondary information (e.g., trim spaces), so exact byte alignment of the
>> records to the underlying data is not possible.
>> How could the above code be used to efficiently split compliant text
>> files of large size (ranging from hundreds of megabytes to several
>> gigabytes, or even terabytes)?
>> The source code I have seen in FileInputFormat and numerous other
>> implementations is line or byte-aligned, so it is not applicable for the
>> above case.
>> It would actually be very useful if there was a template implementation
>> that left only the string record "reader" object unspecified and did
>> everything else, but apparently there is none.
>> Two alternatives that should work are:
>>    1. Split the files outside Hadoop (e.g., to sizes less than 64 MB)
>>    and supply them to HDFS afterwards, returning false in the isSplitable()
>>    method of the custom InputFormat.
>>    2. Read and write records into HDFS files in the getSplits() method
>>    of the custom InputFormat and create one FileSplit reference for each of
>>    these HDFS files, once they are filled to the desired size.
>> Is there any better approach and/or any example code relevant to the
>> above?
>> Thanks!
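
Both alternatives boil down to grouping whole records into size-bounded, record-aligned chunks. A minimal in-memory sketch of that grouping is below (the RecordChunker name and sizes are illustrative, not from the thread; a real job would stream each chunk to a file, hand it to HDFS, and return false from isSplitable() in the custom InputFormat so each chunk maps to exactly one split):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordChunker {
    // Group records into chunks whose encoded size stays under maxBytes,
    // always keeping whole records together. A single record larger than
    // maxBytes gets a chunk of its own rather than being split.
    static List<List<String>> chunk(Iterable<String> records, long maxBytes) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long size = 0;
        for (String rec : records) {
            // +1 accounts for a record separator (e.g., newline) on write-out.
            long recBytes = rec.getBytes(StandardCharsets.UTF_8).length + 1;
            if (size + recBytes > maxBytes && !current.isEmpty()) {
                chunks.add(current);     // roll over to a new chunk
                current = new ArrayList<>();
                size = 0;
            }
            current.add(rec);
            size += recBytes;
        }
        if (!current.isEmpty()) chunks.add(current);
        return chunks;
    }

    public static void main(String[] args) {
        // 5 + 5 bytes fill the first 10-byte chunk; "cc" rolls to a second.
        System.out.println(chunk(List.of("aaaa", "bbbb", "cc"), 10));
        // prints [[aaaa, bbbb], [cc]]
    }
}
```

In an actual MapReduce job the same loop would run either in a pre-processing step outside Hadoop (alternative 1) or inside getSplits() (alternative 2), with maxBytes set near the HDFS block size so each chunk file fits one block.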
