hadoop-mapreduce-user mailing list archives

From Wellington Chevreuil <wellington.chevre...@gmail.com>
Subject Re: Getting custom input splits from files that are not byte-aligned or line-aligned
Date Sat, 23 Feb 2013 19:05:30 GMT

I think you'll have to implement your own custom FileInputFormat, using
the library you mentioned to read the records from your files correctly
and distribute them across map tasks.
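
As a rough illustration of the record-grouping step such an input format
would need, here is a minimal plain-Java sketch that uses no Hadoop
classes. PullReader is a hypothetical stand-in for your library's
"reader" object (next() returns null at end of input), and RecordChunker,
chunk() and readerOver() are names invented for this example. The target
size is counted in characters, which only approximates bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the library's pull-style reader. */
interface PullReader {
    /** Returns the next record, or null when the input is exhausted. */
    String next();
}

class RecordChunker {

    /**
     * Pulls records one by one and groups them into chunks whose total
     * character count does not exceed targetChars, except when a single
     * record is larger than the target (it then gets a chunk of its own).
     */
    static List<List<String>> chunk(PullReader reader, long targetChars) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long size = 0;
        String record;
        while ((record = reader.next()) != null) {
            // Close the current chunk before it would exceed the target,
            // but never emit an empty chunk.
            if (!current.isEmpty() && size + record.length() > targetChars) {
                chunks.add(current);
                current = new ArrayList<>();
                size = 0;
            }
            current.add(record);
            size += record.length();
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    /** Test helper: exposes an in-memory list through the pull interface. */
    static PullReader readerOver(List<String> records) {
        Iterator<String> it = records.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        PullReader reader = readerOver(Arrays.asList("aaaa", "bb", "cccccc", "d"));
        // Prints [[aaaa, bb], [cccccc], [d]]
        System.out.println(chunk(reader, 6));
    }
}
```

In a real job, each chunk would presumably be written to its own HDFS
file and referenced by one FileSplit, along the lines of your second
alternative. The reading here is still sequential, so this does not by
itself parallelize the initial pass over the file.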

On 23/02/2013 14:14, "Public Network Services" <
publicnetworkservices@gmail.com> wrote:

> Hi...
> I use an application that processes text files containing data records
> which are of variable size and not line-aligned.
> The application implementation includes a Java library with a "reader"
> object that can extract records one-by-one in a "pull" fashion, as strings,
> i.e. for each such "reader" object the client code can call
> reader.next()
> and get an entire record as a String. So, proceeding in this fashion, the
> client code can consume a file of arbitrarily long length, from start to
> end, whereupon a null value is returned.
> Another peculiarity is that the extracted record strings may lose some
> secondary information (e.g., trim spaces), so exact byte alignment of the
> records to the underlying data is not possible.
> How could the above code be used to efficiently split compliant text files
> of large size (ranging from hundreds of megabytes to several gigabytes and
> terabytes in size)?
> The source code I have seen in FileInputFormat and numerous other
> implementations is line- or byte-aligned, so it is not applicable to the
> above case.
> It would actually be very useful if there was a template implementation
> that left only the string record "reader" object unspecified and did
> everything else, but apparently there is none.
> Two alternatives that should work are:
>    1. Split the files outside Hadoop (e.g., to sizes less than 64 MB) and
>    supply them to HDFS afterwards, returning false in the isSplitable() method
>    of the custom InputFormat.
>    2. Read and write records into HDFS files in the getSplits() method of
>    the custom InputFormat and create one FileSplit reference for each of these
>    HDFS files, once they are filled to the desired size.
> Is there any better approach and/or any example code relevant to the above?
> Thanks!
