hadoop-common-user mailing list archives

From Harish Mallipeddi <harish.mallipe...@gmail.com>
Subject Re: Large size Text file split
Date Wed, 10 Jun 2009 12:27:41 GMT
On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <wenrui.guo@ericsson.com> wrote:

> Hi all,
> I have a large CSV file (larger than 10 GB). I'd like to use a suitable
> InputFormat to split it into smaller parts so that each Mapper can process
> one piece of the file. However, as far as I know, FileInputFormat only
> looks at the byte size of the file, so it can divide the CSV file into
> many parts, and some of those parts may not be well-formed CSV.
> For example, a line in one part might not be terminated with CRLF, or
> some text might be truncated.
> How can I ensure that each FileSplit is a smaller, valid CSV file by using
> a proper InputFormat?
> BR/anderson

If all you care about is that the splits occur at line boundaries, then
TextInputFormat will work.

If not, I guess you can write your own InputFormat class.
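
To illustrate why byte-sized splits need not produce broken lines, here is a
minimal sketch in plain Java. It is not Hadoop's code, only a simplified model
of the line-aligned reading that TextInputFormat's record reader performs: a
reader whose split does not start at byte 0 skips ahead past the first
newline, and every reader runs past the end of its split to finish the last
line it started. The names LineSplitSketch, readSplit and readAllSplits are
made up for this example, and the input is assumed to fit in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of line-aligned split reading; not Hadoop's actual code.
final class LineSplitSketch {

    // Returns the lines that the reader for byte range [start, end) is responsible for.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start > 0) {
            // The previous split's reader finishes the line that crosses `start`,
            // so skip forward to just past the first newline.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }
        List<String> lines = new ArrayList<>();
        // Read every line that begins at or before `end`, running past `end`
        // if necessary to finish the last one.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            int textEnd = lineEnd;
            if (textEnd > pos && data[textEnd - 1] == '\r') {
                textEnd--; // drop the CR of a CRLF terminator
            }
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    // Cuts the data into fixed-size byte ranges and reads each one independently.
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\r\n1,alice\r\n2,bob\r\n3,carol\r\n".getBytes(StandardCharsets.UTF_8);
        // A 10-byte split size deliberately cuts through the middle of lines.
        System.out.println(readAllSplits(csv, 10));
    }
}
```

Whatever split size is chosen, each line should come out exactly once and
intact. Note that this only protects line boundaries: a CSV record whose
quoted field contains an embedded newline would still be cut in two, and that
is the kind of case where a custom InputFormat is likely needed.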

Harish Mallipeddi
