hadoop-common-user mailing list archives

From "Wenrui Guo" <wenrui....@ericsson.com>
Subject Large size Text file split
Date Wed, 10 Jun 2009 12:06:43 GMT
Hi all,

I have a large CSV file (larger than 10 GB), and I'd like to use a suitable
InputFormat to split it into smaller parts so that each Mapper can process
one piece of the file. However, as far as I know, FileInputFormat only
considers the byte size of the file. That is, the class may divide the CSV
file into many parts, and some of those parts may not be well-formed CSV.
For example, a line in one part might not be terminated with CRLF, or
some of its text might be cut off.

How can I ensure that each FileSplit is a smaller, valid CSV file by using
an appropriate InputFormat?
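
For context on the split-boundary concern, here is a minimal Python sketch of
the convention that line-oriented readers (for example, the record reader
behind Hadoop's TextInputFormat) are generally understood to follow: a split
that does not start at byte 0 skips forward past the first newline, and every
split keeps reading until it finishes the line that crosses its end offset.
This is a simulation, not Hadoop code. The names read_split and split_lines
are hypothetical, and the sketch assumes that no CSV field contains an
embedded newline inside quotes.

```python
from typing import List


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the complete lines owned by the byte range [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the line that contains (or begins at) `start`;
        # the previous split is responsible for reading it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line belongs to this split if it begins at or before `end`,
    # even when its last bytes lie beyond `end`.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


def split_lines(data: bytes, split_size: int) -> List[List[bytes]]:
    """Cut `data` into fixed-size byte ranges and read each one independently."""
    return [read_split(data, s, split_size) for s in range(0, len(data), split_size)]
```

Under these assumptions, every line is read whole by exactly one split, even
though the byte ranges themselves cut through the middle of lines. A header
row, if present, only appears in the first split, so it would need separate
handling.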

BR/anderson 
