hadoop-common-user mailing list archives

From "Wenrui Guo" <wenrui....@ericsson.com>
Subject RE: Large size Text file split
Date Thu, 11 Jun 2009 04:02:59 GMT
I don't fully understand the internal logic of FileSplit and the Mapper.

By my understanding, FileInputFormat is the class that actually takes
care of splitting the file. So it's reasonable that one large file is
split into 5 smaller parts, each part less than 2 GB (since we specify
numberOfSplit as 5).

However, since the FileSplit boundaries are rough edges, does mapper 1,
which takes split 1 as input, omit the incomplete record at the end of
split 1, while mapper 2 continues to read that incomplete part and then
adds the remaining part of split 2?

Take this as an example:

The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into
two parts:

Split 1:
1::122::5::838985046 (CRLF)
1::185::5::8
 

Split 2:
38983525 (CRLF)
1::231::5::838983392 (CRLF)

Afterwards, Mapper 1 takes split 1 as input, but after consuming the
line 1::122::5::838985046, it finds that the remaining part is not a
complete record, so Mapper 1 bypasses it; Mapper 2 then reads this
fragment and prepends it to the first line of split 2 to compose a
valid record.

Is this correct? If so, which class implements the above logic?

BR/anderson

-----Original Message-----
From: Aaron Kimball [mailto:aaron@cloudera.com] 
Sent: Thursday, June 11, 2009 11:49 AM
To: core-user@hadoop.apache.org
Subject: Re: Large size Text file split

The FileSplit boundaries are "rough" edges -- the mapper responsible for
the previous split will continue until it finds a full record, and the
next mapper will read ahead and only start on the first record boundary
after the byte offset.
- Aaron
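[Editor's note: the boundary handling Aaron describes can be sketched
with a small, self-contained simulation. This is illustrative code, not
Hadoop's actual implementation; in Hadoop the logic lives in
LineRecordReader, which TextInputFormat uses. The class and method names
below are invented for the sketch.]

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {

    // Return the records a mapper would see for the split [start, end):
    // - if start > 0, skip forward to the first newline, because that
    //   partial line belongs to the previous split's mapper;
    // - keep reading whole lines while the line START is before `end`,
    //   so the line that crosses `end` is finished by this mapper.
    static List<String> recordsForSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // skip the (possibly partial) first line
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline itself
        }
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            // trim() drops the '\r' of the CRLF terminator
            records.add(new String(data, lineStart, pos - lineStart,
                    StandardCharsets.US_ASCII).trim());
            pos++; // consume the newline; we may have read past `end`
        }
        return records;
    }

    public static void main(String[] args) {
        String file = "1::122::5::838985046\r\n"
                    + "1::185::5::838983525\r\n"
                    + "1::231::5::838983392\r\n";
        byte[] data = file.getBytes(StandardCharsets.US_ASCII);
        int mid = data.length / 2; // split point falls mid-record, as in the example
        // Mapper 1 reads past `mid` to finish record 2:
        System.out.println("Mapper 1: " + recordsForSplit(data, 0, mid));
        // Mapper 2 skips the partial line and starts at record 3:
        System.out.println("Mapper 2: " + recordsForSplit(data, mid, data.length));
    }
}
```

Running this prints two records for Mapper 1 and one for Mapper 2, so
no record is lost or duplicated even though the split point falls in
the middle of the second line.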

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo <wenrui.guo@ericsson.com>
wrote:

> I think the default TextInputFormat can meet my requirement. However,
> even though the JavaDoc of TextInputFormat says it divides the input
> file into text lines ending with CRLF, I'd like to know what will
> happen if the FileSplit size is not an exact multiple of the line
> length.
>
> BR/anderson
>
> -----Original Message-----
> From: jason hadoop [mailto:jason.hadoop@gmail.com]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per
> split. The key is the byte position of the line start in the file; the
> value is the line itself.
> The parameter mapred.line.input.format.linespermap controls the
> number of lines per split.
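[Editor's note: a minimal driver using the NLineInputFormat setup
described above might look like the following sketch. It uses the old
mapred API current when this thread was written; the job name, paths,
and lines-per-split value are hypothetical, and it needs the Hadoop
jars on the classpath.]

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NLineJobDriver.class);
        conf.setJobName("nline-example");                 // hypothetical job name
        // Each split gets at most this many input lines:
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 100000);
        // NLineInputFormat's key is the line's byte offset, value the line text:
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```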
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi < 
> harish.mallipeddi@gmail.com> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo 
> > <wenrui.guo@ericsson.com>
> > wrote:
> >
> > > Hi, all
> > >
> > > I have a large CSV file (larger than 10 GB); I'd like to use a
> > > certain InputFormat to split it into smaller parts so that each
> > > Mapper can deal with a piece of the CSV file. However, as far as I
> > > know, FileInputFormat only cares about the byte size of the file;
> > > that is, the class may divide the CSV file into many parts, and
> > > some part may not be a well-formed CSV file. For example, one line
> > > of the CSV file may not be terminated with CRLF, or some text may
> > > be trimmed.
> > >
> > > How can I ensure that each FileSplit is a smaller, valid CSV file
> > > by using a proper InputFormat?
> > >
> > > BR/anderson
> > >
> >
> > If all you care about is the splits occurring at line boundaries,
> > then TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
> >
> > If not I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
> >
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>
