hadoop-common-user mailing list archives

From Chuck Lam <chuck....@gmail.com>
Subject Re: InputFormat for fixed-width records?
Date Tue, 02 Jun 2009 04:40:23 GMT
Yes, it's entirely possible for part of one record to be in the first file
split and the rest in the second file split. It's the job of the RecordReader to
make sure it's always reading in an entire record. Given a file split, your
RecordReader has to be able to skip over the first few bytes to get to the
first full record (if there's a partial record at the beginning). When it
reaches the end of the split, if there's a partial record there, it will go
get the rest of the record from the next split.
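
For fixed-width records, the "skip over the first few bytes" step reduces to arithmetic on the split's byte offsets. The following is a minimal sketch, not Hadoop code: the class and method names are hypothetical, and it assumes that records start at offset 0 of the file and that a record belongs to the split in which its first byte falls.

```java
public final class FixedWidthSplitMath {
    private FixedWidthSplitMath() {}

    /** Offset of the first record that starts at or after splitStart. */
    public static long firstRecordStart(long splitStart, int recordLength) {
        long remainder = splitStart % recordLength;
        // remainder != 0 means the split begins in the middle of a record,
        // so skip the tail of that record; the previous split's reader owns it.
        return remainder == 0 ? splitStart : splitStart + (recordLength - remainder);
    }

    /** Number of records whose first byte lies in [splitStart, splitEnd). */
    public static long recordsOwned(long splitStart, long splitEnd, int recordLength) {
        long first = firstRecordStart(splitStart, recordLength);
        if (first >= splitEnd) {
            return 0; // the split is smaller than a record and contains no record start
        }
        // Round up: a record that starts before splitEnd is owned even if it ends after it.
        return (splitEnd - first + recordLength - 1) / recordLength;
    }
}
```

With 100-byte records and a 300-byte file cut into splits [0, 150) and [150, 300), the first split owns the records at offsets 0 and 100, and the second split skips 50 bytes and owns the record at offset 200.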

Tom's email earlier in this thread explained some of the details. Like he
said, look at LineRecordReader for inspiration. The logic for figuring out
the start of the first full record is in LineRecordReader itself. The
RecordReader can read the last record (that spans two file splits) without
any special logic because the Hadoop filesystem abstracts away file split
boundaries when reading.
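
To illustrate why the last record needs no special handling, here is a small self-contained sketch. It is not the LineRecordReader logic and uses no Hadoop classes: a byte array stands in for the whole file as seen through the filesystem, and the class name, method name, and parameters are assumptions made for the example. The reader keeps reading full records as long as a record starts before the end of its split, even when the record's last bytes lie past that end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class FixedWidthSplitReader {
    private FixedWidthSplitReader() {}

    /** Returns the records whose first byte lies in [splitStart, splitEnd). */
    public static List<String> readSplit(byte[] file, long splitStart, long splitEnd,
                                         int recordLength) {
        // Round splitStart up to the next record boundary to skip a leading partial record.
        long pos = ((splitStart + recordLength - 1) / recordLength) * recordLength;
        List<String> records = new ArrayList<>();
        // Read while a record starts inside the split; the read itself may run
        // past splitEnd, because the file view has no notion of split boundaries.
        // A trailing partial record at the end of the file is ignored.
        while (pos < splitEnd && pos + recordLength <= file.length) {
            // The int casts are fine for this in-memory demo only.
            byte[] record = Arrays.copyOfRange(file, (int) pos, (int) pos + recordLength);
            records.add(new String(record, StandardCharsets.US_ASCII));
            pos += recordLength;
        }
        return records;
    }
}
```

For a 12-byte file "AAAABBBBCCCC" with 4-byte records and splits [0, 6) and [6, 12), the first call reads "BBBB" in full even though it crosses offset 6, and the second call starts at offset 8, so every record is read exactly once.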

On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <arber.research@gmail.com> wrote:

> I have a follow-up question on this thread: How do we make sure that at the
> getFileSplit phase, there are no records that cross the boundary between
> different file splits?
> To explain my point better, for example, if each of my records is 100 bytes,
> could there be a case where some record's key is put in the
> 1st file split, while its value is put in the second split?
> Best,
> Arber
> On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <omalley@apache.org> wrote:
> > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> >
> >> I need to process a dataset that contains text records of fixed length
> >> in bytes.  For example, each record may be 100 bytes in length
> >>
> >
> > The update to the terasort example has an InputFormat that does exactly
> > that. The key is 10 bytes and the value is the next 90 bytes. It is pretty
> > easy to write, but I should upload it soon. The output types are Text, but
> > they just have the binary data in them.
> >
> > -- Owen
> >
