hadoop-common-user mailing list archives

From Yabo-Arber Xu <arber.resea...@gmail.com>
Subject Re: InputFormat for fixed-width records?
Date Tue, 02 Jun 2009 05:32:24 GMT
Thanks for your reply. It clarifies a lot. The part I was not so sure about
was how to read the last record in a split, but now it seems there is no
problem, as the filesystem handles it for me. :-)

On Tue, Jun 2, 2009 at 12:40 PM, Chuck Lam <chuck.lam@gmail.com> wrote:

> Yes, it's totally possible for part of one record to be in the first file split
> and the rest in the second file split. It's the job of the RecordReader to
> make sure it's always reading in an entire record. Given a file split, your
> RecordReader has to be able to skip over the first few bytes to get to the
> first full record (if there's a partial record at the beginning). When it
> reaches the end of the split, if there's a partial record there, it will go
> get the rest of the record from the next split.
>
> Tom's email earlier in this thread explained some of the details. Like he
> said, look at LineRecordReader for inspiration. The logic for figuring out
> the start of the first full record is in LineRecordReader itself. The
> RecordReader can read the last record (that spans two file splits) without
> any special logic because the Hadoop filesystem abstracts away file split
> boundaries when reading.
>
>
>
> On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <arber.research@gmail.com> wrote:
>
> > I have a follow-up question on this thread: How do we make sure that, at
> > the getFileSplit phase, there are no records that cross the boundary
> > between different file splits?
> >
> > To explain my point better: if each of my records is 100 bytes, could
> > there be a case where some record's key is put in the 1st file split
> > while its value is put in the second split?
> >
> > Best,
> > Arber
> >
> > On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <omalley@apache.org> wrote:
> >
> > > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> > >
> > > > I need to process a dataset that contains text records of fixed length
> > > > in bytes.  For example, each record may be 100 bytes in length
> > > >
> > >
> > > The update to the terasort example has an InputFormat that does exactly
> > > that. The key is 10 bytes and the value is the next 90 bytes. It is
> > > pretty easy to write, but I should upload it soon. The output types are
> > > Text, but they just have the binary data in them.
> > >
> > > -- Owen
> > >
> >
>
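
For anyone finding this thread later, here is a rough sketch of the boundary
rule Chuck describes, applied to fixed-width records. It is not the terasort
InputFormat or LineRecordReader code, and it does not use the Hadoop API at
all: the file is just a byte array, and the class and method names
(FixedWidthSplitSketch, firstRecordOffset, readSplit) are made up for
illustration. The assumption is that a split "owns" every record whose first
byte falls inside it, so a reader skips ahead to the next multiple of the
record length and reads past the split end to finish its last record. The
100-byte record and 10-byte key come from the messages above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedWidthSplitSketch {
    static final int RECORD_LEN = 100; // record size used in this thread
    static final int KEY_LEN = 10;     // first 10 bytes are the key, rest is the value

    /** Offset of the first record that begins at or after splitStart. */
    public static long firstRecordOffset(long splitStart, int recordLen) {
        long remainder = splitStart % recordLen;
        return remainder == 0 ? splitStart : splitStart + (recordLen - remainder);
    }

    /**
     * Returns every record that STARTS inside [splitStart, splitStart + splitLen).
     * The last record may extend past the split end; it is still read in full.
     * A trailing partial record at the end of the file is dropped (an assumption).
     */
    public static List<byte[]> readSplit(byte[] file, long splitStart, long splitLen,
                                         int recordLen) {
        List<byte[]> records = new ArrayList<>();
        long end = Math.min(splitStart + splitLen, file.length);
        long pos = firstRecordOffset(splitStart, recordLen);
        while (pos < end && pos + recordLen <= file.length) {
            records.add(Arrays.copyOfRange(file, (int) pos, (int) pos + recordLen));
            pos += recordLen;
        }
        return records;
    }

    public static byte[] key(byte[] record) {
        return Arrays.copyOfRange(record, 0, KEY_LEN);
    }

    public static byte[] value(byte[] record) {
        return Arrays.copyOfRange(record, KEY_LEN, record.length);
    }

    public static void main(String[] args) {
        // Ten 100-byte records; every byte of record i holds the value i.
        byte[] file = new byte[10 * RECORD_LEN];
        for (int i = 0; i < file.length; i++) {
            file[i] = (byte) (i / RECORD_LEN);
        }
        // 250-byte splits deliberately cut records in half.
        int total = 0;
        for (long start = 0; start < file.length; start += 250) {
            List<byte[]> recs = readSplit(file, start, 250, RECORD_LEN);
            System.out.println("split at " + start + " reads " + recs.size() + " records");
            total += recs.size();
        }
        System.out.println("total records read: " + total);
    }
}
```

With 250-byte splits over a 1000-byte file, the record at offset 200 is read
entirely by the first split, and the second split starts at offset 300, so
each of the ten records is read exactly once. In a real RecordReader the
"read past the split end" part needs no extra work, since the stream is not
cut off at the split boundary.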
