hadoop-pig-dev mailing list archives

From: "pi song" <pi.so...@gmail.com>
Subject: Re: Multiple file bag spilling
Date: Mon, 31 Mar 2008 13:24:15 GMT

Dear Alan,

At first I thought the implementation didn't look right, but after I found
this comment, everything made sense.
" * The DataBag interface assumes that all data is written before any is
 * read.  That is, a DataBag cannot be used as a queue.  If data is written
 * after data is read, the results are undefined. "

Anyway, thanks for your help,

On Sat, Mar 29, 2008 at 3:21 AM, Alan Gates <gates@yahoo-inc.com> wrote:

> I'm the one who wrote that code, so I'm the best one to explain it.
> What exactly were you wanting to know about it?
> Basically the idea is that each file is sorted (and, in the case of
> distinct, de-duplicated) as it is spilled.  Then at read time, the
> files are read back and merged via a priority queue.  In the case of
> distinct, the distinct operator has to be applied again during the
> merge, since the same entry can appear in more than one file.
> This code is complicated by the fact that while reading in spilled
> files, there may still be entries in memory.  It is also possible to
> have what was in memory (and already partially read) spilled in between
> reads.  So the iterator code has to handle merging in results from
> memory, and if we were reading from memory and got spilled, making sure
> we start reading again from the correct point in the newly spilled
> file.  This is made a little easier by the fact that data bags are
> written entirely before they are read, so there will be at most one
> spill during a read.
> Hopefully that helps as an introduction.  If you have specific
> questions I'm glad to answer them.
> Alan.
> On Mar 26, 2008, at 7:12 AM, pi song wrote:
> > Dear Ben or anyone who knows,
> >
> > Can you please explain to me how multiple-file spilling works in the
> > sorted bag/distinct bag?
> >
> > Cheers,
> > Pi
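
The scheme Alan describes (sort each file as it is spilled, merge the files
through a priority queue at read time, and apply distinct again during the
merge) can be illustrated with a minimal sketch. This is not Pig's actual
code: the class SpillMergeSketch, its Run helper, and the merge method are
hypothetical names, spilled files are modelled as already-sorted in-memory
lists of integers, and the case of a spill happening in the middle of a read
is left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch only; not Pig's SortedDataBag/DistinctDataBag code.
final class SpillMergeSketch {

    // One sorted run (standing in for a spilled file, or for the sorted
    // in-memory remainder) together with its current smallest unread entry.
    private static final class Run implements Comparable<Run> {
        final Iterator<Integer> rest;
        int head;

        Run(Iterator<Integer> it) {
            this.rest = it;
            this.head = it.next(); // caller guarantees the run is non-empty
        }

        @Override
        public int compareTo(Run other) {
            return Integer.compare(head, other.head);
        }
    }

    // Merges runs that are each already sorted. With distinct == true,
    // duplicates that span different runs are dropped during the merge.
    static List<Integer> merge(List<List<Integer>> sortedRuns, boolean distinct) {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (List<Integer> run : sortedRuns) {
            if (!run.isEmpty()) {
                queue.add(new Run(run.iterator()));
            }
        }

        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            // The run whose head is smallest supplies the next output entry.
            Run smallest = queue.poll();
            int value = smallest.head;

            // Output is produced in sorted order, so a duplicate can only
            // ever equal the most recently emitted entry.
            if (!distinct || out.isEmpty() || out.get(out.size() - 1) != value) {
                out.add(value);
            }

            // Advance the run and put it back if it still has entries.
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                queue.add(smallest);
            }
        }
        return out;
    }
}
```

Calling SpillMergeSketch.merge(runs, true) on runs that each contain the same
entry yields that entry once, which is why sorting and de-duplicating each
file at spill time is not enough on its own for the distinct case.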
