pig-user mailing list archives

From: Jonathan Coveney <jcove...@gmail.com>
Subject: Re: Can you filter and load at the same time?
Date: Wed, 01 Dec 2010 16:57:29 GMT

As always, a million thanks.

2010/12/1 Dmitriy Ryaboy <dvryaboy@gmail.com>

> 1) Pig (and Hadoop) uses bash-style globbing. You can see the details here:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> 2) Records are processed in a pipeline -- a record is read in, passed
> through all the operators in the given stage (map or reduce), and the
> output written to disk for the next stage to pick up. So if you load and
> then filter, the pipeline will be load->filter, and records will be
> discarded as they are read in, which I think is the behavior you are
> asking for.
>
> -D
>
> On Wed, Dec 1, 2010 at 7:57 AM, Jonathan Coveney <jcoveney@gmail.com> wrote:
>
> > In order to facilitate more robust loading, I have 2 questions.
> >
> > 1) I know that you can use some wildcards in loading... for example, if
> > you have 2 files, dog1.txt and dog2.txt, you can load dog*.txt and it will
> > load more. Is there any way to use regular expressions or anything more
> > powerful in the actual load? For example, if I want to load 10 different
> > files with a generally similar name structure but identically structured
> > data, what's the easiest and fastest way to load them all into the same
> > table?
> > 2) Can you filter as you load? If you do a load then a filter right after
> > that, it seems wasteful (unless pig/hadoop are smart enough to realize
> > that it doesn't have to load all the data off the bat)
> >
> > I appreciate your help
> > Jon
> >
>
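
For readers of the archive, the two answers above can be put together in a short Pig Latin sketch. This is only an illustration under stated assumptions: the input directory, the tab delimiter, the (name, age) schema, the filter condition, and the output path are all hypothetical and do not come from the thread. Only the dog1.txt / dog2.txt file names are taken from the original question.

```pig
-- Hypothetical layout: input/dog1.txt, input/dog2.txt, input/dog3.txt,
-- each tab-separated with the same (name, age) columns.

-- The LOAD path is a glob, not a regular expression. [1-3] is a character
-- range; 'input/dog*.txt' would match every file with that prefix instead.
-- The globStatus documentation linked above also lists ?, [^a] and {a,b}.
dogs = LOAD 'input/dog[1-3].txt' USING PigStorage('\t')
       AS (name:chararray, age:int);

-- A FILTER directly after the LOAD runs in the same map stage, so each
-- record is tested as it is read and non-matching records are dropped
-- without being written out for a later stage.
old_dogs = FILTER dogs BY age > 5;

-- Hypothetical output location.
STORE old_dogs INTO 'output/old_dogs';
```

Every file matched by the glob ends up in the single relation `dogs`, which is probably the closest thing to "loading them all into the same table". Note that the input files are still read in full; the filter saves the cost of carrying unwanted records further down the pipeline, not the cost of reading them.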
