pig-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Can you filter and load at the same time?
Date Wed, 01 Dec 2010 16:37:27 GMT
1) Pig (and Hadoop) use bash-style globbing. You can see the details here:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

2) Records are processed in a pipeline -- a record is read in, passed
through all the operators in the given stage (map or reduce), and the output
written to disk for the next stage to pick up. So if you load and then
filter, the pipeline will be load->filter, and records will be discarded as
they are read in, which I think is the behavior you are asking for.
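
To make both points concrete, here is a rough Python sketch, not Pig or Hadoop
code. It assumes a made-up in-memory set of files (FILES) and two hypothetical
helpers, load and keep_older_than. Python's fnmatch stands in for Hadoop's glob
matcher; the two agree on *, ? and [..] sets but differ in other details, so
treat the matching as an approximation. In Pig the same shape would be a LOAD
with a glob path followed by a FILTER.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for files on HDFS: file name -> tab-separated lines.
FILES = {
    "dog1.txt": ["rex\t3", "fido\t1"],
    "dog2.txt": ["spot\t5"],
    "cat1.txt": ["tom\t4"],
}


def load(pattern):
    """Yield one parsed record at a time from every file matching the glob."""
    for name in sorted(FILES):
        if fnmatchcase(name, pattern):
            for line in FILES[name]:
                pet, age = line.split("\t")
                yield pet, int(age)


def keep_older_than(records, min_age):
    """Drop records as they stream past; nothing is buffered in between."""
    for pet, age in records:
        if age > min_age:
            yield pet, age


def run(pattern, min_age):
    # load -> filter: each record flows through both steps before the next
    # record is read, mirroring the pipeline described above.
    return list(keep_older_than(load(pattern), min_age))


result = run("dog[12].txt", 2)
```

Because both steps are generators, a record that fails the filter is thrown
away immediately after it is read, and the full input is never held in memory.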

-D

On Wed, Dec 1, 2010 at 7:57 AM, Jonathan Coveney <jcoveney@gmail.com> wrote:

> In order to facilitate more robust loading, I have 2 questions.
>
> 1) I know that you can use some wildcards in loading... for example, if you
> have 2 files, dog1.txt and dog2.txt, you can load dog*.txt and it will load
> more. Is there any way to use regular expressions or anything more powerful
> in the actual load? For example, if I want to load 10 different files with a
> generally similar name structure but identically structured data, what's the
> easiest and fastest way to load them all into the same table?
> 2) Can you filter as you load? If you do a load then a filter right after
> that, it seems wasteful (unless pig/hadoop are smart enough to realize that
> it doesn't have to load all the data off the bat)
>
> I appreciate your help
> Jon
>
