pig-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Easy question...difference between this::form and this.form?
Date Tue, 07 Dec 2010 17:49:29 GMT
Consider self-joins, with regard to the meaningful name problem...

The sql way to deal with this issue is essentially to keep the name of the
parent relation around during parsing, and require that you explicitly
provide the desired parent if column names are ambiguous. That's probably
something that could be implemented now that we have the required metadata
in the operators (I believe it wasn't there when the disambiguation design
was implemented).

As for the difference between "::" and ".": the double-colon is just a
string with no special meaning; it's simply part of the field name. The
period is essentially a projection operator. You are saying, "the thing to
the left of the period is a tuple, and the thing to the right is a field in
that tuple". (This works for bags as well, in which case it means that the
thing to the left of the period is a bag of tuples, and the thing to the
right is a field in every tuple in the bag.)
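
To make that concrete with the script from the original question, here is a
rough sketch. The schemas in the comments are what I would expect DESCRIBE to
print, written from memory rather than copied from a run, and the alias names
grouped, flat and normed are mine, not from the original script:

```pig
data    = LOAD 'data/timeseries.tsv' AS (month:chararray, count:int);
grouped = GROUP data ALL;
-- Expected schema (assumption, not captured output):
--   grouped: {group: chararray, data: {(month: chararray, count: int)}}
-- Here "data" is a bag-valued field, so data.count is a projection:
-- it yields a bag holding the count field of every tuple in the bag.

flat = FOREACH grouped GENERATE FLATTEN(data), MAX(data.count) AS max_count;
-- Expected schema (assumption, not captured output):
--   flat: {data::month: chararray, data::count: int, max_count: int}
-- After FLATTEN the bag is gone; "data::count" is just the name of a
-- top-level field, so there is nothing left for "." to project from.

normed = FOREACH flat GENERATE data::month AS month,
         (float)data::count / (float)max_count AS normed_count;
```

Since month and count are not ambiguous in flat, the plain names should also
resolve there, in line with Alan's note quoted below that the prefix is only
required when a name is ambiguous.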

-Dmitriy.

2010/12/7 Anze <anzenews@volja.net>

>
> If one uses meaningful names then Pig would never use '::' anyway. The
> problem is when you use multiple joins in sequence, then '::' names get
> very annoying.
> But that's just my opinion. :)
>
> Anze
>
>
> On Tuesday 07 December 2010, Jonathan Coveney wrote:
> > Would that even be much better? It seems like it'd be better to have it
> > be consistent in appending the whatever::, so that at least you have to
> > be cognizant of it when you do the join. If it starts being too clever,
> > then it's up to you to figure out when it does and doesn't do it which
> > might be annoying.
> >
> > 2010/12/7 Anze <anzenews@volja.net>
> >
> > > I understand the reason for this, it just seems like a drastic
> > > solution. :)
> > >
> > > Ideally, Pig should be clever enough to detect ambiguity and deal with
> > > it, and leave the non-conflicting names intact. For instance:
> > >
> > > A = load 'foo' as (x, y, z);
> > > B = load 'bar' as (x, a, b, c);
> > > C = join A by x, B by x;
> > > DESCRIBE C;
> > > C: {A::x, y, z, B::x, a, b, c}
> > >
> > > or even:
> > > C: {x, y, z, B::x, a, b, c}
> > >
> > > or even a step further, in case of JOIN:
> > > C: {x, y, z, a, b, c}
> > > (since join *joins* by x, why would there be two? This doesn't always
> > > work for other operations, of course)
> > >
> > > Reasoning: at least in my cases the names are descriptive from the
> > > start, therefore there are almost no name conflicts. In the rare cases
> > > where there are, Pig can determine that and use the old syntax with
> > > "::", then let me deal with it.
> > >
> > > I know this is a backwards-incompatible change and is not likely to be
> > > accepted, but still... :)
> > >
> > > Anze
> > >
> > > On Monday 06 December 2010, Alan Gates wrote:
> > > > The reason it's needed is that ambiguities would result otherwise.
> > > >
> > > > A = load 'foo' as (x, y, z);
> > > > B = load 'bar' as (w, x, y, z);
> > > > C = join A by x, B by x;
> > > > D = filter C by z > 0;  -- which z?
> > > >
> > > > As long as the name is not ambiguous, the :: is not required.  So in
> > > > the above example it would be perfectly legal to say
> > > >
> > > > D = filter C by w > 0;
> > > >
> > > > Out of curiosity, why do you want to remove the :: names?
> > > >
> > > > Alan.
> > > >
> > > > On Dec 6, 2010, at 1:05 PM, Jonathan Coveney wrote:
> > > > > > Hijack away. I would be curious as to the reason we need this
> > > > > > as well.
> > > > >
> > > > > 2010/12/6 Anze <anzenews@volja.net>
> > > > >
> > > > > >> Sorry to hijack your question, Jonathan, but while we are at
> > > > > >> it... :)
> > > > > >>
> > > > > >> Is there a way to tell Pig NOT to add "base_alias::"? Almost
> > > > > >> half my code consists of FOREACH... GENERATE that just remove
> > > > > >> these prefixes.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Anze
> > > > >>
> > > > >> On Monday 06 December 2010, Daniel Dai wrote:
> > > > > >>> After join, cross, foreach flatten, Pig will automatically
> > > > > >>> add "base_alias::" prefix. All other cases use "."
> > > > >>>
> > > > >>> Daniel
> > > > >>>
> > > > >>> Jonathan Coveney wrote:
> > > > > >>>> It's very hard to search for this among the docs because
> > > > > >>>> it's so generic, so I thought I'd ask... I'm sure the answer
> > > > > >>>> is painfully easy.
> > > > >>>>
> > > > >>>> Taking a look at this code that I found online, for example
> > > > >>>>
> > > > >>>> --
> > > > > >>>> -- Read in a bag of tuples (timeseries for this example) and
> > > > > >>>> -- divide the numeric column by its maximum.
> > > > >>>> --
> > > > >>>> %default DATABAG 'data/timeseries.tsv'
> > > > >>>>
> > > > >>>> data       = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > >>>> accumulate = GROUP data ALL;
> > > > > >>>> calc_max   = FOREACH accumulate GENERATE FLATTEN(data),
> > > > > >>>>              MAX(data.count) AS max_count;
> > > > > >>>> normalize  = FOREACH calc_max GENERATE data::month AS month,
> > > > > >>>>              data::count AS count,
> > > > > >>>>              (float)data::count / (float)max_count AS normed_count;
> > > > >>>> DUMP normalize;
> > > > >>>>
> > > > >>>> What purpose does data::month serve versus data.count?
> > > > >>>>
> > > > >>>> Thanks
>
>
