pig-user mailing list archives

From Santhosh Srinivasan <...@yahoo-inc.com>
Subject RE: Easy question...difference between this::form and this.form?
Date Wed, 08 Dec 2010 20:02:58 GMT
Unambiguous column names can be accessed as is, without the :: prefix. An example that
demonstrates this follows:

grunt> a = load 'a' as (x, y, XX);
grunt> b = load 'b' as (x, y, YY);
grunt> c = load 'c' as (x,y, ZZ); 
grunt> d = join a by $0, b by $0; 
grunt> describe d;
d: {a::x: bytearray,a::y: bytearray,a::XX: bytearray,b::x: bytearray,b::y: bytearray,b::YY: bytearray}
grunt> e = join d by $0, c by $0; 
grunt> describe e;
e: {d::a::x: bytearray,d::a::y: bytearray,d::a::XX: bytearray,d::b::x: bytearray,d::b::y: bytearray,d::b::YY: bytearray,c::x: bytearray,c::y: bytearray,c::ZZ: bytearray}

grunt> f = foreach e generate XX;
------------------------------^^
grunt> describe f;
f: {d::a::XX: bytearray} 
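
For contrast with XX above, here is a minimal sketch of the ambiguous case. It reuses the loads and joins from the session and assumes the same untyped inputs 'a', 'b' and 'c'. The alias g and the rename to x are illustrative and do not come from the thread.

```pig
-- Same loads and joins as the grunt session above.
a = LOAD 'a' AS (x, y, XX);
b = LOAD 'b' AS (x, y, YY);
c = LOAD 'c' AS (x, y, ZZ);
d = JOIN a BY $0, b BY $0;
e = JOIN d BY $0, c BY $0;

-- x exists as d::a::x, d::b::x and c::x, so a bare x would be ambiguous
-- and has to be written with its prefix.
-- ZZ exists only once in e, so the bare name is enough.
g = FOREACH e GENERATE d::a::x AS x, ZZ;
DESCRIBE g;
```

The explicit AS is what gives the projected column a flat name. Without it the prefixed name is kept, as the describe output for f shows.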

-----Original Message-----
From: Anze [mailto:anzenews@volja.net] 
Sent: Wednesday, December 08, 2010 12:24 AM
To: user@pig.apache.org
Subject: Re: Easy question...difference between this::form and this.form?


I'm curious - is this a problem for others as well? Do you keep 'A::C::myId' 
or do you use FOREACH... GENERATE after each JOIN?

About possible workarounds:
Is it possible to write a UDF that would automatically strip 'X::' from the start of the
names? For instance:
C: {A::x, A::y, B::x, B::v}
C = FLATTEN_NAMES(C, 'x');
C: {x, y, v}
('x' is the name of the column on which the JOIN was made, if it is the same in A and B.)
Can something like this be done with UDFs?
(I admit it's ugly, but... ;)

Another way would be to add an argument to the JOIN (& co.), telling it to use flat names
and to fail with an error if the names are ambiguous:
C = JOIN A by x, B by x FLATTEN_NAMES;
C: {x, y, v}

Anze
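
For comparison, the FOREACH ... GENERATE route mentioned above can be sketched with plain renaming. FLATTEN_NAMES is only the hypothetical syntax proposed in this message and is not used here. The paths 'foo' and 'bar' are borrowed from examples elsewhere in the thread, and the alias C_flat is illustrative.

```pig
-- Schemas chosen to match the C: {A::x, A::y, B::x, B::v} example above.
A = LOAD 'foo' AS (x, y);
B = LOAD 'bar' AS (x, v);
C = JOIN A BY x, B BY x;

-- Keep one copy of the join key and give every column a flat name.
-- A::x needs its prefix because x exists on both sides;
-- y and v are unambiguous, so the bare names resolve.
C_flat = FOREACH C GENERATE A::x AS x, y AS y, v AS v;
DESCRIBE C_flat;
```

Renaming right after each JOIN keeps a later join on C_flat from stacking a second level of prefixes onto the old ones, at the cost of the extra statement this message complains about.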


On Wednesday 08 December 2010, Dmitriy Ryaboy wrote:
> it's sort of true -- but, iirc, only goes one level deep, so once you 
> do a second join, you are stuck with "::"s
> 
> On Tue, Dec 7, 2010 at 10:11 AM, Santhosh Srinivasan <sms@yahoo-inc.com> wrote:
> > > The sql way to deal with this issue is essentially to keep the name of
> > > the parent relation around during parsing, and require that you
> > > explicitly provide the desired parent if column names are ambiguous.
> > > That's probably something that could be implemented now that we have
> > > the required metadata in the operators (I believe it wasn't there when
> > > the disambiguation design was implemented).
> > 
> > Isn't that true today? Unambiguous columns can be referenced without
> > the :: operator.
> > 
> > Santhosh
> > 
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> > Sent: Tuesday, December 07, 2010 9:49 AM
> > To: user@pig.apache.org
> > Subject: Re: Easy question...difference between this::form and this.form?
> > 
> > Consider self-joins, with regards to the meaningful name problem...
> > 
> > The sql way to deal with this issue is essentially to keep the name 
> > of the parent relation around during parsing, and require that you 
> > explicitly provide the desired parent if column names are ambiguous.
> > That's probably something that could be implemented now that we have 
> > the required metadata in the operators (I believe it wasn't there 
> > when the disambiguation design was implemented).
> > 
> > As for the difference between "::" and ".": the double-colon is just 
> > a string with no special meaning; it's simply part of the field 
> > name. The period is essentially a projection operator -- you are 
> > saying, "the thing to the left of the period is a tuple, and the 
> > thing to the right is a field in that tuple". (works for bags as 
> > well, in which case it means, the thing to the left of the period is 
> > a bag of tuples, and the thing to the right is a field in every 
> > tuple in the bag)
> > 
> > -Dmitriy.
> > 
> > 2010/12/7 Anze <anzenews@volja.net>
> > 
> > > If one uses meaningful names then Pig would never use '::' anyway. 
> > > The problem is when you use multiple joins in sequence, then '::' 
> > > names get very annoying.
> > > But that's just my opinion. :)
> > > 
> > > Anze
> > > 
> > > On Tuesday 07 December 2010, Jonathan Coveney wrote:
> > > > Would that even be much better? It seems like it'd be better to
> > > > have it be consistent in appending the whatever::, so that at least
> > > > you have to be cognizant of it when you do the join. If it starts
> > > > being too clever, then it's up to you to figure out when it does
> > > > and doesn't do it which might be annoying.
> > > > 
> > > > 2010/12/7 Anze <anzenews@volja.net>
> > > > 
> > > > > I understand the reason for this, it just seems like a drastic
> > > > > solution. :)
> > > > > 
> > > > > Ideally, Pig should be clever enough to detect ambiguity and 
> > > > > deal with it, and leave the non-conflicting names intact. For instance:
> > > > > 
> > > > > A = load 'foo' as (x, y, z);
> > > > > B = load 'bar' as (x, a, b, c);
> > > > > C = join A by x, B by x;
> > > > > DESCRIBE C;
> > > > > C: {A::x, y, z, B::x, a, b, c}
> > > > > 
> > > > > or even:
> > > > > C: {x, y, z, B::x, a, b, c}
> > > > > 
> > > > > or even a step further, in case of JOIN:
> > > > > C: {x, y, z, a, b, c}
> > > > > (since join *joins* by x, why would there be two? This doesn't 
> > > > > always work for other operations, of course)
> > > > > 
> > > > > Reasoning: at least in my cases the names are descriptive from
> > > > > the start, therefore there are almost no name conflicts. In the
> > > > > rare cases where there are, Pig can determine that and use the
> > > > > old syntax with "::", then let me deal with it.
> > > > > 
> > > > > I know this is a backwards-incompatible change and is not likely 
> > > > > to be accepted, but still... :)
> > > > > 
> > > > > Anze
> > > > > 
> > > > > On Monday 06 December 2010, Alan Gates wrote:
> > > > > > The reason it's needed is that ambiguities would result 
> > > > > > otherwise.
> > > > > > 
> > > > > > > A = load 'foo' as (x, y, z);
> > > > > > > B = load 'bar' as (w, x, y, z);
> > > > > > > C = join A by x, B by x;
> > > > > > > D = filter C by z > 0;  -- which z?
> > > > > > 
> > > > > > As long as the name is not ambiguous, the :: is not required.
> > > > > > So in the above example it would be perfectly legal to say
> > > > > > 
> > > > > > D = filter C by w > 0;
> > > > > > 
> > > > > > Out of curiosity, why do you want to remove the :: names?
> > > > > > 
> > > > > > Alan.
> > > > > > 
> > > > > > On Dec 6, 2010, at 1:05 PM, Jonathan Coveney wrote:
> > > > > > > > Hijack away. I would be curious as to the reason we need
> > > > > > > > this as well.
> > > > > > > 
> > > > > > > 2010/12/6 Anze <anzenews@volja.net>
> > > > > > > 
> > > > > > > >> Sorry to hijack your question, Jonathan, but while we are
> > > > > > > >> at it... :)
> > > > > > >> 
> > > > > > > >> Is there a way to tell Pig NOT to add "base_alias::"?
> > > > > > > >> Almost half my code consists of FOREACH... GENERATE that
> > > > > > > >> just remove these prefixes.
> > > > > > >> 
> > > > > > >> Thanks,
> > > > > > >> 
> > > > > > >> Anze
> > > > > > >> 
> > > > > > >> On Monday 06 December 2010, Daniel Dai wrote:
> > > > > > > >>> After join, cross, foreach flatten, Pig will
> > > > > > > >>> automatically add "base_alias::" prefix. All other
> > > > > > > >>> cases use "."
> > > > > > >>> 
> > > > > > >>> Daniel
> > > > > > >>> 
> > > > > > >>> Jonathan Coveney wrote:
> > > > > > > >>>> It's very hard to search for this among the docs
> > > > > > > >>>> because it's so generic, so I thought I'd ask... I'm
> > > > > > > >>>> sure the answer is painfully easy.
> > > > > > > >>>> 
> > > > > > > >>>> Taking a look at this code that I found online, for
> > > > > > > >>>> example
> > > > > > >>>> 
> > > > > > >>>> --
> > > > > > > >>>> -- Read in a bag of tuples (timeseries for this
> > > > > > > >>>> -- example) and divide the numeric column by its maximum.
> > > > > > >>>> --
> > > > > > >>>> %default DATABAG 'data/timeseries.tsv'
> > > > > > >>>> 
> > > > > > > >>>> data       = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > > > >>>> accumulate = GROUP data ALL;
> > > > > > > >>>> calc_max   = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
> > > > > > > >>>> normalize  = FOREACH calc_max GENERATE data::month AS month, data::count AS count, (float)data::count / (float)max_count AS normed_count;
> > > > > > > >>>> DUMP normalize;
> > > > > > >>>> 
> > > > > > > >>>> What purpose does data::month serve versus data.count?
> > > > > > >>>> 
> > > > > > >>>> Thanks

