pig-dev mailing list archives

From hc busy <hc.b...@gmail.com>
Subject Re: What should FLATTEN do?
Date Fri, 02 Apr 2010 21:33:12 GMT
Okay guys, some details after some digging. We've got this version of Pig
from CDH2 installed:

hadoop-pig-0.5.0+11.1-1


The patches that they applied on top of 0.5.0 are listed here:

http://archive.cloudera.com/cdh/2/pig-0.5.0+11.1.CHANGES.txt

The patches listed there don't seem to deal with FLATTEN in any way.

Any suggestions?
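
In case it helps pin down what we're comparing, here is a minimal sketch in plain Python (not Pig, and not based on the Pig source) of the two behaviors described further down in this thread. Tuples are modeled as Python tuples and the bag as a list of tuples; the function names flatten_one_level and flatten_all_levels are made up for illustration.

```python
# Toy model of the two FLATTEN behaviors discussed in this thread.
# A Pig tuple is modeled as a Python tuple, a Pig bag as a list of tuples.
# Function names are hypothetical; this is not Pig's implementation.

def flatten_one_level(prefix, bag):
    """One output row per tuple in the bag; the tuple's own fields are
    appended to the prefix fields, nested tuples are left intact."""
    return [prefix + tup for tup in bag]


def flatten_all_levels(prefix, bag):
    """One output row per tuple in the bag; nested tuples inside each
    tuple are also unwrapped, all the way down."""
    def unwrap(value):
        if isinstance(value, tuple):
            out = ()
            for item in value:
                out += unwrap(item)
            return out
        return (value,)

    return [prefix + unwrap(tup) for tup in bag]


row_prefix = ("id", "data")
# {((1,2)), ((2,3)), ((4,5))}: each bag element is a tuple holding one tuple.
bag = [((1, 2),), ((2, 3),), ((4, 5),)]

expected = flatten_one_level(row_prefix, bag)    # rows keep (1,2) as a tuple
observed = flatten_all_levels(row_prefix, bag)   # rows end in 1, 2

for row in expected:
    print("one level: ", row)
for row in observed:
    print("all levels:", row)
```

The first function gives the rows I had expected, the second gives the rows I'm seeing on CDH2. The sketch only restates the question; it doesn't say which one Pig intends.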




On Fri, Apr 2, 2010 at 1:49 PM, hc busy <hc.busy@gmail.com> wrote:

>
> Yeah, you have to implement the outputSchema() method on the UDF in order
> to make the contents of the tuple visible. There's a nice example in the
> UDF Manual:
>
> http://hadoop.apache.org/pig/docs/r0.6.0/udf.html
>
> Search for 'package myudf' until you find it.
>
>
>
> On Fri, Apr 2, 2010 at 12:52 PM, Russell Jurney <russell.jurney@gmail.com> wrote:
>
>> Not sure if this is exactly the same, but when I've created tuples within
>> tuples in UDFs (to preserve order of pairs), from bag input, Pig has
>> allowed it - but I can't work with that data in subsequent steps.
>>
>> On Fri, Apr 2, 2010 at 12:37 PM, hc busy <hc.busy@gmail.com> wrote:
>>
>> > Yeah, I'm sure it has nested tuples. Pig doesn't natively support
>> > introducing tuples:
>> >
>> > h = foreach g generate ((x,y,z)), (x), ((((x))))
>> >
>> > doesn't work, but I have a UDF that does that (don't ask why), and
>> > I've seen it print a double pair of parens when I did a dump.
>> >
>> > Our Hadoop guys here say it's CDH2 and that the "upgrade" was just a
>> > re-installation of CDH2 ("same jars"). But certainly my script suddenly
>> > started doing weird things when it flattened that all the way through.
>> >
>> > I'd support the prior behavior as well, because that seems to match my
>> > reading of the documentation on the behavior of FLATTEN.
>> >
>> >
>> >
>> > Has anybody else had this problem with recent cloudera/pig versions?
>> >
>> >
>> > thnx!!
>> >
>> >
>> > On Fri, Apr 2, 2010 at 11:43 AM, zaki rahaman <zaki.rahaman@gmail.com>
>> > wrote:
>> >
>> > > Stupid question, but are you sure your bag has the dual sets of
>> > > parentheses? (And if I may ask, why is that the case?)
>> > >
>> > > On Fri, Apr 2, 2010 at 2:11 PM, zaki rahaman <zaki.rahaman@gmail.com>
>> > > wrote:
>> > >
>> > > > If I'm not mistaken, the output is the expected behavior. FLATTEN
>> > > > should unnest bags. I'm assuming your statement is something like
>> > > > FOREACH ... GENERATE field1, field2, FLATTEN(bag1), which would
>> > > > 'duplicate' the first two fields of a tuple for every tuple in the
>> > > > nested bag.
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Apr 2, 2010 at 2:02 PM, hc busy <hc.busy@gmail.com> wrote:
>> > > >
>> > > >> doh!!!! s/map/bag/g
>> > > >>
>> > > >> I seem to get maps and bags mixed up for some reason...
>> > > >>
>> > > >> Guys, I have a row containing a *bag*
>> > > >>
>> > > >> 'id','data', {((1,2)), ((2,3)), ((4,5))}
>> > > >>
>> > > >> What is the expected behavior when I flatten on that bag? I had
>> > > >> expected it to result in
>> > > >>
>> > > >> 'id','data', (1,2)
>> > > >> 'id','data', (2,3)
>> > > >> 'id','data', (4,5)
>> > > >>
>> > > >>
>> > > >> But it appears to me that the result of applying FLATTEN to that
>> > > >> bag is this instead:
>> > > >>
>> > > >> 'id','data', 1,2
>> > > >> 'id','data', 2,3
>> > > >> 'id','data', 4,5
>> > > >>
>> > > >>
>> > > >> The latter is returned by Cloudera's current CDH2, and I've seen
>> > > >> the former behavior on other versions of Pig.
>> > > >>
>> > > >> Which is the correct behavior by design?
>> > > >>
>> > > >> What will pig 0.6 do when it is released?
>> > > >>
>> > > >> thanks!
>> > > >> On Fri, Apr 2, 2010 at 11:29 AM, hc busy <hc.busy@gmail.com> wrote:
>> > > >>
>> > > >> > Guys, I have a row containing a map
>> > > >> >
>> > > >> > 'id','data', {((1,2)), ((2,3)), ((4,5))}
>> > > >> >
>> > > >> > What is the expected behavior when I flatten on that bag? I had
>> > > >> > expected it to result in
>> > > >> >
>> > > >> > 'id','data', (1,2)
>> > > >> > 'id','data', (2,3)
>> > > >> > 'id','data', (4,5)
>> > > >> >
>> > > >> >
>> > > >> > But it appears to me that the result of applying FLATTEN to that
>> > > >> > bag is this instead:
>> > > >> >
>> > > >> > 'id','data', 1,2
>> > > >> > 'id','data', 2,3
>> > > >> > 'id','data', 4,5
>> > > >> >
>> > > >> >
>> > > >> > The latter is returned by Cloudera's current CDH2, and I've seen
>> > > >> > the former behavior on other versions of Pig.
>> > > >> >
>> > > >> > Which is the correct behavior by design?
>> > > >> >
>> > > >> > What will pig 0.6 do when it is released?
>> > > >> >
>> > > >> > thanks!
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Zaki Rahaman
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Zaki Rahaman
>> > >
>> >
>>
>
>
