hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: possible use of Pig for OLAP
Date Tue, 20 Nov 2007 18:56:24 GMT
Exactly. You can write "STORE X" for each handle X that you want a  
result for.

The only issue is that Pig will run a separate execution job for
each STORE command.
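
For concreteness, here is a rough sketch of what that looks like, reusing
the sample.txt schema from the example quoted below. The relation names
and the output paths 'by_ip' and 'counts_by_ip' are made up for
illustration, and each STORE here would kick off its own job:

```pig
A = LOAD 'sample.txt' AS (date, time, ip, query);
-- group the log records by client IP
B = GROUP A BY ip;
-- count the records in each group
C = FOREACH B GENERATE group, COUNT(A);
-- one STORE per handle whose result should be kept (hypothetical paths)
STORE B INTO 'by_ip';
STORE C INTO 'counts_by_ip';
```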

If you don't want to pay for doing it in multiple jobs, you could  
imagine adding a "side store" function to Pig, so that it can store  
side files but keep processing the "main" program.

It's possible that this can be accomplished today via the SPLIT  
command -- anyone care to comment?

-Chris

On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:

>
> Can you just explicitly save those intermediate results?
>
>
> On 11/20/07 10:31 AM, "Andrzej Bialecki" <ab@getopt.org> wrote:
>
>> Chris Olston wrote:
>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>> operations, in the style of OLAP.
>>>
>>> Regarding your 3rd paragraph question, do you mean:
>>>
>>> a) there are several interrelated aggregation expressions that you
>>> want evaluated in just one pass over the data, or
>>> b) you do some initial aggregation, display it to the user, who can
>>> do "drill-down" operations in the GUI which require you to look up
>>> more data in the backend
>>>
>>> ?
>>>
>>> For (a), yes Pig can do that, although currently you have to encode
>>> it explicitly as a single Pig program (in future versions, we might
>>> be able to take multiple related Pig programs and execute them in a
>>> joint fashion). For (b), we don't currently have a mechanism to do
>>> that without reloading the data, although perhaps the operating
>>> system's file cache would help with that, under the covers, if the
>>> file partitions fit in memory and don't get evicted.
>>
>> Would it be possible to modify Pig (and underlying local/mapreduce
>> impl) so that if a specific syntax is used then an intermediate
>> result is also stored into a temporary file? This way, on the first
>> dump/store Pig would produce all intermediate results, then keep some
>> of them, and re-use them for subsequent operators?
>>
>> Example - let's say that ':=' means that the result should be kept
>> around until exit (or until any of previous intermediate results
>> changes):
>>
>> -- A is not persisted
>> A = load 'sample.txt' as (date, time, ip, query);
>> -- B is to be persisted in a temp file
>> B := group A by ip;
>> -- compile & execute - creates B in a temp file
>> dump B;
>> C = foreach B generate group, A.query;
>> -- this uses already existing B data from a temp file
>> dump C;
>>
>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research


