hadoop-pig-dev mailing list archives

From Utkarsh Srivastava <utka...@yahoo-inc.com>
Subject Re: possible use of Pig for OLAP
Date Tue, 20 Nov 2007 19:27:47 GMT
The current implementation of SPLIT will be no more efficient than
explicitly calling STORE.

Utkarsh
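
A minimal sketch, in Pig Latin, of the two approaches compared in this
thread. The input file, output paths, and the SPLIT condition are
hypothetical; the schema is borrowed from the example quoted further down.
The comments about job behavior only restate what is said in this thread
and are not a measured result.

```pig
-- Hypothetical input; schema borrowed from the example quoted below.
A = LOAD 'sample.txt' AS (date, time, ip, query);
B = GROUP A BY ip;
C = FOREACH B GENERATE group, COUNT(A);

-- Approach 1: one explicit STORE per handle whose result is needed.
-- Per this thread, each STORE currently runs as a separate execution job.
STORE B INTO 'out/by_ip';
STORE C INTO 'out/counts_by_ip';

-- Approach 2: SPLIT routes the tuples of A into several handles by condition.
-- Per this thread, it is currently no more efficient than explicit STOREs.
SPLIT A INTO X IF ip == '10.0.0.1', Y IF ip != '10.0.0.1';
STORE X INTO 'out/single_ip';
STORE Y INTO 'out/other_ips';
```

Either way, every handle that should produce output still needs its own
STORE; SPLIT only changes how the handles are defined.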


On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:

> Exactly. You can write "STORE X" for each handle X that you want a  
> result for.
>
> The only issue is that it will create a separate execution job for  
> each STORE command.
>
> If you don't want to pay for doing it in multiple jobs, you could  
> imagine adding a "side store" function to Pig, so that it can store  
> side files but keep processing the "main" program.
>
> It's possible that this can be accomplished today via the SPLIT  
> command -- anyone care to comment?
>
> -Chris
>
> On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:
>
>>
>> Can you just explicitly save those intermediate results?
>>
>>
>> On 11/20/07 10:31 AM, "Andrzej Bialecki" <ab@getopt.org> wrote:
>>
>>> Chris Olston wrote:
>>>> Sounds interesting. Pig is geared toward large-scale aggregation
>>>> operations, in the style of OLAP.
>>>>
>>>> Regarding your 3rd paragraph question, do you mean:
>>>>
>>>> a) there are several interrelated aggregation expressions that you want
>>>> evaluated in just one pass over the data, or
>>>> b) you do some initial aggregation, display it to the user, who can do
>>>> "drill-down" operations in the GUI which require you to look up more
>>>> data in the backend
>>>>
>>>> ?
>>>>
>>>> For (a), yes Pig can do that, although currently you have to encode it
>>>> explicitly as a single Pig program (in future versions, we might be able
>>>> to take multiple related Pig programs and execute them in a joint
>>>> fashion). For (b), we don't currently have a mechanism to do that
>>>> without reloading the data, although perhaps the operating system's file
>>>> cache would help with that, under the covers, if the file partitions fit
>>>> in memory and don't get evicted.
>>>
>>> Would it be possible to modify Pig (and underlying local/mapreduce impl)
>>> so that if a specific syntax is used then an intermediate result is also
>>> stored into a temporary file? This way, on the first dump/store Pig
>>> would produce all intermediate results, then keep some of them, and
>>> re-use them for subsequent operators?
>>>
>>> Example - let's say that ':=' means that the result should be kept
>>> around until exit (or until any of previous intermediate results changes):
>>>
>>> -- A is not persisted
>>> A = load 'sample.txt' as (date, time, ip, query);
>>> -- B is to be persisted in a temp file
>>> B := group A by ip;
>>> -- compile & execute - creates B in a temp file
>>> dump B;
>>> C = foreach B generate group, A.query;
>>> -- this uses already existing B data from a temp file
>>> dump C;
>>>
>>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>

