hadoop-pig-dev mailing list archives

From "Ted Dunning" <tdunn...@veoh.com>
Subject RE: possible use of Pig for OLAP
Date Tue, 20 Nov 2007 17:32:28 GMT

I see Pig doing the large-scale analytics and filling HBase with the aggregates,
for fast query and reporting, as an interesting option.

In the future.

The pieces of this vision are definitely not there yet.

-----Original Message-----
From: Chris Olston [mailto:olston@yahoo-inc.com]
Sent: Tue 11/20/2007 9:29 AM
To: pig-dev@incubator.apache.org
Subject: Re: possible use of Pig for OLAP

Sounds interesting. Pig is geared toward large-scale aggregation
operations, in the style of OLAP.

Regarding your 3rd paragraph question, do you mean:

a) there are several interrelated aggregation expressions that you  
want evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can  
do "drill-down" operations in the GUI which require you to look up  
more data in the backend


For (a), yes Pig can do that, although currently you have to encode  
it explicitly as a single Pig program (in future versions, we might  
be able to take multiple related Pig programs and execute them in a  
joint fashion). For (b), we don't currently have a mechanism to do  
that without reloading the data, although perhaps the operating  
system's file cache would help with that, under the covers, if the  
file partitions fit in memory and don't get evicted.
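
As an illustration of case (a), here is a minimal sketch in Python (not Pig
Latin) of several related aggregates computed in a single scan of the data.
The field names `region` and `amount` and the function `aggregate_one_pass`
are hypothetical, chosen only for this example; this is not how Pig
implements it internally.

```python
from collections import defaultdict


def aggregate_one_pass(rows):
    """Compute count, sum and max of amount per region in a single scan.

    `rows` is any iterable of (region, amount) pairs; it is read only once,
    so it could just as well be a generator streaming records from a file.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    maxima = {}
    for region, amount in rows:
        # All three aggregates are updated from the same record,
        # so no second pass over the input is needed.
        counts[region] += 1
        totals[region] += amount
        if region not in maxima or amount > maxima[region]:
            maxima[region] = amount
    return {
        region: {
            "count": counts[region],
            "sum": totals[region],
            "max": maxima[region],
        }
        for region in counts
    }


# Hypothetical sample data standing in for a large input file.
rows = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
result = aggregate_one_pass(rows)
```

Running each aggregate as its own program would mean scanning the input once
per aggregate, which is why the single combined program matters for large
files.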


On Nov 20, 2007, at 1:47 AM, Alexandru Toth wrote:

> Hi,
> I am developing an Open Source OLAP application called "Cubulus". The
> code is at http://sourceforge.net/projects/cubulus/ , a brief
> presentation material at http://cubulus.sourceforge.net/ , and an
> online demo at: http://alxtoth.webfactional.com
> It would be interesting to use Pig instead of relational databases
> as backend.
> The question is: can Pig scripts work in such a manner that the file is
> loaded only once, and then subsequent web requests process the same
> file over and over? This becomes relevant if the data file is large,
> and there is one datafile to process (or few datafiles). In fact, is
> repeated loading a problem at all :-) ?
> -Alex

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
