hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alan Gates <ga...@yahoo-inc.com>
Subject Re: SQL to PigLatin compiler
Date Wed, 14 May 2008 19:46:18 GMT
This is very exciting.  Having a SQL->PigLatin translator will bring a 
lot of value to users who want to use pig and already know SQL.

Answers inline:

Firat Kiyak wrote:
> Hi all,
>
> I have been working on the SQL to PigLatin compiler and I have a couple
> questions that you might know the answers.
>
> 1. Is there a way to obtain the list of existing relations in the Pig Server
> and their schemas? I need it for simple cases where users apply SQL join to
> multiple relations. I need to resolve which attribute goes to which
> relation.
>   
Take a look at PigServer.explain().  Even if it doesn't work as is, the 
visitor infrastructure it uses is what you'll want to use to build 
something to do what you need.
> 2. Subqueries need similar keywords to IN, EXIST, ANY, ALL... Is there any
> plan to add similar keywords to PigLatin in the future? I could convert most
> of the subqueries to joins (even if not all of them), but it would be much
> easier to add similar keywords and use those constructs to process
> subqueries. Do you have any recommendations on this? Actually, there is
> IsEmpty function in PigLatin which could be used for EXIST; however, it is
> not implemented as a conditional that I can use in foreach statement with
> ands and ors.
>   
We certainly aren't opposed to adding the functionality.   Handling 
subqueries well takes a good optimizer.  As yet, pig has no optimizer, 
though there is work going on to add one.  So I think these features 
will be a ways off.  So far we haven't had anyone asking for them.
      
> 3. Outer Joins need NULL values in the relations. PigLatin doesn't have a
> NULL type, so how should I proceed for this task? Are you planning to add
> NULL type to PigLatin in the future releases?
>   
There is work going on now in the types branch that includes, among many 
other things, adding nulls.   But doing outer joins is harder than just 
having null.  Consider the following script:

A = load 'myfile';
B = load 'myotherfile';
C = cogroup A by $0, B by $0 outer

assuming this new outer syntax is the equivalent of a right outer join 
in SQL.  When you find a record in B with no corresponding record in A, 
what fields do you append to that record and then assign null values?  
In SQL you can answer this question, because you know the schema of A.  
But in pig we may not.  In the script given above the user hasn't 
declared the schema of A.  And you can't just look at other records of A 
and guess, because pig does not require users to provide a uniform 
schema (that is, some records in A might have 2 fields, and some 3, and 
that's ok).  You could restrict it and say that outer cogroups are only 
allowed when users have declared their schema, so:

A = load 'myfile' as (a, b, c);
B = load 'myotherfile' as (d, e, f);
C = cogroup A by $0, B by $0 outer

would them be legal, and the executor would know what to do.

> 4. Relation alterations cannot be implemented since the only way to create a
> relation in PigLatin is to load a file. I need supporting constructs for
> CREATE TABLE, INSERT, DROP TABLE, ALTER TABLE statements. For example, how
> can you create an empty relation and insert rows one by one. How can you get
> rid of the table? How can you add/remove a column in a table?
>   
This is really a much larger problem.  An RDBMS has a whole set of 
structures (tables, views, indices, etc.) that pig has no concept of.  
An RDBMS also has a data catalog, which pig does not.  While hadoop 
users may not need the full blown data catalog that and RDBMS has, they 
will need some central data.  The pig team's thoughts to date have been 
to (at some future time) provide an interface that allows users to 
create connections to external data catalogs.  How that interface will 
look or deal with the differing levels of data abstraction (files vs 
tables) has not been thought through.  If I were trying to implement SQL 
now (as you are), I would forget all the DDL parts of the language 
(create, alter, drop) and say that a file = a table, so that you could 
do things like:

select a, b, c
from myfile;
> I would really appreciate if anybody could address these questions. I am
> done with most of the parts of the compiler, and I am excited to deliver it
> shortly.
>
> ---
> Firat Kiyak
> Computer Science PhD. candidate
> University of Illinois at Urbana-Champaign
>
>   

Mime
View raw message