hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <Andres.Qui...@parc.com>
Subject RE: Request for feedback on work intent for non-equijoin support
Date Wed, 01 Apr 2015 23:23:01 GMT
Dear Lefty,

Thank you very much for pointing that out and for your initial pointers. Here is the missing
link:

ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf

Regards,

Andrés

-----Original Message-----
From: Lefty Leverenz [mailto:leftyleverenz@gmail.com] 
Sent: Wednesday, April 01, 2015 12:48 AM
To: dev@hive.apache.org
Subject: Re: Request for feedback on work intent for non-equijoin support

Hello Andres, the link to your paper is missing:

In our preliminary work, which you can find here (pointer to the paper) ...


You can find general information about contributing to Hive in the
wiki:  Resources
for Contributors
<https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors>
, How to Contribute
<https://cwiki.apache.org/confluence/display/Hive/HowToContribute>.

-- Lefty

On Tue, Mar 31, 2015 at 10:42 PM, <Andres.Quiroz@parc.com> wrote:

>  Dear Hive development community members,
>
>
>
> I am interested in learning more about the current support for 
> non-equijoins in Hive and/or other Hadoop SQL engines, and in getting 
> feedback about community interest in more extensive support for such a 
> feature. I intend to work on this challenge, assuming people find it 
> compelling, and I intend to contribute results to the community. Where 
> possible, it would be great to receive feedback and engage in 
> collaborations along the way (for a bit more context, see the 
> postscript of this message).
>
>
>
> My initial goal is to support query conditions such as the following:
>
>
>
> A.x < B.y
>
> A.x in_range [B.y, B.z]
>
> distance(A.x, B.y) < D
>
>
>
> where A and B are distinct tables/files. It is my understanding that 
> current support for performing non-equijoins like those above is quite 
> limited, and where some forms are supported (like in Cloudera's 
> Impala), this support is based on doing a potentially expensive cross product join.
> Depending on the data types involved, I believe that joins with these 
> conditions can be made to be tractable (at least on the average) with 
> join algorithms that exploit properties of the data types, possibly 
> with some pre-scanning of the data.
>
>
>
> I am asking for feedback on the interest & need in the community for 
> this work, as well as any pointers to similar work. In particular, I 
> would appreciate any answers people could give on the following questions:
>
>
>
> - Is my understanding of the state of the art in Hive and similar 
> tools accurate? Are there groups currently working on similar or 
> related issues, or tools that already accomplish some or all of what I have proposed?
>
> - Is there significant value to the community in the support of such a 
> feature? In other words, are the manual workarounds necessary because 
> of the absence of non-equijoins such as these enough of a pain to 
> justify the work I propose?
>
> - Being aware that the potential pre-scanning adds to the cost of the 
> join, and that data could still blow-up in the worst case, am I 
> missing any other important considerations and tradeoffs for this problem?
>
> - What would be a good avenue to contribute this feature to the 
> community (e.g. as a standalone tool on top of Hadoop, or as a Hive 
> extension or plugin)?
>
> - What is the best way to get started in working with the community?
>
>
>
> Thanks for your attention and any info you can provide!
>
>
>
> Andres Quiroz
>
>
>
> P.S. If you are interested in some context, and why/how I am proposing 
> to do this work, please read on.
>
>
>
> I am part of a small project team at PARC working on the general 
> problems of data integration and automated ETL. We have proposed a 
> tool called HiperFuse that is designed to accept declarative, 
> high-level queries in order to produce joined (fused) data sets from 
> multiple heterogeneous raw data sources. In our preliminary work, 
> which you can find here (pointer to the paper), we designed the 
> architecture of the tool and obtained some results separately on the 
> problems of automated data cleansing, data type inference, and query 
> planning. One of the planned prototype implementations of HiperFuse 
> relies on Hadoop MR, and because the declarative language we proposed 
> was closely related to SQL, we thought that we could exploit the 
> existing work in Hive and/or other open-source tools for handling the 
> SQL part and layer our work on top of that. For example, the query 
> given in the paper could easily be expressed in SQL-like form with a 
> non-equijoin
> condition:
>
>
>
> SELECT web_access_log.ip, census.income
>
> FROM web_access_log, ip2zip, census
>
> WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
>
> AND ip2zip.zip = census.zip
>
>
>
> As you can see, the first impasse that we hit in order to bring the 
> elements together to solve this query end-to-end was the realization 
> and performance of the non-equality join in the query. The intent now 
> is to tackle this problem in a general sense and provide a solution 
> for a wide range of queries.
>
>
>
> The work I propose to do would be based on three main components 
> within
> HiperFuse:
>
>
>
> - Enhancements to the extensible data type framework in HiperFuse that 
> would categorize data types based on the properties needed to support 
> the join algorithms, in order to write join-ready domain-specific data 
> type libraries.
>
> - The join algorithms themselves, based on Hive or directly on Hadoop MR.
>
> - A query planner, which would determine the right algorithm to apply 
> and automatically schedule any necessary pre-scanning of the data.
>
>
>
Mime
View raw message