pig-dev mailing list archives

From "Aaron Klish (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
Date Mon, 03 Oct 2011 18:54:34 GMT

    [ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119496#comment-13119496 ]

Aaron Klish commented on PIG-2293:
----------------------------------

Hi Thejas,

If you look at the release notes, I commented a bit on the guidance.  Basically, though, I
don't think any meaningful percentage can be given, because the performance depends on too
many other factors.

Users will have to try both methods.

Aaron
                
> Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>             Fix For: 0.10
>
>         Attachments: PIG-2293-1.patch, PIG-2293-2.patch, PIG-2293-3.patch, PIG-2293-4.patch, e2e_test.txt, patch.txt, patch.txt
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing Pig merge join has the following limitations:
>    1. It assumes the right-hand table must be accessed sequentially, record by record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface IndexableLoadFunc.
> This 'LoadFunc' supports the ability to 'seekNear' a given key (before reading the next
> record).
> The merge join physical operator only calls 'seekNear' for the first key in each split
> (effectively eliminating splits where the first and subsequent keys will not be found).
> Subsequent matches are found by reading sequentially through the records of the right
> table, looking for matches from the left table.
> While this method works well for dense join tables, it performs poorly against large
> sparse tables or data sources that support point lookups natively (HBase, for example).
> The proposed enhancement is to add a new join type, 'merge-sparse', to Pig Latin.  When
> specified in the Pig script, this join type will cause the merge join operator to call
> seekNear on each and every key (rather than just the first in each split).
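
The difference between the two strategies described above can be illustrated with a small toy model. This is only a sketch in Python, not Pig's implementation: the class `SortedSource`, its `seek_near` and `next` methods, and the `merge_join` function are hypothetical stand-ins for a sorted right-hand table and the merge join operator, and the sketch assumes the left keys are unique and already sorted. It counts how many right-hand records each strategy reads.

```python
from bisect import bisect_left


class SortedSource:
    """Toy stand-in for a sorted right-hand table (hypothetical, not Pig's IndexableLoadFunc)."""

    def __init__(self, records):
        self.records = sorted(records)            # (key, value) pairs, sorted by key
        self.keys = [k for k, _ in self.records]
        self.pos = 0
        self.reads = 0                            # number of records actually read

    def seek_near(self, key):
        # Position the cursor at the first record whose key is >= the given key.
        self.pos = bisect_left(self.keys, key)

    def next(self):
        # Return the next record, or None when the source is exhausted.
        if self.pos >= len(self.records):
            return None
        rec = self.records[self.pos]
        self.pos += 1
        self.reads += 1
        return rec


def merge_join(left_keys, source, seek_every_key):
    """Join sorted, unique left keys against the source.

    seek_every_key=False seeks once for the first key, then scans sequentially.
    seek_every_key=True seeks before every key (the 'merge-sparse' idea).
    """
    out = []
    pending = None  # a record read past the current left key, kept for the next one
    for i, lk in enumerate(left_keys):
        if i == 0 or seek_every_key:
            source.seek_near(lk)
            pending = None
        while True:
            rec = pending if pending is not None else source.next()
            pending = None
            if rec is None:
                break                 # right side exhausted
            if rec[0] < lk:
                continue              # skip right records below the current left key
            if rec[0] == lk:
                out.append((lk, rec[1]))
                continue
            pending = rec             # rec[0] > lk: keep it for the next left key
            break
    return out


# A sparse join: three left keys against a thousand right-hand records.
right = [(k, f"v{k}") for k in range(1000)]
left = [10, 500, 990]

dense = SortedSource(right)
sparse = SortedSource(right)
dense_result = merge_join(left, dense, seek_every_key=False)
sparse_result = merge_join(left, sparse, seek_every_key=True)

print("sequential scan reads:", dense.reads)
print("seek-per-key reads:   ", sparse.reads)
```

Both calls return the same joined rows; only the number of right-hand records read differs. In a real system each seek has its own cost, which is one reason the better choice depends on the data and the data source.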

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
