pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aaron Klish (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2293) Pig should support a more efficient merge join against data sources that natively support point lookups or where the join is against large, sparse tables.
Date Tue, 27 Sep 2011 02:08:13 GMT

     [ https://issues.apache.org/jira/browse/PIG-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Aaron Klish updated PIG-2293:
-----------------------------

    Fix Version/s: 0.10
     Release Note: 
Overview
--------

 * This patch adds a new join type, 'merge-sparse' to PIG.
 * This patch adds a new 'IndexedStorage' UDF to piggybank. 

'merge-sparse' join operator
-----------------------------

The new functionality is similar to the existing 'merge' join in the following ways:
 * It expects the input to be sorted in both the left and right tables.
 * The join can be accomplished as a map-only merge of the left and right tables.
 * The right table loader must implement IndexableLoadFunc.  If it does not, the join 
will behave like a standard 'merge' join.

The new functionality is different from the existing 'merge' join in the following ways:
 * The performance characteristics of the join operations are different.
 * For every key in the left table, the join operator will seek to the corresponding key in
the right table. 
  
The existing 'merge' join implementation only seeks to the first key from each input split
of the left table.
It performs similar to a table scan with the optimization to skip records that belong to other
input splits.
This implementation performs best when the join is dense.  In this case, scanning block by
block will be 
efficient because of the likelihood of finding matches in many of the scanned blocks.

The new 'merge-sparse' join implementation performs best when the join is sparse and the right
table
implements an efficient seek operation (typically, the right table should be indexed by the
join keys).
In this case, the additional overhead of per record seeks is less than the IO & processing

overhead of a partial table scan.

The performance tradeoff between the two methods depends on a number of factors:
 * The density of matching join keys in the right table.
 * The overhead/efficiency of the seek operator implementation of the right loader.  
 * The size and number of keys in the index.

Finding the exact performance inflection point will require trial and error for each data
set
and implementation of the right loader.

IndexedStorage UDF
------------------

IndexedStorage is a form of PigStorage that supports a per record seek.  
IndexedStorage creates a separate (hidden) index file for
every data file that is written.  The format of the index file is:
| Header     |
| Index Body |
| Footer     |
The Header contains the list of record indices (field numbers) that represent index keys.
The Index Body contains a PIG Tuple for each record in the data.  
The fields of the Tuple are:
 * The index key(s) as a PIG Tuple 
 * The number of records that share this index key. 
 * Offset into the data file to read the first matching record. 
The Footer contains sequentially:
 * The smallest key(s) Tuple in the index. 
 * The largest key(s) Tuple in the index. 
 * The offset in bytes to the start of the footer. 

IndexedStorage implements IndexableLoadFunc and
can be used as the 'right table' in a PIG 'merge' or 'merge-sparse' join.

IndexedStorage does not require the data to be partitioned by index key(s).  
However, each partition (separate index) must be >locally< sorted by the index key(s)

(partitions can contain overlapping ranges of keys).

           Status: Patch Available  (was: In Progress)

E2E framework test cases to be submitted by Daniel.
                
> Pig should support a more efficient merge join against data sources that natively support
point lookups or where the join is against large, sparse tables.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2293
>                 URL: https://issues.apache.org/jira/browse/PIG-2293
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.9.0
>            Reporter: Aaron Klish
>            Assignee: Aaron Klish
>             Fix For: 0.10
>
>         Attachments: patch.txt, patch.txt
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The existing PIG merge join has the following limitations:
>    1. It assumes the right side of the table must be accessed sequentially - record by
record.
>    2. It does not perform well against large, sparse tables.
> The current implementation of the merge join introduced the interface IndexableLoadFunc.
 This 'LoadFunc'
> supports the ability to 'seekNear' a given key (before reading the next record).  
> The merge join physical operator only calls 'seekNear' for the first key in each split
(effectively eliminating splits
> where the first and subsequent keys will not be found).  Subsequent joins are found by
reading sequentially through
> the records on the right table looking for matches from the left table.
> While this method works well for dense join tables - it performs poorly against large
sparse tables or data sources that support 
> point lookups natively (HBase for example).
> The proposed enhancement is to add a new join type - 'merge-sparse' to PIG latin.  When
specified in the PIG script, this join type
> will cause the merge join operator to call seekNear on each and every key (rather than
just the first in each split).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message