hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pradeep Kamath (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data
Date Thu, 10 Sep 2009 21:10:57 GMT

    [ https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753816#action_12753816

Pradeep Kamath commented on PIG-953:

Response to previous comment:

Dmitriy:  seekNear seems ambiguous, as "near" is a generic concept that does not necessarily
imply "before or to, but not after" - which is what this method is required to do. How about
Pradeep: I had initially thought of naming this method seekBefore(). However for the case
where the key we are using to seek not being present in the right input, the loader can either
position at the biggest value before the key OR the smallest value after the key (zebra loader
for example can only do the latter). The name "seekBefore" suggests that implementations should
always seek before the key in question - hence I chose seekNear.

Dmitriy: it looks like the large diff blocks in MergeSort et al are mostly moves of code blocks,
not significant code changes, correct?
Pradeep: I am assuming you meant POMergeJoin - yes - its mostly code move to DefaultIndexableLoader

Dmitriy: Why does getAscColumns and getSortColumns make a copy of the list? Seems like we
can save some memory and cpu here.
Pradeep: Findbugs complains about passing internal members as is in getters since the caller
can then modifiy these internal members - hence the copy. This should not be a performance/memory
issue since these copies are 1) on small structures 2) in front end at compile time and not
at runtime.

Dmitriry: For that matter, why not use a map of (String)colName-> (Boolean)ascending instead
of 2 lists? One structure, plus O(1) lookup.
Pradeep: This suggestion is reasonable - I picked the current implementation since its inline
with how these things are represented today in LOSort.  Unless the user of SortInfo does lookup
using sort column names, we won't get O(1). I designed this keeping zebra's use case (which
is the only use case at this point) and the zebra store function would basically needs to
know the sort keys in order and which of them are asc/dsc. For this they would iterate over
our datastructure and require that the ordering of the keys match the primary/secondary order
of the sort keys - hence a list lends itself better for that. I debated using a List<Pair<String,Boolean>>
but thought I can avoid Pair since its a pig implementation class (if tomorrow zebra wants
to expose the same SortInfo to its users).

Dmitriy: Not sure about the use of super() in the constructor of a class that doesn't extend
anything but Object. Is there some magic that requires it?
Pradeep: This was unintended - all thanks to eclipse :) - will remove it a next iteration
of this patch

Dmitriy: In Log2PhysTranslator, why hardcode the Limit operator? There are other operators
that don't change sort order, such as filter. Perhaps add a method to Logical Operators that
indicates if they alter sort order of their inputs?
Pradeep: Pig only guarantees order with limit following order - for any other relational operator
following order there are no guarantees. Today it is true that filter or a column pruning
foreach would also preserve order but this can change if needed in the future. There explicit
code to ensure order-limit combination works by preserving order - there is no such explicit
check for other operators (keeping it open for change in the future)

Dmitriy: Even with this rewrite, this seems like an odd function. It being as odd as it is
leads to it not being used safely when you set checkEquality to false (just a few lines later)--
if obj1 is null and obj2 is not, the func returns true, you try to call a method on obj1,
and get an NPE.
Pradeep: The rewrite is more terse and is another option as against the explicit if-else I
have - more a coding style issue than correctness. The idea of having this function was to
serve as a helper for use in Classes which need to implement equals().  It is cumbersone that
every new class's equals has to do the equivalent of what you suggest:
Util.bothNull(obj1, obj2) || (Util.notNull(obj1, obj2) && obj1.equals(obj2));

This one method can be used either to only check that both objects are not null OR to do that
and additionally check equality. Not sure I understand the oddity - am I missing something?

Dmitriy: This comment has a typo (and instead of "an"): 
Pradeep: Will fix in next iteration

> Enable merge join in pig to work with loaders and store functions which can internally
index sorted data 
> ---------------------------------------------------------------------------------------------------------
>                 Key: PIG-953
>                 URL: https://issues.apache.org/jira/browse/PIG-953
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: PIG-953.patch
> Currently merge join implementation in pig includes construction of an index on sorted
data and use of that index to seek into the "right input" to efficiently perform the join
operation. Some loaders (notably the zebra loader) internally implement an index on sorted
data and can perform this seek efficiently using their index. So the use of the index needs
to be abstracted in such a way that when the loader supports indexing, pig uses it (indirectly
through the loader) and does not construct an index. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message