hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-845) PERFORMANCE: Merge Join
Date Mon, 10 Aug 2009 22:11:14 GMT

    [ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741589#action_12741589
] 

Dmitriy V. Ryaboy commented on PIG-845:
---------------------------------------

Some Comments below.
It's a big patch, so a lot of comments...

1. 
EndOfAllInput flags -- could you add comments here about what the point of this flag is? You
explain what EndOfAllInputSetter does (which is actually rather self-explanatory) but not
what the meaning of the flag is and how it's used. There is a bit of an explanation in PigMapBase,
but it really belongs here.

2.
Could you explain the relationship between EndOfAllInput and (deleted) POStream?

3.
Comments in MRCompiler alternate between referring to the left MROp as LeftMROper and curMROper.
Choose one.

4.
I am curious about the decision to throw compiler exceptions if MergeJoin requirements re
number of inputs, etc, aren't satisfied. It seems like a better user experience would be to
log a warning and fall back to a regular join.

5.
Style notes for visitMergeJoin: 

It's a 200-line method. Any way you can break it up into smaller components? As is, it's hard
to follow.

The if statements should be broken up into multiple lines to agree with the style guides.

Variable naming: you've got topPrj, prj, pkg, lr, ce, nig.. one at a time they are fine, but
together in a 200-line method they are undreadable. Please consider more descriptive names.

6.
Kind of a global comment, since it applies to more than just MergeJoin:

It seems to me like we need a Builder for operators to clean up some of the new, set, set,
set stuff.

Having the setters return this and a Plan's add() method return the plan, would let us replace
this:

POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope)));
topPrj.setColumn(1);
topPrj.setResultType(DataType.TUPLE);
topPrj.setOverloaded(true);
rightMROpr.reducePlan.add(topPrj);
rightMROpr.reducePlan.connect(pkg, topPrj);

with this:

POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope)))
	.setColumn(1).setResultType(DataType.TUPLE)
	.setOverloaded(true);

rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj)


7.
Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or
just rolled in?

8. MergeJoin

break getNext() into components.

I don't see you supporting Left outer joins. Plans for that? At least document the planned
approach.

Error codes being declared deep inside classes, and documented on the wiki, is a poor practice,
imo. They should be pulled out into PigErrors (as lightweight final objects that have an error
code, a name, and a description..) I thought Santhosh made progress on this already, no?

Could you explain the problem with splits and streams? Why can't this work for them?


9. Sampler/Indexer:
9a
Looks like you create the same number of map tasks for this as you do for a join; all a sampling
map task does is read one record and emit a single tuple.  That seems wasteful; there is a
lot of overhead in setting up these tiny jobs which might get stuck behind other jobs running
on the cluster, etc. If the underlying file has syncpoints, a smaller number of MR tasks can
be created. If we know the ratio of sample tasks to "full" tasks, we can figure out how many
records we should emit per job ( ceil(full_tasks/sample_tasks) ).  We can approximately achieve
this through seeking trough (end-offset)/num_to_emit and doing a sync() after that seek. It's
approximate, but close enough for an index.

9b
Consider renaming to something like SortedFileIndexer, since it's coneivable that this component
can be reused in a context other than a Merge Join.

10.
Would it make sense to expose this to the users via a 'CREATE INDEX' (or similar) command?
That way the index could be persisted, and the user could tell you to use an existing index
instead of rescanning the data.

11.
I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked
this? Seems like you'd wind up reading the whole file in the sample job if the filter is selective
enough (and high filter selectivity would also make materialize->sample go much faster).

Testing: 
12a
You should test for refusal to do 3-way join and other error condition (or a warning and successful
failover to regular join -- my preference)

12b
You should do a proper unit test for the MergeJoinIndexer (or whatever we are calling it).



> PERFORMANCE: Merge Join
> -----------------------
>
>                 Key: PIG-845
>                 URL: https://issues.apache.org/jira/browse/PIG-845
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Ashutosh Chauhan
>         Attachments: merge-join-1.patch, merge-join-for-review.patch
>
>
> Thsi join would work if the data for both tables is sorted on the join key.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message