hadoop-pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
Date Mon, 09 Feb 2009 12:35:06 GMT

     [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-545:
-----------------------------------------------

    Attachment: WRP1.patch

Ran some tests, and this quantiles scheme seems to have the least deviation from a perfect
distribution. Also, the time taken for L10 has reduced: it now takes 8 mins vs 7 mins for the
old code, but it produces a good distribution, as shown below. The patch also modifies
MRCompiler to fix sorting on multiple fields with a different order for each column.
New algorithm:
{noformat}
/part-00000<r 3>	396866140
/part-00001<r 3>	388565356
/part-00002<r 3>	412419093
/part-00003<r 3>	404673062
/part-00004<r 3>	407805613
/part-00005<r 3>	399685590
/part-00006<r 3>	374470156
/part-00007<r 3>	407210410
/part-00008<r 3>	392022575
/part-00009<r 3>	403592598
/part-00010<r 3>	407005509
/part-00011<r 3>	392739807
/part-00012<r 3>	407132246
/part-00013<r 3>	393974442
/part-00014<r 3>	394310422
/part-00015<r 3>	397676923
/part-00016<r 3>	408960794
/part-00017<r 3>	407120924
/part-00018<r 3>	398555578
/part-00019<r 3>	398831802
/part-00020<r 3>	381319493
/part-00021<r 3>	397961816
/part-00022<r 3>	408716378
/part-00023<r 3>	401850651
/part-00024<r 3>	394624621
/part-00025<r 3>	411533286
/part-00026<r 3>	397598333
/part-00027<r 3>	402013011
/part-00028<r 3>	412664722
/part-00029<r 3>	390615865
/part-00030<r 3>	402257701
/part-00031<r 3>	404278892
/part-00032<r 3>	408376085
/part-00033<r 3>	403230193
/part-00034<r 3>	396062725
/part-00035<r 3>	403166437
/part-00036<r 3>	396123295
/part-00037<r 3>	400208557
/part-00038<r 3>	396028297
/part-00039<r 3>	428541846
{noformat}
Old algorithm:
{noformat}
/part-00000<r 3>	39703
/part-00001<r 3>	396917259
/part-00002<r 3>	388958263
/part-00003<r 3>	412109839
/part-00004<r 3>	405626251
/part-00005<r 3>	411808194
/part-00006<r 3>	385084639
/part-00007<r 3>	618796205
/part-00008<r 3>	59754649
/part-00009<r 3>	506719655
/part-00010<r 3>	403039137
/part-00011<r 3>	406540458
/part-00012<r 3>	395629722
/part-00013<r 3>	404795418
/part-00014<r 3>	394881722
/part-00015<r 3>	393959841
/part-00016<r 3>	398194260
/part-00017<r 3>	408370148
/part-00018<r 3>	334248039
/part-00019<r 3>	260118680
/part-00020<r 3>	642453106
/part-00021<r 3>	383168594
/part-00022<r 3>	364791108
/part-00023<r 3>	408601454
/part-00024<r 3>	404588449
/part-00025<r 3>	392940424
/part-00026<r 3>	413354408
/part-00027<r 3>	412538285
/part-00028<r 3>	385894942
/part-00029<r 3>	412674723
/part-00030<r 3>	392572446
/part-00031<r 3>	403012671
/part-00032<r 3>	398679596
/part-00033<r 3>	410864380
/part-00034<r 3>	405389743
/part-00035<r 3>	397248129
/part-00036<r 3>	401438264
/part-00037<r 3>	396456821
/part-00038<r 3>	402122621
/part-00039<r 3>	816408998
{noformat}
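
One way to put a number on "deviation from perfect distribution" for listings like the two above is to compare each part file's size with the mean size. The following is only a minimal sketch, not part of the patch: the helper names `parse_sizes` and `skew_summary` are made up for illustration, and it assumes each listing line ends with the byte count as its last whitespace-separated field.

```python
import statistics


def parse_sizes(listing_lines):
    """Return the byte count (last whitespace-separated field) of each non-empty line."""
    sizes = []
    for line in listing_lines:
        line = line.strip()
        if not line:
            continue
        # e.g. "/part-00000<r 3>\t39703" -> 39703
        sizes.append(int(line.rsplit(None, 1)[-1]))
    return sizes


def skew_summary(sizes):
    """Summarise how far the partition sizes are from a perfectly even split."""
    mean = statistics.fmean(sizes)
    return {
        "min_over_mean": min(sizes) / mean,      # 1.0 for a perfect split
        "max_over_mean": max(sizes) / mean,      # 1.0 for a perfect split
        "cv": statistics.pstdev(sizes) / mean,   # 0.0 for a perfect split
    }


if __name__ == "__main__":
    # Three lines copied from the old-algorithm listing, only to show the
    # input format; pass all 40 lines to summarise a full run.
    excerpt = [
        "/part-00000<r 3>\t39703",
        "/part-00001<r 3>\t396917259",
        "/part-00039<r 3>\t816408998",
    ]
    print(skew_summary(parse_sizes(excerpt)))
```

Running both full listings through `skew_summary` gives a like-for-like comparison; the slowest reducer is roughly governed by `max_over_mean`, so that ratio is probably the most useful single figure.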

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: WRP.patch, WRP1.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has
> skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting
> a better distribution should lead to much better performance for order by.
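
For readers unfamiliar with the idea, quantile-based range partitioning for a sort can be sketched as below. This is a generic illustration under stated assumptions, not the code in WRP1.patch: the function names `quantile_boundaries` and `partition_for` are hypothetical, and the sample is assumed to be a plain in-memory list of comparable keys.

```python
import bisect
import random


def quantile_boundaries(sample_keys, num_partitions):
    """Pick num_partitions - 1 cut points at evenly spaced ranks of the sorted sample."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    n = len(ordered)
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Map a key to a reducer index; keys below the first cut point go to reducer 0."""
    return bisect.bisect_right(boundaries, key)


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed synthetic keys (made up for this demo, not data from the issue).
    keys = [int(rng.expovariate(0.01)) for _ in range(100_000)]
    sample = rng.sample(keys, 1_000)
    cuts = quantile_boundaries(sample, 8)
    counts = [0] * 8
    for k in keys:
        counts[partition_for(k, cuts)] += 1
    print(counts)
```

Because every occurrence of a key maps to the same reducer, this simple version cannot split a single very frequent key across reducers; handling that case needs something beyond plain quantile cut points.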

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

