hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag
Date Wed, 09 Jan 2008 22:27:33 GMT

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557457#action_12557457 ]

Alan Gates commented on PIG-30:
-------------------------------

Some performance numbers based on the code before and after these changes.  I tested default
bags (that is, no sorting, no distinct), distinct bags, and sorted bags.  Each test was run
on the code pre- and post-patch, on data with 100k rows, 1m rows, and 5m rows.

Default:

pig script:

a = load './studenttab5m';
b = group a all;
c = foreach b generate group, COUNT(a.$0);
dump c;

Results:

                pre-patch    post-patch
   100k rows       13.539        15.489
   1m rows         43.002        48.191
   5m rows        111.158       117.112

Notes:  I'm assuming the slight slowdown here is due to the introduction of locking into add()
and next() in the data bags.
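
For reference, here is a minimal sketch (not the actual Pig code; the LockedBag name is made up)
of the kind of per-call synchronization that would add a small constant cost to every add() and
next() even with no contention, for instance if a background spill thread and the main thread
can both touch the same bag:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a bag whose add() and iterator next() take the bag's
// monitor on every call.  The locking itself, not contention, is the kind of
// overhead that could explain the small slowdown in the default-bag numbers.
public class LockedBag<T> implements Iterable<T> {
    private final List<T> contents = new ArrayList<T>();

    public void add(T t) {
        synchronized (this) {           // lock acquired on every insert
            contents.add(t);
        }
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                synchronized (LockedBag.this) {
                    return pos < contents.size();
                }
            }

            @Override
            public T next() {
                synchronized (LockedBag.this) {   // lock acquired on every read
                    return contents.get(pos++);
                }
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}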

Distinct:

pig script:

a = load './studenttab10m';
b = group a all;
c = foreach b { c1 = distinct $1; generate group, COUNT(c1); }
dump c;

Results:

                pre-patch    post-patch
   100k rows       14.927        14.134
   1m rows         83.190        52.320
   5m rows        744.834       216.043

Notes:  Data had about 90% distinct values, so 100k had about 90k distinct rows, etc.
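
That works out to roughly a 3.4x improvement at 5m rows (744.834 vs. 216.043).  As a rough
illustration of why the fraction of distinct values matters, here is a minimal sketch, assuming
(this is not the Pig implementation, and the class name is invented) that a distinct bag simply
routes every add() through a TreeSet, so duplicates are dropped on insert and the remaining cost
scales with the ~90% of rows that survive deduplication:

import java.util.Iterator;
import java.util.TreeSet;

// Illustrative only: a "distinct" bag that discards duplicates as they are
// added.  Memory use and comparison work scale with the number of distinct
// tuples, so a data set that is ~90% distinct keeps almost all of its rows.
public class DistinctBagSketch<T extends Comparable<T>> implements Iterable<T> {
    private final TreeSet<T> contents = new TreeSet<T>();

    public void add(T t) {
        contents.add(t);            // duplicates are silently dropped here
    }

    public long size() {
        return contents.size();     // count of distinct values seen so far
    }

    @Override
    public Iterator<T> iterator() {
        return contents.iterator(); // distinct values, in sorted order
    }
}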

Sorted:

pig script:

a = load './studenttab5m';
b = group a all;
c = foreach b { c1 = order $1 by $0; generate group, COUNT(c1); }
dump c;

Results:

                pre-patch    post-patch
   100k rows       16.964        12.895
   1m rows         51.351        51.598
   5m rows        236.669       225.688
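
For completeness, a comparable sketch of a sorted bag, under the assumption (again invented for
illustration, not taken from the patch) that tuples are appended unsorted and the sort is
deferred until the first iteration; sorting once per bag instead of maintaining order on every
add() is one way a rewrite could recover time at the larger data sizes:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a sorted bag with cheap appends and a single lazy sort
// performed the first time an iterator is requested.
public class SortedBagSketch<T> implements Iterable<T> {
    private final List<T> contents = new ArrayList<T>();
    private final Comparator<T> comparator;
    private boolean sorted = true;    // an empty bag is trivially sorted

    public SortedBagSketch(Comparator<T> comparator) {
        this.comparator = comparator;
    }

    public void add(T t) {
        contents.add(t);              // O(1) append, no ordering work here
        sorted = false;
    }

    @Override
    public Iterator<T> iterator() {
        if (!sorted) {
            Collections.sort(contents, comparator);  // one sort per bag
            sorted = true;
        }
        return contents.iterator();
    }
}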

> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: addhashcode.patch, bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use BigDataBag. I think
> we already do this. The problem is that the logic in BigDataBag is hard to follow and it is
> made more complicated because it subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

