hadoop-pig-dev mailing list archives

From "Utkarsh Srivastava (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag
Date Sat, 05 Jan 2008 04:44:34 GMT

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556172#action_12556172 ]

Utkarsh Srivastava commented on PIG-30:
---------------------------------------

Great job! This was a fairly large chunk of work.

It would be nice to have a few more comments. In particular, one thing that is currently only
implicit is that bag behavior is undefined if you add() to a DataBag after opening an iterator().
Alan and I talked about this.
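
For comparison, here is a minimal sketch of how the standard java.util collections handle the same situation: they fail fast rather than leave the behavior undefined. This is not code from the patch; an ArrayList stands in for the bag, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

class AddAfterIterator {
    // Returns true if an iterator opened before an add() fails fast when advanced.
    static boolean failsAfterAdd() {
        List<String> bag = new ArrayList<>(); // stand-in for a bag, not Pig's DataBag
        bag.add("a");
        bag.add("b");
        Iterator<String> it = bag.iterator();
        it.next();
        bag.add("c"); // structural modification after the iterator was opened
        try {
            it.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsAfterAdd()); // true
    }
}
```

Whether the bag should fail like this or just document the restriction is a design choice; either way the contract is worth stating in a comment.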

Other issues:

0. A TreeSet is used in DistinctBag while merging files, but TContainer compares based only on
tuple equality. Once you add a tuple equal to one already in the TreeSet but from another
input, one of the inputs will be eliminated from the TreeSet and never be read again. Am
I missing something?

1. HashSet<> in DistinctBag. For a HashSet to work properly, the hashCode() methods need
to work properly. Since Tuple.hashCode() calls hashCode() on all its fields, all Datums should
have a hash code. DataBag doesn't have one, which implies that DistinctBag won't work with nested
data.

2. The spill() code in DistinctBag and SortedBag is the same, except that the former always uses
the default comparator whereas SortedBag might use a specified comparator. Can we reuse the code
instead of duplicating it?
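
Issues 0 and 1 can be reproduced with plain java.util collections. The sketch below is not the patch's code: strings stand in for tuples, and Head, NoHashBag, HashedBag, merge and distinctCount are hypothetical names chosen for illustration. The tie-break on the input index is one possible fix, assumed here rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DistinctMergeSketch {
    // Hypothetical stand-in for TContainer: the current tuple plus the input it was read from.
    static final class Head {
        final String tuple;
        final int input;
        Head(String tuple, int input) { this.tuple = tuple; this.input = input; }
    }

    // Issue 0: ordering by tuple only, so equal tuples from different inputs collide.
    static final Comparator<Head> BY_TUPLE = Comparator.comparing((Head h) -> h.tuple);
    // Possible fix (an assumption): break ties on the input index.
    static final Comparator<Head> BY_TUPLE_THEN_INPUT = BY_TUPLE.thenComparingInt(h -> h.input);

    // Merges sorted inputs through a TreeSet of heads; returns the distinct tuples in order.
    static List<String> merge(List<List<String>> inputs, Comparator<Head> order) {
        List<Iterator<String>> its = new ArrayList<>();
        TreeSet<Head> heads = new TreeSet<>(order);
        for (int i = 0; i < inputs.size(); i++) {
            Iterator<String> it = inputs.get(i).iterator();
            its.add(it);
            // add() returns false on a tie, silently dropping this input from the merge
            if (it.hasNext()) heads.add(new Head(it.next(), i));
        }
        List<String> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head h = heads.pollFirst();
            if (out.isEmpty() || !out.get(out.size() - 1).equals(h.tuple)) out.add(h.tuple);
            Iterator<String> it = its.get(h.input);
            if (it.hasNext()) heads.add(new Head(it.next(), h.input));
        }
        return out;
    }

    // Issue 1: equals() is overridden but hashCode() is not.
    static class NoHashBag {
        final List<String> items;
        NoHashBag(String... items) { this.items = Arrays.asList(items); }
        @Override public boolean equals(Object o) {
            return o instanceof NoHashBag && items.equals(((NoHashBag) o).items);
        }
    }

    // Same bag with a content-based hashCode(), consistent with equals().
    static final class HashedBag extends NoHashBag {
        HashedBag(String... items) { super(items); }
        @Override public int hashCode() { return items.hashCode(); }
    }

    // Number of elements a HashSet keeps after adding both objects.
    static int distinctCount(Object a, Object b) {
        Set<Object> s = new HashSet<>();
        s.add(a);
        s.add(b);
        return s.size();
    }

    public static void main(String[] args) {
        List<List<String>> inputs =
            Arrays.asList(Arrays.asList("a", "c"), Arrays.asList("a", "b"));
        System.out.println(merge(inputs, BY_TUPLE));            // [a, c]  ("b" is never read)
        System.out.println(merge(inputs, BY_TUPLE_THEN_INPUT)); // [a, b, c]
        // Usually 2: equal bags almost always get different identity hash codes.
        System.out.println(distinctCount(new NoHashBag("x"), new NoHashBag("x")));
        System.out.println(distinctCount(new HashedBag("x"), new HashedBag("x"))); // 1
    }
}
```

With the tuple-only comparator the second input is dropped as soon as its head ties with the first input's head, so "b" is lost. The hashCode() case is less deterministic, which is part of why it is easy to miss in testing.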



> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use BigDataBag. I think
> we already do this. The problem is that the logic in BigDataBag is hard to follow and it is
> made more complicated because it subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

