hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag
Date Mon, 07 Jan 2008 16:30:34 GMT

    [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616 ]

Alan Gates commented on PIG-30:
-------------------------------

Responses to Utkarsh's comments:

0.  TreeSet.add() adds an element only if an equal element is not already present (see
http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)). The element already
in the tree is therefore never overwritten. That is why, when the call returns false, the
code goes back and reads again from the file that supplied the last element. It keeps
reading from that file until either the file is exhausted or it finds a new unique element
to put in the TreeSet.

1.  Good catch. I'll add a hashCode() implementation for DataBag.

2.  They aren't quite as combinable as they first appear. The code in next() is identical
and could be combined. DistinctDataBag.readFromTree() and SortedDataBag.readFromPriorityQ()
create different containers and access them differently. I could keep just the create and
access methods in each class and combine the rest of the logic. The addToQueue() functions
also differ, each with its own logic for adding an element to the queue. I can work on
this, but it may be a while before I get to it.
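
To make points 0 and 2 concrete, here is a minimal sketch. It is not the code from bagrewrite.patch: every class and method name below is hypothetical, in-memory lists stand in for the spill files, and each run is assumed to be sorted (and, for the distinct case, free of internal duplicates). The shared merge loop lives in a base class, and each subclass supplies only its container. The TreeSet variant relies on add() returning false for a duplicate, which makes the loop read again from the same run.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical base class: holds the merge logic shared by both variants.
// Instances are single-use because the subclass container keeps state.
abstract class MergeSketch {
    // A value plus the index of the run it came from. Ordering looks at the
    // value only, so equal values from different runs compare as equal.
    static final class Entry implements Comparable<Entry> {
        final String value;
        final int source;
        Entry(String value, int source) { this.value = value; this.source = source; }
        @Override public int compareTo(Entry other) { return value.compareTo(other.value); }
    }

    // Container-specific pieces (assumed split, per point 2).
    abstract boolean offer(Entry e);  // false if the container rejected the entry
    abstract Entry pollSmallest();    // null once the container is empty

    // Shared logic: seed one entry per run, then emit the smallest entry and
    // refill from the run that supplied it.
    final List<String> merge(List<List<String>> sortedRuns) {
        List<Iterator<String>> runs = new ArrayList<>();
        for (List<String> run : sortedRuns) runs.add(run.iterator());
        for (int i = 0; i < runs.size(); i++) refill(runs.get(i), i);
        List<String> out = new ArrayList<>();
        for (Entry e = pollSmallest(); e != null; e = pollSmallest()) {
            out.add(e.value);
            refill(runs.get(e.source), e.source);
        }
        return out;
    }

    // Keep reading from the same run until an entry is accepted or the run
    // is exhausted (the reread behaviour described in point 0).
    private void refill(Iterator<String> run, int source) {
        while (run.hasNext()) {
            if (offer(new Entry(run.next(), source))) return;
        }
    }
}

// Distinct variant: TreeSet.add() returns false for a duplicate and leaves
// the element already in the set untouched.
final class DistinctMerge extends MergeSketch {
    private final TreeSet<Entry> tree = new TreeSet<>();
    @Override boolean offer(Entry e) { return tree.add(e); }
    @Override Entry pollSmallest() { return tree.pollFirst(); }
}

// Sorted variant: PriorityQueue accepts duplicates, so offer() never fails.
final class SortedMerge extends MergeSketch {
    private final PriorityQueue<Entry> queue = new PriorityQueue<>();
    @Override boolean offer(Entry e) { return queue.add(e); }
    @Override Entry pollSmallest() { return queue.poll(); }
}
```

Calling new DistinctMerge().merge(runs) yields each value once, while new SortedMerge().merge(runs) keeps the duplicates. The real classes may divide the work differently, since the addToQueue() logic differs between them.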
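
On point 1, the reason hashCode() matters is the usual Java contract: objects that are equal according to equals() must return the same hash code, or hash-based collections will misplace them. The sketch below uses a hypothetical SimpleBag class, not the real DataBag, and assumes equality is defined by the items in insertion order; how DataBag actually defines equality is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bag; not the real DataBag.
final class SimpleBag {
    private final List<String> items = new ArrayList<>();

    void add(String item) { items.add(item); }

    // Assumption: two bags are equal when they hold equal items in the same order.
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleBag)) return false;
        return items.equals(((SimpleBag) o).items);
    }

    // Derived from the same field that equals() compares, so equal bags
    // always produce equal hash codes.
    @Override public int hashCode() { return items.hashCode(); }
}
```

With this in place, two SimpleBag instances holding the same items land in the same bucket of a HashSet or HashMap. Mutating a bag after it has been placed in such a collection would still break lookups.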

> Get rid of DataBag and always use BigDataBag
> --------------------------------------------
>
>                 Key: PIG-30
>                 URL: https://issues.apache.org/jira/browse/PIG-30
>             Project: Pig
>          Issue Type: Bug
>          Components: data
>            Reporter: Benjamin Reed
>            Assignee: Alan Gates
>         Attachments: bagrewrite.patch
>
>
> We should never use DataBag directly; instead, we should always use BigDataBag. I think
> we already do this. The problem is that the logic in BigDataBag is hard to follow and it is
> made more complicated because it subclasses DataBag. We should merge these two classes together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

