hadoop-pig-dev mailing list archives

From Utkarsh Srivastava <utka...@yahoo-inc.com>
Subject Re: [jira] Commented: (PIG-30) Get rid of DataBag and always use BigDataBag
Date Mon, 07 Jan 2008 17:47:37 GMT
Ok, all sounds good.

On Jan 7, 2008, at 8:30 AM, Alan Gates (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/PIG-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556616#action_12556616 ]
>
> Alan Gates commented on PIG-30:
> -------------------------------
>
> Responses to Utkarsh's comments:
>
> 0.  TreeSet.add() only adds an element if it is not already present
> (see http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html#add(E)).
> This guarantees that the element already in the tree will not be
> obliterated.  That's why, if that call returns false, the code goes
> back and reads again from the same file that the rejected element
> came from.  This guarantees that we keep reading from that file
> until either the file is empty or we find a new unique element to
> put in the TreeSet.
>
> 1.  Good catch, I'll add a hashCode() implementation for DataBag.
>
> 2.  They aren't quite as combinable as they first appear.  The code
> in next() is identical, and could be combined.
> DistinctDataBag.readFromTree() and
> SortedDataBag.readFromPriorityQ() create different containers and
> access them differently.  I could put just the create and access
> methods in each and combine the rest of the logic.  The addToQueue()
> functions in each are different and have different logic about how
> to add an element to the queue.  I can work on this, but it may be
> a bit before I get to it.
>
>> Get rid of DataBag and always use BigDataBag
>> --------------------------------------------
>>
>>                 Key: PIG-30
>>                 URL: https://issues.apache.org/jira/browse/PIG-30
>>             Project: Pig
>>          Issue Type: Bug
>>          Components: data
>>            Reporter: Benjamin Reed
>>            Assignee: Alan Gates
>>         Attachments: bagrewrite.patch
>>
>>
>> We should never use DataBag directly; instead, we should always  
>> use BigDataBag. I think we already do this. The problem is that  
>> the logic in BigDataBag is hard to follow and it is made more  
>> complicated because it subclasses DataBag. We should merge these  
>> two classes together.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
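
For readers following along, the re-read behaviour described in response 0 can be sketched roughly as below. This is only an illustration, not the code in bagrewrite.patch: the class DistinctMergeSketch, the Entry helper, and the refill() and mergeDistinct() methods are all hypothetical names, plain String iterators stand in for the spill files, and each source is assumed to be sorted and free of duplicates within itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class DistinctMergeSketch {

    /** A value plus the index of the source it came from (hypothetical helper). */
    static final class Entry implements Comparable<Entry> {
        final String value;
        final int source;

        Entry(String value, int source) {
            this.value = value;
            this.source = source;
        }

        // Ordering ignores the source, so equal values from different
        // sources count as the same element for the TreeSet.
        @Override
        public int compareTo(Entry other) {
            return value.compareTo(other.value);
        }
    }

    /** Keeps reading one source until add() accepts an element or the source is empty. */
    static void refill(TreeSet<Entry> tree, List<Iterator<String>> sources, int source) {
        Iterator<String> it = sources.get(source);
        while (it.hasNext()) {
            // TreeSet.add() returns false and leaves the set unchanged
            // when an equal element is already present.
            if (tree.add(new Entry(it.next(), source))) {
                return;
            }
        }
    }

    /** Merges sorted, internally duplicate-free sources into one distinct sorted list. */
    static List<String> mergeDistinct(List<Iterator<String>> sources) {
        TreeSet<Entry> tree = new TreeSet<>();
        for (int i = 0; i < sources.size(); i++) {
            refill(tree, sources, i);
        }
        List<String> out = new ArrayList<>();
        while (!tree.isEmpty()) {
            Entry smallest = tree.pollFirst();
            out.add(smallest.value);
            // Replace the element just emitted with the next one from the same source.
            refill(tree, sources, smallest.source);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<String>> sources = new ArrayList<>();
        sources.add(Arrays.asList("apple", "cherry", "fig").iterator());
        sources.add(Arrays.asList("apple", "banana", "cherry").iterator());
        sources.add(Arrays.asList("banana", "grape").iterator());
        // prints [apple, banana, cherry, fig, grape]
        System.out.println(mergeDistinct(sources));
    }
}
```

The point of looping inside refill() is that every source that still has data always has exactly one element in the tree, so the smallest element in the tree is always the next distinct element overall.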

