hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (PIG-342) Size of DistinctDataBag is calculated incorrectly if spill occurs and non-distinct elements are inserted
Date Thu, 12 Nov 2009 04:19:39 GMT

     [ https://issues.apache.org/jira/browse/PIG-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-342:

    Assignee: Brandon Dimcheff

> Size of DistinctDataBag is calculated incorrectly if spill occurs and non-distinct elements are inserted
> --------------------------------------------------------------------------------------------------------
>                 Key: PIG-342
>                 URL: https://issues.apache.org/jira/browse/PIG-342
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.1.0
>            Reporter: Brandon Dimcheff
>            Assignee: Brandon Dimcheff
>             Fix For: 0.1.0
>         Attachments: size.patch
> If a spill occurs while elements are being inserted into a DistinctDataBag,
> it's possible that non-unique items will be added to the in-memory data
> structure, and the mSize counter will be incremented.  If the same elements
> also exist on disk, the count will be higher than it should be.
> The following is copied from an email exchange I had with Alan Gates:
> Alan,
> Thanks for your help.  I've done a bit more experimentation and have
> discovered a couple more things.  I first looked at how COUNT is
> implemented.  It looks like COUNT calls size() on the bag, which returns
> mSize.  I thought that mSize might be calculated improperly, so I added
> "SUM(unique_ids) AS crazy_userid_sum" to my GENERATE line and re-ran the
> pigfile:
> GENERATE FLATTEN(group),
>     SUM(nice_data.duration) AS total_duration,
>     COUNT(nice_data) AS channel_switches,
>     COUNT(unique_ids) AS unique_users,
>     SUM(unique_ids) AS crazy_userid_sum;
> It turns out that SUM generates the correct result in all cases, while
> there are still occasional errors in COUNT.  Since SUM requires an
> iteration over all the elements in the DistinctDataBag, this led me to
> believe that the uniqueness constraint is indeed operating correctly, but
> there is some error in the logic that calculates mSize.
> Then I started poking around in DistinctDataBag looking for anything that
> changes mSize and might be incorrect.  I noticed that on line 87 in
> addAll(), the size of the DataBag that is passed into the method is added
> to the mSize instance variable, and then during the iteration a few lines
> later mSize is incremented again each time an element is successfully
> added to mContents.  I thought this might be the problem, since it seems
> like elements would be double counted whenever addAll() is called.  I
> commented out line 87, recompiled Pig, and ran it again, but there are
> still errors (though I do think line 87 might be incorrect anyway).
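The suspected double counting in addAll() can be illustrated with a minimal sketch. ToyBag and its method names are hypothetical stand-ins written for this example; this is not the DistinctDataBag source, only the counting pattern the paragraph above describes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AddAllDoubleCount {
    // Hypothetical stand-in for the bag; NOT Pig's DistinctDataBag.
    static class ToyBag {
        private final Set<String> contents = new HashSet<>();
        private long size = 0;

        // Pattern described in the email: add the incoming size up front,
        // then increment again for every element that is actually inserted.
        void addAllSuspect(List<String> other) {
            size += other.size();
            for (String t : other) {
                if (contents.add(t)) {
                    size++;
                }
            }
        }

        // Same loop without the up-front addition.
        void addAllFixed(List<String> other) {
            for (String t : other) {
                if (contents.add(t)) {
                    size++;
                }
            }
        }

        long size() {
            return size;
        }
    }

    static long suspectSize(List<String> items) {
        ToyBag bag = new ToyBag();
        bag.addAllSuspect(items);
        return bag.size();
    }

    static long fixedSize(List<String> items) {
        ToyBag bag = new ToyBag();
        bag.addAllFixed(items);
        return bag.size();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c");
        System.out.println(suspectSize(items)); // prints 6
        System.out.println(fixedSize(items));   // prints 3
    }
}
```

In this toy version, three distinct elements are reported as six by the suspect variant. As the email notes, removing that line alone did not eliminate the COUNT errors.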
> Thanks to my coworker Marshall, I think we may have discovered what the
> actual problem is.  The scenario is as follows:  We're adding a bunch of
> stuff to the bag, and before we're finished a spill occurs.  mContents is
> cleared during the spill (line 157).  All add() does is check uniqueness
> against mContents.  So after the spill, elements that are already on disk
> can be added to mContents again, which inflates mSize.  SUM still works
> because the iterator is smart and enforces uniqueness as it reads the
> records back in.  We think this occurs at the beginning of addToQueue,
> around lines 363-369.  mMergeTree is a TreeSet, so it enforces uniqueness,
> and the call to addToQueue is aborted if there's already a matching record
> in mMergeTree.
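The scenario above can be reproduced in a small, self-contained sketch. ToyDistinctBag is a hypothetical model written for illustration: the in-memory list of spills stands in for the spill files, and readBack() only imitates the de-duplicating read path. None of the names below come from the Pig source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SpillOvercount {
    // Hypothetical model of the described behavior; NOT Pig's DistinctDataBag.
    static class ToyDistinctBag {
        private final TreeSet<Integer> contents = new TreeSet<>();     // in-memory set
        private final List<List<Integer>> spills = new ArrayList<>();  // stands in for disk
        private long size = 0;

        // Uniqueness is checked against the in-memory contents only.
        void add(int value) {
            if (contents.add(value)) {
                size++;
            }
        }

        // Write the current contents out as one sorted run, then clear memory.
        void spill() {
            spills.add(new ArrayList<>(contents));
            contents.clear();
        }

        long size() {
            return size;
        }

        // Reading back merges everything through a TreeSet, which drops duplicates.
        TreeSet<Integer> readBack() {
            TreeSet<Integer> merged = new TreeSet<>(contents);
            for (List<Integer> run : spills) {
                merged.addAll(run);
            }
            return merged;
        }
    }

    static ToyDistinctBag scenario() {
        ToyDistinctBag bag = new ToyDistinctBag();
        bag.add(1);
        bag.add(2);
        bag.add(3);
        bag.spill();   // memory is now empty; 1, 2, 3 live only "on disk"
        bag.add(2);    // accepted again, because memory no longer holds 2
        bag.add(3);
        bag.add(4);
        return bag;
    }

    public static void main(String[] args) {
        ToyDistinctBag bag = scenario();
        long sum = 0;
        for (int v : bag.readBack()) {
            sum += v;
        }
        System.out.println(bag.size());            // prints 6 (inflated counter)
        System.out.println(bag.readBack().size()); // prints 4 (true distinct count)
        System.out.println(sum);                   // prints 10 (iteration is correct)
    }
}
```

The counter and the iteration disagree in the same way the email reports for COUNT and SUM: the counter sees six insertions, while the merged read sees four distinct values.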
> Do you think our assessment is correct?  If so, it seems that the
> calculation of mSize needs to be significantly more complex than it is
> now.  It looks to me like the entire bag will need to be iterated in order
> to reliably calculate the size.  Do you have any ideas about how to
> implement this in a less expensive way?  I'd be happy to take a stab at it,
> but I don't want to do anything particularly silly if you have a better
> idea.
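One way to picture the "iterate the entire bag" option is a k-way merge over sorted runs that counts each value once. This is only a sketch under stated assumptions: each run (every spill plus the in-memory contents) is assumed to be individually sorted, and Cursor and countDistinct are hypothetical names. It is not the approach taken in size.patch, whose contents are not shown here.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class DistinctCountByMerge {
    // One cursor per sorted run: the current head value plus the rest of the run.
    private static final class Cursor implements Comparable<Cursor> {
        int head;
        final Iterator<Integer> rest;

        Cursor(int head, Iterator<Integer> rest) {
            this.head = head;
            this.rest = rest;
        }

        @Override
        public int compareTo(Cursor other) {
            return Integer.compare(head, other.head);
        }
    }

    // Counts distinct values across runs that are each sorted in ascending order.
    static long countDistinct(List<List<Integer>> sortedRuns) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }

        long count = 0;
        boolean havePrev = false;
        int prev = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            // Values leave the heap in ascending order, so a duplicate
            // always equals the value emitted immediately before it.
            if (!havePrev || c.head != prev) {
                count++;
                prev = c.head;
                havePrev = true;
            }
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 2, 3),   // a spilled run
                Arrays.asList(2, 3, 4));  // in-memory contents after the spill
        System.out.println(countDistinct(runs)); // prints 4
    }
}
```

The merge keeps only one element per run in memory, but it still reads every element, so it does not answer the question of a cheaper method; it only shows what the full-iteration fallback would compute.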

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
