hadoop-pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
Date Wed, 14 Oct 2009 17:46:31 GMT

    [ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765663#action_12765663 ]

Dmitriy V. Ryaboy commented on PIG-966:

Regarding histogram representation:

I took a look at how Postgres does it, and they simply store 3 arrays:

* An array of "Most Common Values", which contains exactly what it sounds like, ordered by
decreasing frequency.
* A matching array of frequencies, each expressed as a fraction of the total row count in the relation.
* An array of sorted values chosen in such a way that the number of rows with values between
A[i] and A[i+1] is roughly the same for all i. An interesting optimization they perform is
that if the most common values array described above is defined for this field, then the values
in that array are excluded when calculating the boundaries for the histogram. They say
that's called a "compressed histogram", if someone wants to dig up some papers on this.
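
To make the three arrays concrete, here is a rough Python sketch of how they could be computed from a column of values. It is only an illustration of the idea described above, not Postgres's actual code and not a proposed Pig interface: the function name build_column_stats and the parameters num_mcv and num_buckets are made up for this example, and the boundaries are picked by simple index arithmetic over the sorted non-MCV values.

```python
from collections import Counter


def build_column_stats(values, num_mcv=2, num_buckets=4):
    """Return (most_common_values, frequencies, histogram_bounds).

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    total = len(values)
    if total == 0:
        return [], [], []

    # Array 1 and 2: most common values in decreasing frequency, with each
    # frequency expressed as a fraction of the total row count.
    counts = Counter(values)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[:num_mcv]
    mcv = [value for value, _ in top]
    freqs = [count / total for _, count in top]

    # Array 3: "compressed histogram". Rows whose value is already in the
    # MCV array are left out before the boundaries are chosen.
    mcv_set = set(mcv)
    rest = sorted(v for v in values if v not in mcv_set)
    if len(rest) < 2:
        return mcv, freqs, []

    # Pick evenly spaced positions in the sorted remainder so that roughly
    # the same number of rows falls between bounds[i] and bounds[i + 1].
    buckets = min(num_buckets, len(rest) - 1)
    last = len(rest) - 1
    bounds = [rest[(i * last) // buckets] for i in range(buckets + 1)]
    return mcv, freqs, bounds


if __name__ == "__main__":
    column = [1] * 5 + [2] * 3 + list(range(10, 19))
    print(build_column_stats(column))
```

With the sample column in the sketch, the MCV array is [1, 2], the frequencies are 5/17 and 3/17, and the histogram boundaries are [10, 12, 14, 16, 18], since the rows holding 1 and 2 are dropped before the boundaries are chosen.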

Any objections to this design?

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly.
> See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
