datafu-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jan Willem (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-91) pig version of HyperLogLog estimator should be Algebraic and use combiners
Date Thu, 11 Jun 2015 09:00:15 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581699#comment-14581699
] 

Jan Willem commented on DATAFU-91:
----------------------------------

I am under the impression that the type output by both initial and intermediate should be
the same. Looking at how AVG is implemented (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.8.0/org/apache/pig/builtin/AVG.java),
or the udf manual on the wiki (https://wiki.apache.org/pig/UDFManual search for algebraic),
they return a tuple containing the combined information.

I'm probably just obsessing about the type cast, but you could get around it by using a tagged
union: a tuple of two, with the first element indicating whether it's a Long in the second
field, or rather a serialized HyperLogLogPlus. So: (1, <longvalue>) or (2, <HyperLogLogPlus>).
You could also go for the first value containing the number of items, and as a special case
have the 1 case contain just a long value: (1, <longvalue>) or (<value larger than
1>, <HyperLogLogValue>).

It would only add a little to the size, and get rid of the instanceof.

It's just a matter of opinion, I guess.

> pig version of HyperLogLog estimator should be Algebraic and use combiners
> --------------------------------------------------------------------------
>
>                 Key: DATAFU-91
>                 URL: https://issues.apache.org/jira/browse/DATAFU-91
>             Project: DataFu
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ido Hadanny
>            Assignee: Ido Hadanny
>            Priority: Minor
>             Fix For: 1.3.0
>
>         Attachments: hyper-log-log-algebraic-3.diff, hyper-log-log-algebraic.diff, hyper-log-log-algebraic.diff
>
>
> Matt: I don't remember if there was a particular reason I didn't implement this as AlgebraicEvalFunc.
It seems like it could be. I believe the Java MapReduce version leverages the combiner. If
you want to try making this Algebraic we would be happy to accept a patch :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message