Mailing-List: contact dev-help@datafu.incubator.apache.org; run by ezmlm
Reply-To: dev@datafu.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: <0E29ACE6-59ED-4CD1-84D3-E8E50514BACA@gmail.com>
References: <0E29ACE6-59ED-4CD1-84D3-E8E50514BACA@gmail.com>
Date: Mon, 27 Apr 2015 16:26:35 +0300
Subject: Re: why is data.fu implementing HyperLogLog as an accumulator and not as algebraic?
From: Ido Hadanny <ido.hadanny@gmail.com>
To: Matthew Hayes <matthew.terence.hayes@gmail.com>
Cc: "dev@datafu.incubator.apache.org", ilia, "ihadanny@paypal.com"
Content-Type: multipart/mixed; boundary=001a113ce726f02dc40514b4b12c

--001a113ce726f02dc40514b4b12c
Content-Type: multipart/alternative; boundary=001a113ce726f02dbd0514b4b12a

--001a113ce726f02dbd0514b4b12a
Content-Type: text/plain; charset=UTF-8

Hey guys,
the patch is attached and tested with the unit tests, and we're testing it on a real 1000-node Hadoop cluster as we speak.
Do you want us to create a JIRA issue for this, or is this good enough?
Thanks, Ilia and Ido

On 7 March 2015 at 23:09, Matthew Hayes <matthew.terence.hayes@gmail.com> wrote:

> I don't remember if there was a particular reason I didn't implement this
> as AlgebraicEvalFunc. It seems like it could be. I believe the Java
> MapReduce version leverages the combiner. If you want to try making this
> Algebraic we would be happy to accept a patch :)
>
> -Matt
>
> > On Mar 7, 2015, at 12:11 PM, Ido Hadanny <ido.hadanny@gmail.com> wrote:
> >
> > data.fu has a nice implementation of HyperLogLog for estimating cardinality
> > here
> > <https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java>
> >
> > However, it's implemented as Accumulator which means it will run only at
> > the reducer and not in the combiner (but it will never load the entire set
> > into memory as in normal EvalFunc). Why couldn't data.fu implement it as
> > Algebraic - and fill the registers at every combiner, then merge and reduce
> > the result? Am I missing something here?
> > also available here:
> > http://stackoverflow.com/questions/28908217/why-is-data-fu-implementing-hyperloglog-as-an-accumulator-and-not-as-algebraic
> >
> > thanks!
> >
> >
> > --
> > Sent from my androido
>

--
Sent from my androido
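
For archive readers, here is a minimal sketch of why the "fill the registers at every combiner, then merge" step asked about in this thread is safe. It is a toy register array written for illustration only, not the stream-lib HyperLogLogPlus class and not the attached patch. The class name HllRegisters, the mix64 hash (a SplitMix64-style mixer standing in for MurmurHash), and the precision of 12 are all assumptions. Each partial sketch keeps the largest rank seen per register, so merging partial sketches by element-wise maximum should give the same registers as a single pass over all the data:

```java
import java.util.Arrays;

/** Toy HyperLogLog-style sketch for illustration; hypothetical, not stream-lib's HyperLogLogPlus. */
final class HllRegisters {
    private final int p;            // precision: the sketch holds 2^p registers
    private final byte[] registers; // each register keeps the largest rank seen so far

    HllRegisters(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be between 7 and 16");
        }
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** SplitMix64-style bit mixer, used here as a stand-in for MurmurHash (an assumption). */
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Per-record step: hash the value and keep the maximum rank in its register. */
    void offer(long value) {
        long h = mix64(value);
        int idx = (int) (h >>> (64 - p));   // top p bits select the register
        long rest = h << p;                 // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Combiner/reducer step: element-wise maximum of two partial sketches. */
    HllRegisters merge(HllRegisters other) {
        if (other.p != p) {
            throw new IllegalArgumentException("precision mismatch");
        }
        HllRegisters out = new HllRegisters(p);
        for (int i = 0; i < registers.length; i++) {
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
        return out;
    }

    boolean sameRegisters(HllRegisters other) {
        return Arrays.equals(registers, other.registers);
    }

    /** Basic HyperLogLog estimate with the linear-counting correction for small counts. */
    double estimate() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // standard constant for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        HllRegisters whole = new HllRegisters(12);
        HllRegisters left = new HllRegisters(12);
        HllRegisters right = new HllRegisters(12);
        for (long v = 0; v < 100_000; v++) {
            whole.offer(v);                          // one pass over everything
            (v % 2 == 0 ? left : right).offer(v);    // two "combiner" partials
        }
        HllRegisters merged = left.merge(right);
        System.out.println("registers identical: " + merged.sameRegisters(whole));
        System.out.println("estimate: " + Math.round(merged.estimate()));
    }
}
```

In Pig terms, offer roughly corresponds to the per-record work, and merge to what the intermediate and final stages of an Algebraic UDF would do with partial sketches. Because taking a maximum is commutative and associative, the order and grouping of merges should not change the result.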
--001a113ce726f02dbd0514b4b12a-- --001a113ce726f02dc40514b4b12c Content-Type: text/plain; charset=US-ASCII; name="hyper-log-log-algebraic.diff" Content-Disposition: attachment; filename="hyper-log-log-algebraic.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_i8zxb8gt0 ZGlmZiAtLWdpdCBhL2RhdGFmdS1waWcvc3JjL21haW4vamF2YS9kYXRhZnUvcGlnL3N0YXRzL0h5 cGVyTG9nTG9nUGx1c1BsdXMuamF2YSBiL2RhdGFmdS1waWcvc3JjL21haW4vamF2YS9kYXRhZnUv cGlnL3N0YXRzL0h5cGVyTG9nTG9nUGx1c1BsdXMuamF2YQppbmRleCA5NWM1YjBlLi5iYjU4MTBm IDEwMDY0NAotLS0gYS9kYXRhZnUtcGlnL3NyYy9tYWluL2phdmEvZGF0YWZ1L3BpZy9zdGF0cy9I eXBlckxvZ0xvZ1BsdXNQbHVzLmphdmEKKysrIGIvZGF0YWZ1LXBpZy9zcmMvbWFpbi9qYXZhL2Rh dGFmdS9waWcvc3RhdHMvSHlwZXJMb2dMb2dQbHVzUGx1cy5qYXZhCkBAIC0yMCwxNCArMjAsMjQg QEAKIHBhY2thZ2UgZGF0YWZ1LnBpZy5zdGF0czsKIAogaW1wb3J0IGphdmEuaW8uSU9FeGNlcHRp b247CitpbXBvcnQgamF2YS51dGlsLkl0ZXJhdG9yOwogCiBpbXBvcnQgb3JnLmFwYWNoZS5waWcu QWNjdW11bGF0b3JFdmFsRnVuYzsKK2ltcG9ydCBvcmcuYXBhY2hlLnBpZy5FdmFsRnVuYzsKK2lt cG9ydCBvcmcuYXBhY2hlLnBpZy5QaWdFeGNlcHRpb247CitpbXBvcnQgb3JnLmFwYWNoZS5waWcu YmFja2VuZC5leGVjdXRpb25lbmdpbmUuRXhlY0V4Y2VwdGlvbjsKIGltcG9ydCBvcmcuYXBhY2hl LnBpZy5kYXRhLkRhdGFCYWc7CitpbXBvcnQgb3JnLmFwYWNoZS5waWcuZGF0YS5EYXRhQnl0ZUFy cmF5OwogaW1wb3J0IG9yZy5hcGFjaGUucGlnLmRhdGEuRGF0YVR5cGU7CiBpbXBvcnQgb3JnLmFw YWNoZS5waWcuZGF0YS5UdXBsZTsKK2ltcG9ydCBvcmcuYXBhY2hlLnBpZy5kYXRhLlR1cGxlRmFj dG9yeTsKIGltcG9ydCBvcmcuYXBhY2hlLnBpZy5pbXBsLmxvZ2ljYWxMYXllci5Gcm9udGVuZEV4 Y2VwdGlvbjsKIGltcG9ydCBvcmcuYXBhY2hlLnBpZy5pbXBsLmxvZ2ljYWxMYXllci5zY2hlbWEu U2NoZW1hOwogCitpbXBvcnQgY29tLmNsZWFyc3ByaW5nLmFuYWx5dGljcy5oYXNoLk11cm11ckhh c2g7CitpbXBvcnQgY29tLmNsZWFyc3ByaW5nLmFuYWx5dGljcy5zdHJlYW0uY2FyZGluYWxpdHku Q2FyZGluYWxpdHlNZXJnZUV4Y2VwdGlvbjsKK2ltcG9ydCBjb20uY2xlYXJzcHJpbmcuYW5hbHl0 aWNzLnN0cmVhbS5jYXJkaW5hbGl0eS5IeXBlckxvZ0xvZ1BsdXM7CisKIC8qKgogICogQSBVREYg dGhhdCBhcHBsaWVzIHRoZSBIeXBlckxvZ0xvZysrIGNhcmRpbmFsaXR5IGVzdGltYXRpb24gYWxn b3JpdGhtLgogICogCkBAIC00NSw4ICs1NSw5IEBAIGltcG9ydCBvcmcuYXBhY2hlLnBpZy5pbXBs 
LmxvZ2ljYWxMYXllci5zY2hlbWEuU2NoZW1hOwogcHVibGljIGNsYXNzIEh5cGVyTG9nTG9nUGx1 c1BsdXMgZXh0ZW5kcyBBY2N1bXVsYXRvckV2YWxGdW5jPExvbmc+CiB7CiAgIHByaXZhdGUgY29t LmNsZWFyc3ByaW5nLmFuYWx5dGljcy5zdHJlYW0uY2FyZGluYWxpdHkuSHlwZXJMb2dMb2dQbHVz IGVzdGltYXRvcjsKLSAgCi0gIHByaXZhdGUgZmluYWwgaW50IHA7CisJcHJpdmF0ZSBzdGF0aWMg VHVwbGVGYWN0b3J5IG1UdXBsZUZhY3RvcnkgPSBUdXBsZUZhY3RvcnkuZ2V0SW5zdGFuY2UoKTsK KworCXByaXZhdGUgc3RhdGljIGludCBwOwogICAKICAgLyoqCiAgICAqIENvbnN0cnVjdHMgYSBI eXBlckxvZ0xvZysrIGVzdGltYXRvci4KQEAgLTYxLDkgKzcyLDkgQEAgcHVibGljIGNsYXNzIEh5 cGVyTG9nTG9nUGx1c1BsdXMgZXh0ZW5kcyBBY2N1bXVsYXRvckV2YWxGdW5jPExvbmc+CiAgICAq IAogICAgKiBAcGFyYW0gcCBwcmVjaXNpb24gdmFsdWUKICAgICovCi0gIHB1YmxpYyBIeXBlckxv Z0xvZ1BsdXNQbHVzKFN0cmluZyBwKQorICBwdWJsaWMgSHlwZXJMb2dMb2dQbHVzUGx1cyhTdHJp bmcgcGFyKQogICB7Ci0gICAgdGhpcy5wID0gSW50ZWdlci5wYXJzZUludChwKTsKKyAgICBwID0g SW50ZWdlci5wYXJzZUludChwYXIpOwogICAgIGNsZWFudXAoKTsKICAgfQogICAKQEAgLTExMSw0 ICsxMjIsOTQgQEAgcHVibGljIGNsYXNzIEh5cGVyTG9nTG9nUGx1c1BsdXMgZXh0ZW5kcyBBY2N1 bXVsYXRvckV2YWxGdW5jPExvbmc+CiAgICAgICB0aHJvdyBuZXcgUnVudGltZUV4Y2VwdGlvbihl KTsKICAgICB9CiAgIH0KKyAgCisJcHVibGljIFN0cmluZyBnZXRJbml0aWFsKCkgeworCQlyZXR1 cm4gSW5pdGlhbC5jbGFzcy5nZXROYW1lKCk7CisJfQorCisJcHVibGljIFN0cmluZyBnZXRJbnRl cm1lZCgpIHsKKwkJcmV0dXJuIEludGVybWVkaWF0ZS5jbGFzcy5nZXROYW1lKCk7CisJfQorCisJ cHVibGljIFN0cmluZyBnZXRGaW5hbCgpIHsKKwkJcmV0dXJuIEZpbmFsLmNsYXNzLmdldE5hbWUo KTsKKwl9CisKKwlzdGF0aWMgcHVibGljIGNsYXNzIEluaXRpYWwgZXh0ZW5kcyBFdmFsRnVuYzxU dXBsZT4geworCisJCUBPdmVycmlkZQorCQlwdWJsaWMgVHVwbGUgZXhlYyhUdXBsZSBpbnB1dCkg dGhyb3dzIElPRXhjZXB0aW9uIHsKKwkJCS8vIFNpbmNlIEluaXRpYWwgaXMgZ3VhcmFudGVlZCB0 byBiZSBjYWxsZWQKKwkJCS8vIG9ubHkgaW4gdGhlIG1hcCwgaXQgd2lsbCBiZSBjYWxsZWQgd2l0 aCBhbgorCQkJLy8gaW5wdXQgb2YgYSBiYWcgd2l0aCBhIHNpbmdsZSB0dXBsZSAtIHRoZQorCQkJ Ly8gY291bnQgc2hvdWxkIGFsd2F5cyBiZSAxIGlmIGJhZyBpcyBub24gZW1wdHkKKwkJCURhdGFC YWcgYmFnID0gKERhdGFCYWcpIGlucHV0LmdldCgwKTsKKwkJCUl0ZXJhdG9yPFR1cGxlPiBpdCA9 
IGJhZy5pdGVyYXRvcigpOworCQkJaWYgKGl0Lmhhc05leHQoKSkgeworCQkJCVR1cGxlIHQgPSAo VHVwbGUpIGl0Lm5leHQoKTsKKwkJCQlpZiAodCAhPSBudWxsICYmIHQuc2l6ZSgpID4gMCAmJiB0 LmdldCgwKSAhPSBudWxsKSB7CisJCQkJCWxvbmcgeCA9IE11cm11ckhhc2guaGFzaDY0KHQpOwor CQkJCQlyZXR1cm4gbVR1cGxlRmFjdG9yeS5uZXdUdXBsZSgoT2JqZWN0KSB4KTsKKwkJCQl9CisJ CQl9CisJCQlyZXR1cm4gbVR1cGxlRmFjdG9yeS5uZXdUdXBsZSgoT2JqZWN0KSBNdXJtdXJIYXNo Lmhhc2g2NChudWxsKSk7CisJCX0KKwl9CisKKwlzdGF0aWMgcHVibGljIGNsYXNzIEludGVybWVk aWF0ZSBleHRlbmRzIEV2YWxGdW5jPFR1cGxlPiB7CisJCUBPdmVycmlkZQorCQlwdWJsaWMgVHVw bGUgZXhlYyhUdXBsZSBpbnB1dCkgdGhyb3dzIElPRXhjZXB0aW9uIHsKKwkJCXRyeSB7CisJCQkJ RGF0YUJ5dGVBcnJheSBkYXRhID0gbmV3IERhdGFCeXRlQXJyYXkoY291bnREaXNjdGluY3QoaW5w dXQpLmdldEJ5dGVzKCkpOworCQkJCXJldHVybiBtVHVwbGVGYWN0b3J5Lm5ld1R1cGxlKGRhdGEp OworCQkJfSBjYXRjaCAoRXhlY0V4Y2VwdGlvbiBlZSkgeworCQkJCXRocm93IGVlOworCQkJfSBj YXRjaCAoRXhjZXB0aW9uIGUpIHsKKwkJCQlpbnQgZXJyQ29kZSA9IDIxMDY7CisJCQkJU3RyaW5n IG1zZyA9ICJFcnJvciB3aGlsZSBjb21wdXRpbmcgY291bnQgaW4gIgorCQkJCQkJKyB0aGlzLmdl dENsYXNzKCkuZ2V0U2ltcGxlTmFtZSgpOworCQkJCXRocm93IG5ldyBFeGVjRXhjZXB0aW9uKG1z ZywgZXJyQ29kZSwgUGlnRXhjZXB0aW9uLkJVRywgZSk7CisJCQl9CisJCX0KKwl9CisKKwlzdGF0 aWMgcHVibGljIGNsYXNzIEZpbmFsIGV4dGVuZHMgRXZhbEZ1bmM8TG9uZz4geworCQlAT3ZlcnJp ZGUKKwkJcHVibGljIExvbmcgZXhlYyhUdXBsZSBpbnB1dCkgdGhyb3dzIElPRXhjZXB0aW9uIHsK KwkJCXRyeSB7CisJCQkJcmV0dXJuIGNvdW50RGlzY3RpbmN0KGlucHV0KS5jYXJkaW5hbGl0eSgp OworCQkJfSBjYXRjaCAoRXhjZXB0aW9uIGVlKSB7CisJCQkJaW50IGVyckNvZGUgPSAyMTA2Owor CQkJCVN0cmluZyBtc2cgPSAiRXJyb3Igd2hpbGUgY29tcHV0aW5nIGNvdW50IGluICIKKwkJCQkJ CSsgdGhpcy5nZXRDbGFzcygpLmdldFNpbXBsZU5hbWUoKTsKKwkJCQl0aHJvdyBuZXcgRXhlY0V4 Y2VwdGlvbihtc2csIGVyckNvZGUsIFBpZ0V4Y2VwdGlvbi5CVUcsIGVlKTsKKwkJCX0KKwkJfQor CX0KKworCXN0YXRpYyBwcm90ZWN0ZWQgSHlwZXJMb2dMb2dQbHVzIGNvdW50RGlzY3RpbmN0KFR1 cGxlIGlucHV0KQorCQkJdGhyb3dzIE51bWJlckZvcm1hdEV4Y2VwdGlvbiwgSU9FeGNlcHRpb24g eworCQlIeXBlckxvZ0xvZ1BsdXMgZXN0aW1hdG9yID0gbmV3IEh5cGVyTG9nTG9nUGx1cyhwKTsK 
KwkJRGF0YUJhZyB2YWx1ZXMgPSAoRGF0YUJhZykgaW5wdXQuZ2V0KDApOworCQlmb3IgKEl0ZXJh dG9yPFR1cGxlPiBpdCA9IHZhbHVlcy5pdGVyYXRvcigpOyBpdC5oYXNOZXh0KCk7KSB7CisJCQlU dXBsZSB0ID0gaXQubmV4dCgpOworCQkJT2JqZWN0IGRhdGEgPSB0LmdldCgwKTsKKwkJCWlmIChk YXRhIGluc3RhbmNlb2YgTG9uZykgeworCQkJCWVzdGltYXRvci5vZmZlcihkYXRhKTsKKwkJCX0g ZWxzZSBpZiAoZGF0YSBpbnN0YW5jZW9mIERhdGFCeXRlQXJyYXkpIHsKKwkJCQlEYXRhQnl0ZUFy cmF5IGJ5dGVzID0gKERhdGFCeXRlQXJyYXkpIGRhdGE7CisJCQkJSHlwZXJMb2dMb2dQbHVzIG5l d0VzdGltYXRvcjsKKwkJCQl0cnkgeworCQkJCQluZXdFc3RpbWF0b3IgPSBIeXBlckxvZ0xvZ1Bs dXMuQnVpbGRlci5idWlsZChieXRlcy5nZXQoKSk7CisJCQkJCWVzdGltYXRvciA9IChIeXBlckxv Z0xvZ1BsdXMpIGVzdGltYXRvci5tZXJnZShuZXdFc3RpbWF0b3IpOworCQkJCX0gY2F0Y2ggKElP RXhjZXB0aW9uIGUpIHsKKwkJCQkJdGhyb3cgbmV3IFJ1bnRpbWVFeGNlcHRpb24oZSk7CisJCQkJ fSBjYXRjaCAoQ2FyZGluYWxpdHlNZXJnZUV4Y2VwdGlvbiBlKSB7CisJCQkJCXRocm93IG5ldyBS dW50aW1lRXhjZXB0aW9uKGUpOworCQkJCX0KKwkJCX0KKwkJfQorCQlyZXR1cm4gZXN0aW1hdG9y OworCX0KKyAgCiB9Cg== --001a113ce726f02dc40514b4b12c--