Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <62C288C441C240D9998F0F6066F9A487@gmail.com>
References: 
 <CAKsb8n4JDLHN_aKwAYxJPFbYRdkOREXuZTVD4w_pt9gj0Y3CTQ@mail.gmail.com>
 <62C288C441C240D9998F0F6066F9A487@gmail.com>
From: =?UTF-8?Q?Utku_Can_Top=C3=A7u?= <utku@topcu.gen.tr>
Date: Wed, 13 Jun 2012 19:15:03 +0200
Message-ID: 
 <CAKsb8n40WhArb16B5OB5ngTYrzWSyD9XTiNJ8jJTMef-L7S+jw@mail.gmail.com>
Subject: Re: Distinct Counter Proposal for Cassandra
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=00248c6a66ba85960404c25dbad5

--00248c6a66ba85960404c25dbad5
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi Yuki,

I think I should have used the word "discussion" instead of "proposal" in
the subject line. I have a fair amount of the design in mind, but I don't
think it's mature enough to formalize yet. I'll try to simplify it and open
a Jira ticket.
But first, I'm wondering whether there would be any excitement in the
community for such a feature.

Regards,
Utku

On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.yuki@gmail.com> wrote:

> You can open a JIRA ticket at
> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>
> Just for the input:
>
> I once implemented a HyperLogLog counter for internal use in Cassandra,
> but it turned out I didn't need it, so I just put it in a gist. You can
> find it here: https://gist.github.com/2597943
>
> The above implementation and most of the other ones (including
> stream-lib) implement the optimized version of the algorithm, which
> counts up to 10^9, so it may need some work.
>
> Another alternative is the self-learning bitmap (
> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
> understanding, is more memory efficient when counting small values.
>
> Yuki
>
> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>
> Hi All,
>
> Let's assume we have a use case where we need to count the number of
> columns for a given key. Let's say the key is the URL and the column-name
> is the IP address or any cardinality identifier.
>
> The straightforward implementation seems simple: insert the IP addresses
> as columns under the key defined by the URL and use get_count to count
> them back. However, the problem is with large rows (those holding very
> many IP addresses): the get_count method has to de-serialize the whole
> row and calculate the count. As the user guides also note, it's not an
> O(1) operation and it's quite costly.
>
> However, this problem seems to have better solutions if you don't have a
> strict requirement for the count to be exact. There are streaming
> algorithms that provide good cardinality estimates within a predefined
> error bound. The most popular one seems to be the (Hyper)LogLog
> algorithm, and an optimal one was developed recently; please see
> http://dl.acm.org/citation.cfm?doid=1807085.1807094
>
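
To make the idea above concrete, here is a minimal sketch of a
HyperLogLog-style estimator. It is in Python only for brevity. The class
name, the parameter p and the use of SHA-1 as the hash are illustrative
assumptions of this sketch; it is not the code from the gist or from
stream-lib, and it leaves out the large-range correction.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog estimator (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a 64-bit hash from SHA-1; any well-mixed hash would do.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        total = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m ** 2 / total
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# One sketch per URL key; each IP address is added twice to show that
# duplicates do not change the estimate.
hll = HyperLogLog()
for i in range(50000):
    ip = "10.0.%d.%d" % (i // 256, i % 256)
    hll.add(ip)
    hll.add(ip)
estimate = hll.count()
print(round(estimate))
```

With p = 12 the sketch keeps 4096 small registers regardless of how many
addresses are added, and the expected relative error is roughly
1.04 / sqrt(4096), i.e. about 1.6%.
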
> If you want to take a look at a Java implementation of LogLog,
> Clearspring has both LogLog and space-optimized HyperLogLog available at
> https://github.com/clearspring/stream-lib
>
> I don't see a reason why this can't be implemented in Cassandra. The
> distributed nature of all these algorithms can easily be adapted to
> Cassandra's model. I think most of us would love to see some
> cardinality-estimating columns in Cassandra.
>
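
On the distributed part: the property that makes this fit is that two
sketches can be combined by taking the register-wise maximum, and the
result is the same as if a single sketch had seen every value. Below is a
rough illustration in Python. The names update, merge, node_a and node_b
are hypothetical, and nothing here reflects how Cassandra would actually
store or reconcile such a column.

```python
import hashlib

P = 10          # index bits (arbitrary choice for this sketch)
M = 1 << P      # registers per sketch


def update(registers, item):
    """Record one item in a HyperLogLog-style register array."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)


def merge(a, b):
    """Combine two sketches by taking the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


node_a, node_b, union = [0] * M, [0] * M, [0] * M
for i in range(2000):
    ip = "192.168.%d.%d" % (i // 256, i % 256)
    # Pretend writes for the same URL land on different nodes.
    update(node_a if i % 2 == 0 else node_b, ip)
    if i % 5 == 0:
        update(node_b, ip)      # some addresses are seen by both nodes
    update(union, ip)           # reference sketch that sees everything

merged = merge(node_a, node_b)
print(merged == union)
```

Because the maximum is commutative, associative and idempotent, the merge
can be applied in any order and repeated safely, which is why these
sketches suit an eventually consistent store.
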
> Regards,
> Utku
>
>
>

--00248c6a66ba85960404c25dbad5--