accumulo-user mailing list archives

From Marc Parisi <m...@accumulo.net>
Subject Re: Tracking cardinality in Accumulo
Date Sat, 17 May 2014 01:20:03 GMT
Whoops, sorry for the empty response; I'm new to e-mail. The bitset
within HLL supports union directly, and intersection can be derived from it.
You should be able to estimate cardinality without re-reading the data. In
effect, you can segment your estimation (one sketch per segment, merged
later) and keep the error below about 2%.
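
As a rough check on the "about 2%" figure: the commonly quoted relative standard error of HyperLogLog with m registers is approximately 1.04 / sqrt(m). The sketch below only does that arithmetic; the function names are made up for illustration, and the result is a typical error, not a hard bound.

```python
import math


def hll_standard_error(p):
    """Approximate relative standard error of an HLL sketch with 2**p registers."""
    return 1.04 / math.sqrt(2 ** p)


def min_precision_for(target_error):
    """Smallest p (register-index bits) whose standard error is <= target_error."""
    p = 4
    while hll_standard_error(p) > target_error:
        p += 1
    return p


if __name__ == "__main__":
    p = min_precision_for(0.02)
    print(p, 2 ** p, hll_standard_error(p))
```

Under this approximation, 2048 registers fall just short of a 2% target, while 4096 registers give roughly 1.6%.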

Union is straightforward, whereas intersection is estimated by
inclusion-exclusion: |FIELD_1| + |FIELD_2| - |FIELD_1 UNION FIELD_2|
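
To make the union and intersection estimates concrete, here is a minimal, self-contained HyperLogLog sketch in Python. It is an illustration only, not stream-lib's implementation: the class name, the SHA-1 based hash, and the default of 4096 registers are all assumptions chosen for the example. Union is a register-wise max; the intersection is then estimated with the inclusion-exclusion formula above.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (illustrative; not stream-lib's API)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used to pick a register
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Merging two sketches is a register-wise max; no raw data is re-read.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def intersection_estimate(a, b):
    # Inclusion-exclusion: |A| + |B| - |A UNION B|
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()


if __name__ == "__main__":
    field_1, field_2 = HyperLogLog(), HyperLogLog()
    for i in range(20000):
        field_1.add(i)
    for i in range(10000, 30000):
        field_2.add(i)
    print("union:", round(field_1.union(field_2).cardinality()))
    print("intersection:", round(intersection_estimate(field_1, field_2)))
```

Note that the intersection estimate combines the errors of three separate estimates, so its relative error can be noticeably larger than that of a single sketch, especially when the true overlap is small compared to the union.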


On Fri, May 16, 2014 at 9:17 PM, Marc Parisi <marc@accumulo.net> wrote:

>
>
>
> On Fri, May 16, 2014 at 6:04 PM, Corey Nolet <cjnolet@gmail.com> wrote:
>
>> What's the expected size of your unique key set? Thousands? Millions?
>> Billions?
>>
>> You could probably use a table structure similar to
>> https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
>> but just have it emit 1's instead of summing them.
>>
>> I'm thinking maybe your mappings could be like this:
>> group=anything, type=NAME, name=John(etc...)
>>
>> perhaps a ColumnQualifierGrouping iterator could be applied at scan time
>> to add up the cardinalities for the quals over the given time range being
>> scanned where cardinalities across different time units get aggregated
>> client side.
>>
>>
>>
>>
>> On Fri, May 16, 2014 at 5:19 PM, David Medinets
>> <david.medinets@gmail.com> wrote:
>>
>>> Yes, the data has not yet been ingested. I can control the table
>>> structure; hopefully by integrating (or extending) the D4M schema.
>>>
>>> I'm leaning towards using https://github.com/addthis/stream-lib as part
>>> of the ingest process. Upon start up, existing tables would be analyzed to
>>> find cardinality. Then as records are ingested, the cardinality would be
>>> adjusted as needed. I don't yet know how to store the cardinality
>>> information so that restarting the ingest process doesn't require
>>> re-processing all the data. Still researching.
>>>
>>>
>>> On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <cjnolet@gmail.com> wrote:
>>>
>>>> Can we assume this data has not yet been ingested? Do you have control
>>>> over the way in which you structure your table?
>>>>
>>>>
>>>>
>>>> On Fri, May 16, 2014 at 1:54 PM, David Medinets <
>>>> david.medinets@gmail.com> wrote:
>>>>
>>>>> If I have the following simple set of data:
>>>>>
>>>>> NAME John
>>>>> NAME Jake
>>>>> NAME John
>>>>> NAME Mary
>>>>>
>>>>> I want to end up with the following:
>>>>>
>>>>> NAME 3
>>>>>
>>>>> I'm thinking that perhaps a HyperLogLog approach should work. See
>>>>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>>>>
>>>>> Has anyone done this before in Accumulo?
>>>>>
>>>>
>>>>
>>>
>>
>
