accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Tracking cardinality in Accumulo
Date Sat, 17 May 2014 12:40:15 GMT
>What's the expected size of your unique key set? Thousands? Millions?
Billions?

This project is something to occupy me in my spare time. And it's intended to
explore aspects of Accumulo that I haven't needed to use yet. In the past,
I simply ran a map-reduce job using the Word Counting technique.

tl;dr - The expected size of the unique key set would be in the millions.
Too large to calculate on-the-fly for a web application.


On Fri, May 16, 2014 at 6:04 PM, Corey Nolet <cjnolet@gmail.com> wrote:

> What's the expected size of your unique key set? Thousands? Millions?
> Billions?
>
> You could probably use a table structure similar to
> https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
> but just have it emit 1's instead of summing them.
>
> I'm thinking maybe your mappings could be like this:
> group=anything, type=NAME, name=John(etc...)
>
> Perhaps a ColumnQualifierGrouping iterator could be applied at scan time
> to add up the cardinalities for the quals over the time range being
> scanned, with cardinalities across different time units aggregated
> client side.
>
>
>
>
> On Fri, May 16, 2014 at 5:19 PM, David Medinets <david.medinets@gmail.com> wrote:
>
>> Yes, the data has not yet been ingested. I can control the table
>> structure; hopefully by integrating (or extending) the D4M schema.
>>
>> I'm leaning towards using https://github.com/addthis/stream-lib as part
>> of the ingest process. Upon start up, existing tables would be analyzed to
>> find cardinality. Then as records are ingested, the cardinality would be
>> adjusted as needed. I don't yet know how to store the cardinality
>> information so that restarting the ingest process doesn't require
>> re-processing all the data. Still researching.
>>
>>
>> On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <cjnolet@gmail.com> wrote:
>>
>>> Can we assume this data has not yet been ingested? Do you have control
>>> over the way in which you structure your table?
>>>
>>>
>>>
>>> On Fri, May 16, 2014 at 1:54 PM, David Medinets <
>>> david.medinets@gmail.com> wrote:
>>>
>>>> If I have the following simple set of data:
>>>>
>>>> NAME John
>>>> NAME Jake
>>>> NAME John
>>>> NAME Mary
>>>>
>>>> I want to end up with the following:
>>>>
>>>> NAME 3
>>>>
>>>> I'm thinking that perhaps a HyperLogLog approach should work. See
>>>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>>>
>>>> Has anyone done this before in Accumulo?
>>>>
>>>
>>>
>>
>
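
For readers following the thread, the HyperLogLog idea and the open question about surviving an ingest restart can be sketched as below. This is a minimal illustration in Python, not stream-lib's implementation and not anything proposed in the thread: the class name `HyperLogLog`, the precision `p=12`, the use of SHA-1 as the hash, and the `to_bytes`/`from_bytes` layout are all assumptions made for the example. The register array is small (4096 bytes here), so one plausible option, again an assumption, is to store the serialized bytes as a value in a table and reload them when ingest restarts instead of re-processing the data.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not stream-lib's API)."""

    def __init__(self, p=12, registers=None):
        self.p = p            # number of hash bits used to pick a register
        self.m = 1 << p       # number of registers (4096 when p=12)
        self.registers = (
            bytearray(registers) if registers is not None else bytearray(self.m)
        )
        if len(self.registers) != self.m:
            raise ValueError("register count does not match precision")

    def add(self, value):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        width = 64 - self.p
        idx = h >> width                  # top p bits select the register
        rest = h & ((1 << width) - 1)     # remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: keep the per-register maximum.
        if other.p != self.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

    def to_bytes(self):
        # Hypothetical layout: one byte of precision, then the registers.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        return cls(p=data[0], registers=data[1:])


if __name__ == "__main__":
    names = HyperLogLog()
    for name in ["John", "Jake", "John", "Mary"]:   # the NAME rows from the thread
        names.add(name)
    print("NAME", round(names.count()))             # approximate distinct count

    # Simulate an ingest restart: persist the sketch, reload it, keep adding.
    saved = names.to_bytes()
    resumed = HyperLogLog.from_bytes(saved)
    resumed.add("Anna")
    print("NAME", round(resumed.count()))
```

Because merging is a per-register maximum, sketches kept per time unit could in principle be combined at scan time or on the client, which lines up with the aggregation idea discussed above. The estimate is approximate; with 4096 registers the typical relative error is on the order of a couple of percent for large key sets.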
