accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Tracking cardinality in Accumulo
Date Fri, 16 May 2014 21:19:20 GMT
Yes, the data has not yet been ingested. I can control the table structure,
hopefully by integrating (or extending) the D4M schema.

I'm leaning towards using https://github.com/addthis/stream-lib as part of
the ingest process. Upon startup, existing tables would be analyzed to
find their cardinality. Then, as records are ingested, the cardinality
would be adjusted as needed. I don't yet know how to store the cardinality
information so that restarting the ingest process doesn't require
re-processing all the data. Still researching.
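
A minimal sketch of the idea, assuming the HyperLogLog registers can be
serialized to bytes and saved somewhere between runs (for example as the
value of a single table cell). This is a standalone Python illustration, not
stream-lib's API; the class name, the choice of 2**12 registers, the SHA-1
hash, and the byte layout are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p one-byte registers (illustrative)."""

    def __init__(self, p=12, registers=None):
        self.p = p
        self.m = 1 << p
        # Copy saved registers if given, otherwise start with all zeros.
        self.registers = bytearray(registers) if registers is not None else bytearray(self.m)

    def add(self, value):
        # Derive a 64-bit hash from the value (hash choice is an assumption).
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)

    def to_bytes(self):
        # Hypothetical layout: one byte for p, then the registers.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        return cls(p=data[0], registers=data[1:])


# First ingest run: feed in the values seen for the NAME field.
sketch = HyperLogLog()
for name in ["John", "Jake", "John", "Mary"]:
    sketch.add(name)

# Persist the registers so a restart does not need to re-read the data.
saved = sketch.to_bytes()

# After a restart: rebuild the sketch from the saved bytes and keep ingesting.
restored = HyperLogLog.from_bytes(saved)
restored.add("Mary")  # already seen, so the registers do not change
print("NAME", restored.cardinality())
```

With this layout the saved state is a fixed 4097 bytes per tracked field,
regardless of how many records have been ingested, and the estimate is
approximate rather than exact.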


On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <cjnolet@gmail.com> wrote:

> Can we assume this data has not yet been ingested? Do you have control
> over the way in which you structure your table?
>
>
>
> On Fri, May 16, 2014 at 1:54 PM, David Medinets <david.medinets@gmail.com> wrote:
>
>> If I have the following simple set of data:
>>
>> NAME John
>> NAME Jake
>> NAME John
>> NAME Mary
>>
>> I want to end up with the following:
>>
>> NAME 3
>>
>> I'm thinking that perhaps a HyperLogLog approach should work. See
>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>
>> Has anyone done this before in Accumulo?
>>
>
>
