accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: sorting in Accumulo
Date Tue, 06 Mar 2012 19:00:32 GMT
Another way around the duplicate issue that Jason pointed out is to
modify the Versioning iterator to keep more than one version.  You
could set max versions to MAX_LONG.  Do this instead of putting the
md5 in the key.  This way, even if the timestamp is the same you will
still keep the data.

The only problem with this is that if you insert the exact same
column/value in a mutation twice, only one will be kept, as described
in ACCUMULO-227.  Otherwise all versions of a key will be kept.
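
To make the retention idea above concrete, here is a toy Python model.
It is not Accumulo code: ToyVersionedTable and its methods are
hypothetical names made up for this illustration. It only imitates
keeping up to max_versions entries per [row, colfam, colqual], it
ignores colvis, and it does not model the ACCUMULO-227 de-duplication
inside a single mutation. Which entry survives among equal timestamps
when max_versions is 1 is an artifact of the sketch.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Toy in-memory model of per-key version retention (not Accumulo)."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        # (row, colfam, colqual) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, colfam, colqual, timestamp, value):
        versions = self._cells[(row, colfam, colqual)]
        versions.append((timestamp, value))
        # Newest timestamp first; entries with equal timestamps are all
        # retained as long as the version limit allows.
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]

    def scan(self):
        # Yield entries in sorted key order, newest version first.
        for key in sorted(self._cells):
            for timestamp, value in self._cells[key]:
                yield key, timestamp, value


# Two different values written under the same key and timestamp.
one = ToyVersionedTable(max_versions=1)
many = ToyVersionedTable(max_versions=2**31 - 1)
for table in (one, many):
    table.put("20120306190032", "record", "", 100, '{"id": 1}')
    table.put("20120306190032", "record", "", 100, '{"id": 2}')

kept_one = list(one.scan())
kept_many = list(many.scan())
```

With the limit at 1 the second value is lost; with a very large limit
both values come back from the scan.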


On Tue, Mar 6, 2012 at 1:06 PM, Jason Trost <jason.trost@gmail.com> wrote:
> You could ingest this data into accumulo using the following "schema"
>
> row:       timestamp
> colfam:  "record"
> colqual: md5(JSON)
> value:   JSON record
>
> Accumulo would sort this for you in lexicographical order by timestamp
> (stored as a string). If all of your epoch timestamps have the same
> number of digits, then lexicographical order equals numeric order.  If
> this is not the case for you, then you could convert your timestamps
> to a string using the following template (with each field zero-padded
> to its max length):
>
> ${year}${month}${day}${hour}${minute}${second}
>
> The md5(JSON) is there b/c I assume some of your events could have the
> same timestamp.  If you could have events that are exactly the same
> (and you need to track this) you may want to append a one-up counter
> to the md5 just to guarantee that you won't overwrite duplicates.
> Without the md5 (or another similar mechanism), Accumulo would
> overwrite any previously stored values with the exact same [row,
> colfam, colqual, colvis].
>
> Iterating in temporal order would just be a simple full table scan.
>
> I hope this helps.
>
> --Jason
>
> On Tue, Mar 6, 2012 at 12:15 PM, John R. Frank <jrf@mit.edu> wrote:
>> Accumulo Experts,
>>
>> Is there an example of working with a time-ordered stream in Accumulo?
>>
>>
>> Given:
>>        ~500M JSON records each about 30kb
>>        each record has a timestamp field (seconds since the epoch)
>>
>>
>> Goal:
>>        iterate over all records in temporal order
>>        run some function on this simulated stream
>>
>>
>> Thanks for any pointers or advice!
>>
>> John
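
The key layout Jason describes above can be sketched in Python roughly
as follows. This is only an illustration of building the key parts, not
Accumulo client code. The function names, the "timestamp" field name,
the use of UTC, and the "_000000" counter suffix format are all
assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def row_from_epoch(epoch_seconds):
    """Zero-padded ${year}${month}${day}${hour}${minute}${second} (UTC assumed)."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S")


def make_key(record, seen_counts):
    """Return ((row, colfam, colqual), value) for one JSON record.

    seen_counts is a dict used as the one-up counter, so that records
    that are exactly the same still get distinct column qualifiers.
    """
    value = json.dumps(record, sort_keys=True)
    row = row_from_epoch(record["timestamp"])  # field name is an assumption
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    n = seen_counts.get((row, digest), 0)
    seen_counts[(row, digest)] = n + 1
    colqual = f"{digest}_{n:06d}"  # suffix format is an assumption
    return (row, "record", colqual), value


# Epoch values with different digit counts sort wrongly as plain strings
# but correctly once converted with the fixed-width template.
epochs = [999999999, 1000000000]
plain = sorted(str(e) for e in epochs)
rows = [row_from_epoch(e) for e in epochs]

# Two identical records end up under two different keys.
counts = {}
rec = {"timestamp": 1000000000, "msg": "hello"}
key_a, _ = make_key(rec, counts)
key_b, _ = make_key(rec, counts)
```

The counter here lives in memory on a single writer; with several
ingest processes it would need some other source of uniqueness.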
