accumulo-user mailing list archives

From "John R. Frank" <...@mit.edu>
Subject Re: sorting in Accumulo
Date Fri, 09 Mar 2012 14:52:11 GMT
On Tue, Mar 6, 2012 at 1:06 PM, Jason Trost <jason.trost@gmail.com> wrote:
> You could ingest this data into accumulo using the following "schema"
>
> row:     timestamp
> colfam:  "record"
> colqual: md5(JSON)
> value:   JSON record


We do have records with the same timestamp, so yes, collisions occur at 
that level.

We also have a "stream_id" field, which is a unique ID constructed from 
the integer timestamp and the md5 of the abs_url from which the content 
was fetched -- for our corpus that is sufficiently unique that collisions 
occur with essentially zero probability.


stream_id = 123456789-AAAABBBBCCCCDDDDEEEEFFFF0000
             ^^^^^^^^^
             timestamp

I could left-pad the timestamp in the stream_id with zeros to ensure that 
the integer is always fixed length.  If we do that, do we need colqual?
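
A minimal sketch of the padding idea, in plain Python rather than Accumulo 
client code. It assumes row IDs are compared lexicographically as bytes, 
that timestamps are non-negative epoch seconds, and that 10 digits is wide 
enough. The helper name pad_stream_id and the constant TS_WIDTH are 
hypothetical, chosen only for illustration.

```python
# Hypothetical helper: left-pad the integer timestamp in a stream_id so that
# lexicographic order of the row key matches numeric timestamp order.
TS_WIDTH = 10  # assumption: 10 digits is enough for epoch seconds


def pad_stream_id(stream_id: str, width: int = TS_WIDTH) -> str:
    # Split only on the first '-': the timestamp precedes it, the md5 follows.
    ts, digest = stream_id.split("-", 1)
    if len(ts) > width:
        raise ValueError(f"timestamp {ts!r} is longer than {width} digits")
    return f"{int(ts):0{width}d}-{digest}"


ids = [
    "1000000000-AAAABBBBCCCCDDDDEEEEFFFF0000",
    "123456789-AAAABBBBCCCCDDDDEEEEFFFF0000",
    "999999999-0000FFFFEEEEDDDDCCCCBBBBAAAA",
]

# Without padding, the 10-digit timestamp sorts before the 9-digit ones
# even though it is the latest of the three.
unpadded_order = sorted(ids)

# With padding, string order and numeric timestamp order agree.
padded_order = sorted(pad_stream_id(s) for s in ids)
```

The unpadded list shows the problem: a plain string sort places 
"1000000000-..." ahead of "123456789-...", which is why the fixed width 
matters if the row is meant to sort in temporal order.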

Sounds like this schema would be sufficient for sorting in temporal order, 
with no meaningful order within a given second -- that would be fine for 
our purposes.


row:     stream_id
colfam:  "record"
value:   JSON record
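
To illustrate how a row keyed this way could be read back by time window, 
here is a rough Python sketch that emulates a sorted table with a sorted 
list and bisect. It is not Accumulo client code; in Accumulo the analogue 
would presumably be a Range over row IDs passed to a Scanner. It assumes 
the timestamp part of the stream_id is zero-padded to a fixed width, and 
time_range_rows and TS_WIDTH are hypothetical names.

```python
import bisect

TS_WIDTH = 10  # assumption: timestamps are zero-padded to 10 digits


def time_range_rows(start_ts: int, end_ts: int, width: int = TS_WIDTH):
    """Row bounds covering start_ts <= timestamp < end_ts (end exclusive)."""
    return f"{start_ts:0{width}d}", f"{end_ts:0{width}d}"


# Stand-in for the sorted row keys of a table (sample values only).
rows = sorted([
    "0123456790-AAAABBBBCCCCDDDDEEEEFFFF0000",
    "0123456789-AAAABBBBCCCCDDDDEEEEFFFF0000",
    "0123456788-FFFFEEEEDDDDCCCCBBBBAAAA0000",
    "0123456789-0000FFFFEEEEDDDDCCCCBBBBAAAA",
])

# All records fetched during the single second 123456789.
lo, hi = time_range_rows(123456789, 123456790)
hits = rows[bisect.bisect_left(rows, lo):bisect.bisect_left(rows, hi)]
```

Here hits holds the two rows for second 123456789. Their relative order is 
determined by the md5 suffix, which matches the "no meaningful order within 
a given second" behavior described above.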


Thanks for all the responses!

jrf

