accumulo-user mailing list archives

From Christopher <>
Subject Re: Accumulo indexing social media data
Date Wed, 01 Jul 2015 16:25:10 GMT
I'm not sure I understand why the size would be doubled. If you
store it in reverse order, it's not going to take up more bytes. Are
you storing it forward *and* reverse? If so, why?
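
To illustrate the point about size, here is a minimal sketch, assuming
the time component is a millisecond timestamp stored as a fixed-width
8-byte big-endian integer. The names forward_key and reverse_key are
hypothetical helpers, not part of any Accumulo API. Subtracting the
timestamp from the maximum value flips the sort order without adding
a single byte:

```python
import struct

MAX_INT64 = 2**63 - 1  # largest value that fits in a signed 64-bit integer


def forward_key(millis: int) -> bytes:
    """Encode a timestamp so that older entries sort first."""
    return struct.pack(">q", millis)


def reverse_key(millis: int) -> bytes:
    """Encode a timestamp so that newer entries sort first."""
    return struct.pack(">q", MAX_INT64 - millis)


older = 1_435_767_910_000  # an arbitrary timestamp in milliseconds
newer = older + 1_000      # one second later

# Same width either way; only the ordering changes.
print(len(forward_key(older)), len(reverse_key(older)))
print(reverse_key(newer) < reverse_key(older))
```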

Also, forgive me for asking, but 60 bytes doesn't seem problematic to
me... that's going to be compressed on disk anyway. Why is 60 bytes
too large for your use case?

Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
a UUID that is in that same format?

Regarding serving documents based on a single term query... it seems
to me that if that is your only requirement, then a row which looks
like "<term> <UUID>" would be more appropriate, since the best way to
support single-term query is to index on that term (UUID added only to
enable rows to split).
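
A minimal sketch of that row layout, using a sorted Python list to
stand in for a sorted key-value table. The separator byte, the helper
names index_row and scan_term, and the sample ids are assumptions for
illustration only. Because rows sort by term first, a single-term
query becomes one contiguous range scan:

```python
import bisect

SEP = b"\x00"  # assumed separator; terms must not contain this byte


def index_row(term: str, doc_id: bytes) -> bytes:
    """Build a '<term> <UUID>' row key."""
    return term.encode("utf-8") + SEP + doc_id


def scan_term(sorted_rows, term: str):
    """Return the doc ids of every row whose term matches exactly."""
    prefix = term.encode("utf-8") + SEP
    stop = term.encode("utf-8") + b"\x01"  # first key past the prefix range
    start = bisect.bisect_left(sorted_rows, prefix)
    end = bisect.bisect_left(sorted_rows, stop)
    return [row[len(prefix):] for row in sorted_rows[start:end]]


rows = sorted([
    index_row("accumulo", b"id1"),
    index_row("accumulo", b"id2"),
    index_row("accumulator", b"id3"),
    index_row("mongodb", b"id1"),
])

print(scan_term(rows, "accumulo"))
```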

Christopher L Tubbs II

On Wed, Jul 1, 2015 at 3:16 AM, mohit.kaushik <> wrote:
> Hi,
> We have a social media application currently using MongoDB to serve
> documents. We decided to shift it to Accumulo. I am designing the schema
> and indexing approach but am having some difficulties managing indexes, and
> I have a few concerns about generating UUIDs in Accumulo.
> UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates a
> 12-byte UUID sorted on current time that is good for a multi-user,
> multi-process environment (<time> <MAC address> <process id>
> <client counter>), which is perfect. But if I concatenate the time, MAC
> address, process id and client counter, the result is around 28 to 30
> characters, which means around 60 bytes. And if I store it in reverse order
> so that the latest document shows on top, the size would be doubled (more
> than 120 bytes), as described by David Medinets. Is there any way to store
> this UUID in a smaller size, or any other efficient way to generate a UUID
> that is reverse sorted on current time?
> Indexing: I need to retrieve documents from an index based on queries on
> fields. I found two approaches to indexing documents in Accumulo:
> (1) Term-based reverse indexing and
> (2) Document-partitioned indexing
> As Adam described in this video
> If I use document-partitioned indexing:
> Row                    <partition id>
>                                /            \
> CF                 <doc>            <index>
>                            |                       |
> CQ                <UUID>          <Term>
>                            |                       |
>                       <field>           <UUID>
>                            |                        |
>                            |                  <Field>
> Value            <value>
> If I just want to serve documents based on a single-term query, would it be
> better to store <term> in the column family so that I can restrict the scan
> to a single term in the CF? It will reduce the data by a good factor. What
> are the other pros and cons of this approach?
> And how should I decide on the partition_Id if I am storing tweets on a
> 3-node cluster?
> Regards
> Mohit Kaushik
