accumulo-user mailing list archives

From "mohit.kaushik" <>
Subject Accumulo indexing social media data
Date Wed, 01 Jul 2015 07:16:36 GMT

We have a social media application that currently uses MongoDB to serve
documents, and we have decided to move it to Accumulo. I am designing the
schema and indexing approach, but I am having some difficulty managing
indexes and have a few concerns about generating UUIDs in Accumulo.

UUID: The data is indexed in MongoDB 24 hours a day. MongoDB generates
a 12-byte UUID that sorts on the current time and works well in a
multi-user, multi-process environment
(<time> <MAC address> <process id> <client counter>), which is perfect.
But if I concatenate the time, MAC address, process id, and client
counter, the result is around 28 to 30 characters, which means around
60 bytes. And if I store it in reverse order so that the latest document
shows up first, the size would double (more than 120 bytes), as
described by David Medinets. Is there any way to store this UUID in
fewer bytes, or any other efficient way to generate a UUID that is
reverse-sorted on the current time?
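
A minimal sketch of one possible answer to the size concern, assuming the
id is stored as raw bytes rather than as text and that keys sort
lexicographically by byte. It mirrors the four fields named above in
12 bytes and subtracts the timestamp from the 32-bit maximum so that newer
ids sort first. The function name, the field widths (4 + 3 + 2 + 3 bytes),
and the truncation of the MAC address and process id are assumptions made
for illustration, not something MongoDB or Accumulo prescribes.

```python
import os
import struct
import time
import uuid

MAX_U32 = 0xFFFFFFFF  # largest value that fits in 4 bytes


def reverse_time_id(counter, now=None, mac=None, pid=None):
    """Build a 12-byte id whose raw bytes sort newest-first.

    Hypothetical layout: 4-byte reversed seconds, 3-byte MAC suffix,
    2-byte process id, 3-byte per-client counter.
    """
    seconds = int(time.time() if now is None else now)
    mac = uuid.getnode() if mac is None else mac
    pid = os.getpid() if pid is None else pid
    # Larger timestamps become smaller numbers, so they sort earlier.
    reversed_seconds = MAX_U32 - (seconds & MAX_U32)
    return (
        struct.pack(">I", reversed_seconds)        # big-endian keeps byte order == numeric order
        + (mac & 0xFFFFFF).to_bytes(3, "big")      # low 3 bytes of the MAC address
        + (pid & 0xFFFF).to_bytes(2, "big")        # low 2 bytes of the process id
        + (counter & 0xFFFFFF).to_bytes(3, "big")  # counter wraps at 2**24
    )
```

Ids created within the same second are ordered by the machine, process,
and counter bytes, not by exact arrival time, so "latest first" holds only
at one-second granularity in this sketch.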

Indexing: I need to retrieve documents from an index based on queries
over fields. I found two approaches to indexing documents in Accumulo:
(1) Term-based reverse indexing, and
(2) Document-partitioned indexing

As Adam described in this video, if I use document-partitioned
indexing, the layout looks like this:

Row              <partition id>
                 /            \
CF           <doc>          <index>
               |               |
CQ          <UUID>          <Term>
               |               |
            <field>         <UUID>
               |               |
               |            <Field>
Value       <value>
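
To make the layout above concrete, here is a rough sketch that turns one
document into (row, column family, column qualifier, value) tuples. It is
plain Python and does not call Accumulo. The function name, the null byte
used to join the parts of the qualifier, and the lowercase whitespace
tokenizer are all assumptions made for illustration.

```python
def doc_partition_entries(partition_id, doc_uuid, fields):
    """Return sorted (row, cf, cq, value) tuples for one document.

    fields maps a field name to its text value. Every entry shares the
    same row (the partition id), so a document and its index entries
    end up in the same partition.
    """
    entries = set()
    for field, value in fields.items():
        # Document branch: CF "doc", CQ <UUID>\0<field>, value = field text.
        entries.add((partition_id, "doc", f"{doc_uuid}\x00{field}", value))
        # Index branch: CF "index", CQ <term>\0<UUID>\0<field>, empty value.
        for term in value.lower().split():
            entries.add(
                (partition_id, "index", f"{term}\x00{doc_uuid}\x00{field}", "")
            )
    return sorted(entries)
```

For example, a document with a single field "text" holding two words
yields one "doc" entry and two "index" entries, all under the same row.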

If I just want to serve documents based on a single-term query, would it
be better to store <term> in the column family so that I can restrict
the scan to a single term in the CF? It would reduce the data read by a
good factor. What other pros and cons does this approach have?
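
The two placements of <term> can be compared with a small sketch. It only
models the keys as tuples and filters them in memory; it does not measure
anything about Accumulo itself. The helper names and the "index" prefix
placed in front of the term (so that a term cannot collide with the "doc"
family) are assumptions made for illustration.

```python
def index_entry_term_in_cq(partition_id, term, doc_uuid, field):
    """Term stays in the qualifier: CF "index", CQ <term>\\0<UUID>\\0<field>."""
    return (partition_id, "index", f"{term}\x00{doc_uuid}\x00{field}")


def index_entry_term_in_cf(partition_id, term, doc_uuid, field):
    """Term moves into the family: CF index\\0<term>, CQ <UUID>\\0<field>."""
    return (partition_id, f"index\x00{term}", f"{doc_uuid}\x00{field}")


def lookup_by_cf(entries, partition_id, term):
    """Single-term lookup as an exact match on (row, column family)."""
    cf = f"index\x00{term}"
    return [e for e in entries if e[0] == partition_id and e[1] == cf]
```

With the term in the column family, a single-term lookup becomes an exact
match on row and family, which corresponds to restricting a scanner with
fetchColumnFamily. Whether that is measurably cheaper than a range over a
qualifier prefix is not something this sketch shows.
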
And how should I decide on the partition id if I am storing tweets on a
3-node cluster?
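
One possible way to pick a partition id, sketched under stated
assumptions: hash the document id and take it modulo a fixed number of
partitions, so documents spread evenly and the same id always maps to the
same partition. The function name, the default of 12 partitions (a few per
node on a 3-node cluster), and the zero-padded 4-digit format are
assumptions made for illustration, not a recommendation from the Accumulo
project.

```python
import zlib


def partition_id(doc_uuid: bytes, num_partitions: int = 12) -> str:
    """Map a document id to a stable, zero-padded partition id.

    zlib.crc32 is used because it gives the same result on every run
    and every machine, unlike Python's built-in hash().
    """
    bucket = zlib.crc32(doc_uuid) % num_partitions
    # Fixed width keeps partition ids in numeric order when sorted as text.
    return f"{bucket:04d}"
```

Changing num_partitions later would move existing documents to different
partitions, so the count is best fixed before data is loaded.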

Mohit Kaushik
