accumulo-user mailing list archives

From "mohit.kaushik" <mohit.kaus...@orkash.com>
Subject Re: Accumulo indexing social media data
Date Thu, 02 Jul 2015 12:34:08 GMT
Christopher,

What I understood from Medinets' explanation of reverse sorting is that
he first subtracts every character from 255 to build the reverse index,
and then appends the original UUID to that string. When I checked the D4M
schema, it prints ??????? in front of the UUID, which I suppose are the
characters subtracted from 255, if I have not misunderstood.
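
A minimal sketch of that reading, in Python, assuming fixed-length IDs
(the complement trick only inverts the order cleanly when all keys have
the same length). The function names and sample IDs are made up for
illustration and are not taken from Medinets' code or from D4M:

```python
def complement(key: bytes) -> bytes:
    """Map each byte b to 255 - b, which inverts byte-wise sort order."""
    return bytes(255 - b for b in key)


def reverse_sorted_key(doc_id: bytes) -> bytes:
    """Complemented copy first (drives the sort), original ID appended after it."""
    return complement(doc_id) + doc_id


# Hypothetical fixed-length IDs; a larger ID means a newer document.
ids = [b"20150701-0001", b"20150701-0002", b"20150702-0001"]

# Sorting the combined keys ascending yields the original IDs descending.
keys = sorted(reverse_sorted_key(i) for i in ids)
recovered = [k[len(k) // 2:] for k in keys]
```

Complemented ASCII characters land above byte value 127, so they are not
printable ASCII, which may be why they show up as "?" in front of the UUID.
Appending the original ID is also what doubles the key length.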

And there is no real problem with 60 bytes or 120 bytes. I just don't want
to waste space for no benefit when it can be done in 12 or 13
bytes. And thanks, I looked at the MongoDriver code. I suppose that its
encoding may not suit lexicographical sorting.
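
As a rough sketch of how a 12-byte ID could still sort newest-first: pack
the fields as big-endian binary, as the MongoDB ObjectId layout does
(4-byte time, 3-byte machine, 2-byte process id, 3-byte counter), but
store the time subtracted from the largest 32-bit value. The helper names
and sample values below are hypothetical, and this is not the MongoDB
driver's own code:

```python
import struct

MAX_U32 = 0xFFFFFFFF  # largest unsigned 32-bit value


def make_id(epoch_seconds: int, machine: bytes, pid: int, counter: int) -> bytes:
    """Pack a 12-byte ID: reversed time (4) + machine (3) + pid (2) + counter (3)."""
    if len(machine) != 3:
        raise ValueError("machine must be exactly 3 bytes")
    reversed_time = MAX_U32 - epoch_seconds  # newer time gives a smaller value
    return (
        struct.pack(">I", reversed_time)          # big-endian keeps numeric order
        + machine
        + struct.pack(">H", pid & 0xFFFF)
        + (counter & 0xFFFFFF).to_bytes(3, "big")
    )


def timestamp_of(doc_id: bytes) -> int:
    """Recover the original epoch seconds from the first 4 bytes."""
    return MAX_U32 - struct.unpack(">I", doc_id[:4])[0]


older = make_id(1435734000, b"\x01\x02\x03", 4242, 1)
newer = make_id(1435840448, b"\x01\x02\x03", 4242, 2)
```

With this layout the newer ID compares lower than the older one, so it
would come first in a lexicographically sorted table. IDs created within
the same second still sort by machine, process id, and counter in
ascending order, so the reversal is only at one-second granularity.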

Can you please provide some input on deciding the partition ID?

-Mohit Kaushik

On 07/01/2015 09:55 PM, Christopher wrote:
> I'm not sure I understand why the size would be doubled.... if you
> store it in reverse order, it's not going to take up more bytes. Are
> you storing it forward *and* reverse? If so, why?
>
> Also, forgive me for asking, but 60 bytes doesn't seem problematic to
> me... that's going to be compressed on disk anyway. Why is 60 bytes
> too large for your use case?
>
> Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
> a UUID that is in that same format?
>
> Regarding serving documents based on a single term query... it seems
> to me that if that is your only requirement, then a row which looks
> like "<term> <UUID>" would be more appropriate, since the best way to
> support single-term query is to index on that term (UUID added only to
> enable rows to split).
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Jul 1, 2015 at 3:16 AM, mohit.kaushik <mohit.kaushik@orkash.com> wrote:
>> Hi,
>>
>> We have a social media application currently using MongoDB to serve
>> documents. We decided to shift it to Accumulo. I am designing the schema
>> and indexing approach but am having some difficulties in managing indexes,
>> and have a few concerns about generating UUIDs in Accumulo.
>>
>> UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates a 12
>> byte UUID sorted on current time and good for a multi-user, multi-process
>> environment (<time> <MAC address> <process id> <client counter>), which
>> is perfect. But if I concatenate the time, MAC address, process id, and
>> client counter, these are around 28 to 30 characters, which means around
>> 60 bytes. And if I store it in reverse order so that the latest document
>> shows on top, the size would be doubled (more than 120 bytes), as described
>> by David Medinets. Is there any way to store this UUID in a smaller size,
>> or any other efficient way to generate a UUID reverse sorted on current time?
>>
>> Indexing : I need to retrieve documents from index based on some query on
>> fields. I found two approaches to index documents in Accumulo.
>> (1) Term based reverse indexing and
>> (2) Document partitioning indexing
>>
>> Adam described these in this video: https://www.youtube.com/watch?v=Ck70G6OuGT4.
>> If I use document partitioning indexing, the layout is:
>>
>> Row                    <partition id>
>>                                 /            \
>> CF                 <doc>            <index>
>>                             |                       |
>> CQ                <UUID>          <Term>
>>                             |                       |
>>                        <field>           <UUID>
>>                             |                        |
>>                             |                  <Field>
>> Value            <value>
>>
>> If I just want to serve documents based on a single-term query, would it be
>> better to store <term> in the column family so that I can limit on a single
>> term in the CF? It would reduce the data by a good factor. What are the
>> other pros and cons of this approach?
>> And how should I decide on the partition_Id if I am storing tweets on a
>> 3 node cluster?
>>
>> Regards
>> Mohit Kaushik
>>
>

