accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Accumulo indexing social media data
Date Mon, 06 Jul 2015 14:39:27 GMT
Sorry, I'm not that familiar with the D4M schema.

Regarding partitioning, I agree with Josh's response.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Thu, Jul 2, 2015 at 8:34 AM, mohit.kaushik <mohit.kaushik@orkash.com>
wrote:

>  Christopher,
>
> What I understood from Medinets' explanation of reverse sorting is that
> he first subtracts every character from 255 to build the reverse index and
> then appends the original UUID to that string. When I checked the D4M
> schema, it printed ??????? in front of the UUID, which I suppose are the
> characters subtracted from 255, if I have understood correctly.
>
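The reverse-sort trick described above can be sketched roughly as follows. This is only an illustration in Python, not the D4M code; the function name `complement` and the sample IDs are made up. Each byte b is replaced with 255 - b, which reverses lexicographic order for keys of equal length. For ASCII input the complemented bytes fall outside the printable range, which may be why they show up as ??????? when printed.

```python
def complement(key: bytes) -> bytes:
    """Return key with every byte b replaced by 255 - b."""
    return bytes(255 - b for b in key)

# Hypothetical fixed-width, time-ordered IDs (oldest first).
ids = [b"20150701-0001", b"20150701-0002", b"20150702-0001"]

# Sorting by the complemented key puts the newest ID first.
newest_first = sorted(ids, key=complement)

# Applying the complement twice recovers the original key.
assert complement(complement(ids[0])) == ids[0]
```

The complemented key has the same length as the original. The size only doubles if the original is appended as well, so that it can be read back without decoding.
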
> And there is no real problem with 60 bytes or 120 bytes. I just don't want
> to waste space for no benefit when it can be done in 12 or 13
> bytes. And thanks, I looked at the MongoDriver code. I suppose that its
> encoding may not suit lexicographical sorting.
>
> Can you please provide some input on deciding the partition ID?
>
> -Mohit Kaushik
>
>
> On 07/01/2015 09:55 PM, Christopher wrote:
>
> I'm not sure I understand why the size would be doubled.... if you
> store it in reverse order, it's not going to take up more bytes. Are
> you storing it forward *and* reverse? If so, why?
>
> Also, forgive me for asking, but 60 bytes doesn't seem problematic to
> me... that's going to be compressed on disk anyway. Why is 60 bytes
> too large for your use case?
>
> Also, if the MongoDB 12 byte UUID was sufficient, why aren't you using
> a UUID that is in that same format?
>
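As a rough sketch of the two points above: a 12-byte binary ID in the same layout as MongoDB's ObjectId (4-byte timestamp, 3-byte machine ID, 2-byte process ID, 3-byte counter) can be made to sort newest-first by inverting the timestamp, with no change in size. The function name `reverse_time_id` and all field values are hypothetical, and this is not the MongoDB driver's code.

```python
import struct
import time

MAX_U32 = 0xFFFFFFFF

def reverse_time_id(machine: bytes, pid: int, counter: int, now=None) -> bytes:
    """Build a 12-byte ID whose byte order sorts newest-first."""
    seconds = int(time.time() if now is None else now)
    inverted = MAX_U32 - seconds  # a later time gives a smaller value
    return (
        struct.pack(">I", inverted)                # 4 bytes, big-endian so bytes sort numerically
        + machine[:3].ljust(3, b"\x00")            # 3 bytes
        + struct.pack(">H", pid & 0xFFFF)          # 2 bytes
        + (counter & 0xFFFFFF).to_bytes(3, "big")  # 3 bytes
    )

# Two IDs created one day apart (timestamps are arbitrary example values).
older = reverse_time_id(b"\xaa\xbb\xcc", 4242, 1, now=1435737600)
newer = reverse_time_id(b"\xaa\xbb\xcc", 4242, 2, now=1435824000)
```

Here only the timestamp is inverted, so IDs created within the same second still sort by machine, process, and counter in ascending order.
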
> Regarding serving documents based on a single term query... it seems
> to me that if that is your only requirement, then a row which looks
> like "<term> <UUID>" would be more appropriate, since the best way to
> support single-term query is to index on that term (UUID added only to
> enable rows to split).
>
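A minimal sketch of the "<term> <UUID>" row layout, assuming a null byte as the separator between term and UUID. A sorted Python list stands in for Accumulo's sorted key space and `bisect` simulates a range scan; none of this is the Accumulo client API, and the names `index_row` and `scan_term` are made up for illustration.

```python
import bisect

SEP = "\x00"  # assumed separator; must not occur inside a term

def index_row(term: str, uuid: str) -> str:
    """Build the row ID "<term><SEP><UUID>"."""
    return term + SEP + uuid

def scan_term(rows, term):
    """Return the UUIDs of all rows for term; rows must be sorted."""
    start = term + SEP
    stop = term + chr(ord(SEP) + 1)  # first key after every "<term><SEP>..." row
    lo = bisect.bisect_left(rows, start)
    hi = bisect.bisect_left(rows, stop)
    return [row[len(start):] for row in rows[lo:hi]]

rows = sorted([
    index_row("accumulo", "id-003"),
    index_row("accumulo", "id-001"),
    index_row("accumulator", "id-002"),
    index_row("mongodb", "id-004"),
])
hits = scan_term(rows, "accumulo")
```

The separator keeps a term such as "accumulo" from matching rows for "accumulator", and the UUID suffix gives every entry its own row so a popular term can be split across tablets.
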
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Wed, Jul 1, 2015 at 3:16 AM, mohit.kaushik <mohit.kaushik@orkash.com>
> wrote:
>
>  Hi,
>
> We have a social media application that currently uses MongoDB to serve
> documents. We have decided to move it to Accumulo. I am designing the schema
> and indexing approach, but I am having some difficulty managing indexes and
> have a few concerns about generating UUIDs in Accumulo.
>
> UUID: Data is being indexed in MongoDB 24 hours a day. MongoDB generates a
> 12-byte UUID that sorts on the current time and works well in a multi-user,
> multi-process environment (<time> <MAC address> <process id> <client counter>),
> which is perfect. But if I concatenate the time, MAC address, process ID, and
> client counter, the result is around 28 to 30 characters, which means around
> 60 bytes. And if I store it in reverse order so that the latest document
> shows on top, the size would be doubled (more than 120 bytes), as described
> by David Medinets. Is there any way to store this UUID in less space, or any
> other efficient way to generate a UUID that is reverse sorted on current time?
>
> Indexing: I need to retrieve documents from an index based on queries on
> fields. I found two approaches to indexing documents in Accumulo:
> (1) Term-based reverse indexing and
> (2) Document-partitioned indexing
>
> Adam describes these in this video: https://www.youtube.com/watch?v=Ck70G6OuGT4
> If I use document-partitioned indexing:
>
> Row      <partition id>
>            /          \
> CF      <doc>        <index>
>           |             |
> CQ      <UUID>       <Term>
>           |             |
>         <field>      <UUID>
>           |             |
>           |          <Field>
>           |
> Value   <value>
>
> If I just want to serve documents based on a single-term query, would it be
> better to store <term> in the column family so that I can limit on a single
> term in the CF? It would reduce the data by a good factor. What would be the
> other pros and cons of this approach?
> And how should I decide on the partition ID if I am storing tweets on a
> 3-node cluster?
>
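One possible way to read the layout and the partition ID question above, as a hedged sketch: hash the document UUID modulo a fixed number of partitions, and write the document entries and their index entries under that same row. `NUM_PARTITIONS` is an assumed value for illustration, not a recommendation from this thread, and the tuples stand in for Accumulo key/value entries rather than using the client API.

```python
import zlib

NUM_PARTITIONS = 12  # assumed value for illustration, e.g. a few per tablet server

def partition_id(uuid: str) -> str:
    """Map a document UUID to a zero-padded partition row ID."""
    return "%04d" % (zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS)

def entries(uuid: str, doc: dict):
    """Yield (row, cf, cq, value) tuples for one document."""
    row = partition_id(uuid)
    for field, value in doc.items():
        # Document entry: doc / <UUID>\0<field> -> value
        yield (row, "doc", uuid + "\x00" + field, value)
        for term in value.lower().split():
            # Index entry: index / <term>\0<UUID>\0<field> -> empty value
            yield (row, "index", "\x00".join((term, uuid, field)), "")

out = list(entries("id-001", {"text": "Hello Accumulo"}))
```

Because a document and its index entries share one row, a term lookup has to visit every partition, but each partition can resolve the term to its documents locally.
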
> Regards
> Mohit Kaushik
>
