accumulo-user mailing list archives

From "mohit.kaushik" <mohit.kaus...@orkash.com>
Subject Re: Accumulo indexing social media data
Date Wed, 08 Jul 2015 12:12:20 GMT
Thanks Josh, I am testing the approach. I have one more consideration: 
conditional mutations. I have stored the fields in the CQ according to 
the following schema.

Row               <partition id>
                  /           \
CF            <doc>           <index>
                |                |
CQ           <UUID>           <Term>
                |                |
             <field>          <UUID>
                |                |
                |             <Field>
Value        <value>


Documents have a url field. If the url already exists, I want the 
mutation to be skipped rather than added. But I don't know the 
partition ID in advance. How can I apply conditional mutations here to 
check whether the url exists?

-Mohit kaushik
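
One possible direction for the url check, offered as a sketch rather than a confirmed answer: a conditional mutation tests columns inside a single row, so the row has to be computable from the url itself. If the partition ID were derived from a hash of the url, the writer could recompute the row and test for the absence of a marker column. The class name, the partition count of 16, and the idea of a dedicated marker column are all assumptions that do not appear in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: derives the partition ID (row) from the url so that
// the same url always lands in the same row and can be checked there.
final class UrlPartitioner {

    /** Returns a zero-padded partition ID in [0, numPartitions). */
    static String partitionFor(String url, int numPartitions) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into an int.
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                  | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            // floorMod keeps the result non-negative even when h is negative.
            int p = Math.floorMod(h, numPartitions);
            return String.format(Locale.ROOT, "%04d", p);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The row a writer would target when checking for this url.
        System.out.println(partitionFor("http://example.com/post/1", 16));
    }
}
```

The trade-off is that documents are then distributed by url hash instead of by whatever scheme currently assigns the partition ID, so this only fits if that is acceptable for the rest of the schema.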

On 07/06/2015 02:04 AM, Josh Elser wrote:
> If your primary search criterion is a single term, a term-based 
> reverse index is going to serve you much better than a 
> document-partitioned index.
>
> Document partitioned indexes can better support concurrency since you 
> have some amount of hash-partitioning involved in the partition ID 
> (sometimes you can include other data in the partition ID to further 
> restrict the "search space"). However, you always need to query each 
> partition to get an answer for a single term. You'll have much higher 
> latency using this approach than a term-partitioned index.
>
> To answer your question about choosing a partition ID, it typically 
> revolves around the number of TabletServers you want a single query to 
> parallelize on. For example, if you expect to have ~10 queries 
> running at one time, you don't want each query to communicate with 90% 
> of your TabletServers. If you only run one or two queries at a time, 
> you would want to talk to as many TabletServers as you can.
>
> To further complicate things, you can also try to apply a partition ID 
> as a suffix on term-based indexes to work around queries for terms such 
> as "the" or "and", which tend to be extremely common. With a simple 
> term-based index, all records for such a term would be contained in a 
> single Tablet on a single TabletServer. This ultimately comes down to 
> the amount and distribution of data you're storing.
>
> Come back with more information, and we can give some more 
> recommendations. Honestly, you probably won't get this right the first 
> time, but this is expected :). What you can do is:
>
> * Set some expectations on performance
> * Do some simple math on actual data (estimate parallelism, latency, etc)
> * Prototype and test it
>
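
A minimal sketch of the hash-partitioning point above, assuming the partition ID is simply a hash of the document ID modulo a fixed partition count. The class and method names are hypothetical. It illustrates why the partition count bounds query fan-out: a single-term lookup in a document-partitioned index has to visit every partition row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for a document-partitioned layout.
final class DocPartitions {

    /** Partition (row) that a document is written to. */
    static String partitionOf(String docId, int numPartitions) {
        // String.hashCode is specified by the language, so this is stable.
        int p = Math.floorMod(docId.hashCode(), numPartitions);
        return String.format(Locale.ROOT, "%04d", p);
    }

    /** Rows a single-term query must scan: one per partition. */
    static List<String> rowsToScan(int numPartitions) {
        List<String> rows = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            rows.add(String.format(Locale.ROOT, "%04d", p));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(partitionOf("doc-1", 3));
        System.out.println(rowsToScan(3));
    }
}
```

With few partitions each query touches few rows, which suits many concurrent queries; with many partitions a single query can spread across more TabletServers.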
> mohit.kaushik wrote:
>> Hi,
>>
>> We have a social media application currently using MongoDB to serve
>> documents. We decided to shift it to Accumulo. I am designing the
>> schema and indexing approach but am having some difficulty managing
>> indexes, and I have a few concerns about generating UUIDs in Accumulo.
>>
>> UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates
>> a 12-byte UUID that sorts on current time and works well in a multi-user,
>> multi-process environment (<time> <MAC address> <process id> <client
>> counter>), which is perfect. But if I concatenate the time, MAC address,
>> process id, and client counter, the result is around 28 to 30 characters,
>> which means around 60 bytes. And if I store it in reverse order so that the
>> latest document shows on top, the size would be doubled (more than 120
>> bytes), as described by David Medinets. Is there any way to store this
>> UUID in a smaller size, or any other efficient way to generate a UUID
>> that is reverse-sorted on current time?
>>
>> Indexing: I need to retrieve documents from the index based on queries
>> on fields. I found two approaches to indexing documents in Accumulo:
>> (1) Term-based reverse indexing and
>> (2) Document-partitioned indexing
>>
>> As Adam described in this video,
>> https://www.youtube.com/watch?v=Ck70G6OuGT4, if I use document
>> partitioning, the schema is:
>>
>> Row               <partition id>
>>                   /           \
>> CF            <doc>           <index>
>>                 |                |
>> CQ           <UUID>           <Term>
>>                 |                |
>>              <field>          <UUID>
>>                 |                |
>>                 |             <Field>
>> Value        <value>
>>
>> If I just want to serve documents based on single-term queries, would it
>> be better to store <term> in the column family so that I can restrict on
>> a single term in the CF? It will reduce the data by a good factor. What
>> are the other pros and cons of this approach?
>> And how should I decide on the partition ID if I am storing tweets on a
>> 3-node cluster?
>>
>> Regards
>> Mohit Kaushik
>>
>
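
On the UUID size question quoted above, a minimal sketch of one way to keep a reverse-time-sorted ID at 12 bytes. Accumulo key components are byte arrays, so the ID does not have to be stored as text. The layout here (8 bytes of `Long.MAX_VALUE - millis` followed by a 4-byte suffix) and all names are assumptions for illustration; the suffix stands in for whatever machine, process, or counter bits are needed for uniqueness.

```java
import java.nio.ByteBuffer;

// Hypothetical 12-byte ID whose byte-wise sort order is newest-first.
final class ReverseTimeId {

    /** Builds the ID; epochMillis is assumed to be non-negative. */
    static byte[] newId(long epochMillis, int suffix) {
        // ByteBuffer is big-endian by default, so the most significant
        // byte of the inverted timestamp comes first.
        return ByteBuffer.allocate(12)
                .putLong(Long.MAX_VALUE - epochMillis)
                .putInt(suffix)
                .array();
    }

    /** Recovers the original timestamp from an ID. */
    static long timestampOf(byte[] id) {
        return Long.MAX_VALUE - ByteBuffer.wrap(id).getLong();
    }

    /** Lexicographic comparison treating bytes as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] older = newId(1_000L, 0);
        byte[] newer = newId(2_000L, 0);
        // Negative result: the newer ID sorts before the older one.
        System.out.println(compareUnsigned(newer, older) < 0);
    }
}
```

Because the inverted timestamp is never negative, its big-endian bytes compare the same way as the number itself, so later documents sort first without any growth in size.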


-- 

*Mohit Kaushik*
Software Engineer
A Square,Plot No. 278, Udyog Vihar, Phase 2, Gurgaon 122016, India
*Tel:*+91 (124) 4969352 | *Fax:*+91 (124) 4033553



