accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Accumulo indexing social media data
Date Sun, 05 Jul 2015 20:34:03 GMT
If your primary search criterion is a single term, a term-based 
reverse index is going to serve you much better than a 
document-partitioned index.

Document-partitioned indexes can better support concurrency since you 
have some amount of hash-partitioning involved in the partition ID 
(sometimes you can include other data in the partition ID to further 
restrict the "search space"). However, you always need to query every 
partition to get an answer for a single term, so you'll have much higher 
latency using this approach than with a term-partitioned index.
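
To make the difference concrete, here is a minimal in-memory sketch (plain 
Python, not Accumulo client code). The partition count, the CRC32 hash, and 
all function names are assumptions chosen for illustration. It counts how 
many lookups a single-term query needs under each layout.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value, for illustration only


def partition_id(doc_id: str) -> int:
    # Stable hash, so a document always lands in the same partition.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS


def build_term_index(docs):
    # Term-based layout: term -> set of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def build_doc_partitioned_index(docs):
    # Document-partitioned layout: partition -> term -> set of document IDs.
    index = {p: defaultdict(set) for p in range(NUM_PARTITIONS)}
    for doc_id, text in docs.items():
        p = partition_id(doc_id)
        for term in text.lower().split():
            index[p][term].add(doc_id)
    return index


def query_term_index(index, term):
    # A single lookup answers the query.
    return set(index.get(term, set())), 1


def query_doc_partitioned(index, term):
    # Every partition has to be consulted, whether or not it holds the term.
    hits, lookups = set(), 0
    for per_partition in index.values():
        lookups += 1
        hits |= per_partition.get(term, set())
    return hits, lookups
```

Both layouts return the same documents; the document-partitioned query 
simply pays for one lookup per partition.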

To answer your question about choosing a partition ID, it typically 
revolves around the number of TabletServers you want a single query to 
parallelize on. For example, if you expect ~10 queries to be 
running at one time, you don't want each query to communicate with 90% 
of your TabletServers. If you only run one or two queries at a time, you 
would want to talk to as many TabletServers as you can.
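
As a rough starting point only, that trade-off can be written down as a 
back-of-the-envelope heuristic. The formula below (TabletServers divided by 
expected concurrent queries) and the function name are my own assumptions, 
not an Accumulo rule; treat the result as a first guess to test against.

```python
def choose_num_partitions(num_tservers: int, concurrent_queries: int) -> int:
    # Assumed heuristic: give each concurrent query a roughly equal slice of
    # the cluster, but never fewer than one partition.
    if num_tservers < 1 or concurrent_queries < 1:
        raise ValueError("both arguments must be at least 1")
    return max(1, num_tservers // concurrent_queries)
```

With one query at a time this uses every TabletServer; with many concurrent 
queries it shrinks the fan-out of each one.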

To further complicate things, you can also try to apply a partition ID 
as a suffix on term-based indexes to work around queries on extremely 
common terms such as "the" or "and". With a simple term-based index, all 
records for such a term would be contained in a single Tablet on a 
single TabletServer. Whether you need this ultimately comes down to 
the amount and distribution of data you're storing.
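
A minimal sketch of what such a suffixed row key could look like, in plain 
Python. The NUL separator, the four-digit zero-padded suffix, the CRC32 hash, 
and the function names are all assumptions made for illustration; the actual 
key layout is up to you.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    # Stable hash of the document ID picks the shard for its index entries.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_row(term: str, doc_id: str, num_shards: int) -> str:
    # Row key: <term> NUL <zero-padded shard>. Zero padding keeps the
    # shards of one term in sorted order within the table.
    return f"{term}\x00{shard_for(doc_id, num_shards):04d}"


def rows_for_term(term: str, num_shards: int) -> list:
    # A query for one term now has to look at every shard of that term.
    return [f"{term}\x00{s:04d}" for s in range(num_shards)]
```

The cost is the same as before in miniature: a lookup for a term becomes 
`num_shards` lookups, in exchange for spreading a hot term across several 
rows that can be split onto different Tablets.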

Come back with more information, and we can give some more 
recommendations. Honestly, you probably won't get this right the first 
time, but this is expected :). What you can do is:

* Set some expectations on performance
* Do some simple math on actual data (estimate parallelism, latency, etc)
* Prototype and test it

mohit.kaushik wrote:
> Hi,
>
> We have a social media application currently using MongoDB to serve
> documents. We decided to move it to Accumulo. I am designing the
> schema and indexing approach but am having some difficulties managing
> indexes, and I have a few concerns about generating UUIDs in Accumulo.
>
> UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates
> a 12-byte UUID sorted on current time and good for a multi-user,
> multi-process environment (<time> <Mac add> <process id> <client
> counter>), which is perfect. But if I concatenate the time, MAC address,
> process ID, and client counter, the result is around 28 to 30 characters,
> which means around 60 bytes. And if I store it in reverse order so that the
> latest document shows on top, the size would be doubled (more than 120
> bytes), as described by David Medinets. Is there any way to store this
> UUID in a smaller size, or any other efficient way to generate a UUID
> reverse sorted on current time?
>
> Indexing: I need to retrieve documents from the index based on some query
> on fields. I found two approaches to indexing documents in Accumulo:
> (1) Term-based reverse indexing and
> (2) Document-partitioned indexing
>
> As Adam described in this video
> https://www.youtube.com/watch?v=Ck70G6OuGT4, if I use document-partitioned
> indexing, the layout is:
>
>  Row          <partition id>
>                /          \
>  CF         <doc>        <index>
>               |             |
>  CQ        <UUID>        <Term>
>               |             |
>            <field>       <UUID>
>               |             |
>               |          <Field>
>  Value     <value>
>
> If I just want to serve documents based on a single-term query, would it
> be better to store <term> in the column family so that I can restrict the
> scan to a single term in the CF? It would reduce the data by a good factor.
> What are the other pros and cons of this approach?
> And how should I decide on the partition_Id if I am storing tweets on a
> 3-node cluster?
>
> Regards
> Mohit Kaushik
>
