accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Fuchs <>
Subject Re: Questions on intersecting iterator and partition ids
Date Mon, 13 Jul 2015 17:56:49 GMT

I have included some answers below.


On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal <> wrote:

> Dear all,
> I have the following questions on intersecting iterator and partition ids
> used in document sharded indexing:
> 1. Can we run a boolean and query using the current intersecting iterator
> on a given range of ids. These ids are a subset of the total ids stored in
> the column qualifier field as per the document sharded indexing format.
The IntersectingIterator is designed to do index intersections, which are
very similar to boolean AND queries. It does require indexes to be built in
a particular fashion. You should play around with the WikiSearch example ( to get familiar with
its use.

> If it's not possible with current iterator can I tweak the existing one?
If you are indexing documents similar to what the IntersectingIterator
expects then you should be able to get it to work for you. More generally,
any row-local logic can be implemented in an iterator. If you're not
building indexes then you might want to look at the RowFilter as a starting

>  2. Is the partitioning suggested in document sharded indexing logical or
> physical. For eg if I have 30 partition ids do I have to physically
> presplit the table based on the partition ids for the and query to run in
> the most efficient way so that I have 30 tablets in table?
You don't have to pre-split -- Accumulo will automatically split big rows
into their own tablets. However, there are some performance advantages to
pre-splitting before your tablet gets big enough to split on its own.

>  3.  Lastly,  Can anybody suggest me the number of partitions for
> document sharded indexing. What should I look for when deciding it?
You have to consider a few factors for this: (a) ingest parallelization,
for which you want approximately as many partitions as you have cores in
your cluster, (b) size of a partition when full, which you want to be under
about 20GB for compaction performance reasons, and (c) query parallelism,
for which you want no more than a small factor of the number of cores in
your cluster to reduce query latency. If you can't find a solution that
works for all of these factors then you will be forced to make trade-offs
(or do something complicated like time-based partitioning).

>  Thanks
> Vaibhav

View raw message