Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Message-ID: <560BF379.3020208@gmail.com>
Date: Wed, 30 Sep 2015 10:36:41 -0400
From: Josh Elser <josh.elser@gmail.com>
User-Agent: Postbox 3.0.11 (Macintosh/20140602)
MIME-Version: 1.0
To: user@accumulo.apache.org
Subject: Re: Document Partitioned Indexing
References: 
 <CACAwpDPVBJmqtfFmw9o4Lgc5Gm+GPUgvP_qB-St_LDLRjThWoA@mail.gmail.com>
In-Reply-To: 
 <CACAwpDPVBJmqtfFmw9o4Lgc5Gm+GPUgvP_qB-St_LDLRjThWoA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Tom D wrote:
> Hi,
>
> Have been doing a little reading about different distributed (text)
> indexing techniques and picked up on the Document Partitioned Index
> approach on Accumulo.
>
> I am interested in the use-cases people would have for indexing data in
> this way over using a distributed search service (Elastic or SolrCloud).
>
> I can think of a few reasons, but wondered if there's something more
> obvious that I'm missing?
>
> - cell (field level) access controls

If you have this as a requirement, you're in the right place :)

> - scale - I understand Accumulo will scale to thousands of nodes. I
> believe there are some limitations in Elastic / Solr at about 100 nodes.

High speed ingest and random point-lookups are big architectural 
features that Accumulo provides. I don't know enough about ES/Solr to 
say how they compare, but I can say that these fundamentals will work 
well from one to many nodes with Accumulo.

> - integration with an existing schema or index in Accumulo (not sure
> about this one and what benefits it would have over calling out to a
> search service)
>
> - you want to take advantage of other features in Accumulo, e.g.
> Combining iterators to perform some aggregation alongside your document
> partitioned index (again, can't imagine use cases here, but maybe there
> are some)

Being able to leverage some of the "native" filtering that Accumulo 
provides (e.g. locality groups/column-family filtering, server-side 
filters/iterators and combiners) results in a light-weight client. The 
I/O-heavy operations are done by Accumulo, which passes back a 
reduced/filtered view of just the data you need. That cuts both the CPU 
cycles spent in your client and the amount of data sent over the wire, 
which improves the performance of your application.
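
To make the "server does the heavy lifting" point concrete, here is a 
toy sketch of a document-partitioned index in plain Python. It is not 
Accumulo API: the shard count, the hashing and all function names are 
made up for illustration. The idea it models is that each document's 
terms live in one partition, each partition intersects its own term 
postings locally, and the client only unions the small per-partition 
results.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical partition count, chosen for illustration


def shard_for(doc_id: str) -> int:
    """Assign a document to a partition (deterministic toy hash)."""
    return sum(doc_id.encode("utf-8")) % NUM_SHARDS


def build_index(docs):
    """Build index[shard][term] -> set of doc ids.

    All terms of one document land in the same shard, which is what
    makes the index document-partitioned rather than term-partitioned.
    """
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        shard = shard_for(doc_id)
        for term in set(text.lower().split()):
            index[shard][term].add(doc_id)
    return index


def search_shard(shard_index, terms):
    """The 'server-side' step: intersect postings inside one shard."""
    postings = [shard_index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search(index, terms):
    """The 'client-side' step: union the already-filtered shard results."""
    hits = set()
    for shard_index in index.values():
        hits |= search_shard(shard_index, terms)
    return sorted(hits)


docs = {
    "doc1": "accumulo scales well",
    "doc2": "accumulo combiners aggregate",
    "doc3": "solr scales",
}
index = build_index(docs)
both_terms = search(index, ["accumulo", "scales"])
```

In a real deployment the per-shard intersection would run inside the 
tablet servers (as an iterator) instead of in client code, so only 
matching document IDs cross the wire.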

> - more control over 'messy data', e.g. partial duplicates that need
> merging at ingest

Maybe? Not requiring a fixed schema on each row is definitely a perk of 
Accumulo, but data cleansing isn't necessarily solved by Accumulo. You 
still need to know what you put into it.

However, being able to aggregate multiple updates to a Cell/Value via 
Accumulo Combiners can be a very powerful tool that simplifies your 
ingest logic.
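
As a rough illustration of what a combiner buys you, here is a toy 
model in plain Python (again not Accumulo API; the key layout and the 
function names are assumptions for the example). Ingest just writes 
one small value per update under the same key, and the values are 
reduced to a single value when the data is read.

```python
from collections import defaultdict


def summing_combiner(values):
    """Reduce every value written under one key to a single sum."""
    return sum(values)


def scan(mutations, combiner):
    """Group written values by key, then apply the combiner per key.

    Keys are (row, column family, column qualifier) tuples here, a
    simplified stand-in for a full key.
    """
    grouped = defaultdict(list)
    for key, value in mutations:
        grouped[key].append(value)
    return {key: combiner(vals) for key, vals in sorted(grouped.items())}


# Ingest never reads the current total; it only appends increments.
mutations = [
    (("doc1", "stats", "views"), 1),
    (("doc1", "stats", "views"), 4),
    (("doc2", "stats", "views"), 2),
    (("doc1", "stats", "views"), 1),
]
totals = scan(mutations, summing_combiner)
```

The point of the design is that the writer does no read-modify-write: 
it emits increments blindly and the aggregation happens on the server.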

> Are there others? Be interesting to hear if people use this indexing
> strategy.

It's definitely a common indexing strategy, and you've identified a lot 
of the perks that Accumulo provides. The specific requirements of your 
application will determine exactly how you leverage these features. 
Let us know and we can give some pointers on how to go about this :)

> Many thanks.
>
>