From: Adam Fuchs
To: user@accumulo.apache.org
Date: Wed, 30 Sep 2015 11:27:10 -0400
Subject: Re: Document Partitioned Indexing

Hi Tom,

Sqrrl uses a document-distributed indexing strategy extensively. On top of the reasons you mentioned, we also like the ability to explicitly structure our index entries in both information content and sort order. This gives us the ability to do interesting things like build custom indexes and do joins between graph indexes and term indexes.

Eventually, I'd like to see Accumulo build out explicit support for this type of indexing in the core as an embedded secondary indexing capability. That would solve several of the challenges around compatibility with other Accumulo features and usage patterns.

Cheers,
Adam

On Wed, Sep 30, 2015 at 3:48 AM, Tom D wrote:
> Hi,
>
> Have been doing a little reading about different distributed (text)
> indexing techniques and picked up on the Document Partitioned Index
> approach on Accumulo.
>
> I am interested in the use-cases people would have for indexing data in
> this way over using a distributed search service (Elastic or SolrCloud).
>
> I can think of a few reasons, but wondered if there's something more
> obvious that I'm missing?
>
> - cell (field level) access controls
>
> - scale - I understand Accumulo will scale to thousands of nodes. I
> believe there are some limitations in Elastic / Solr at about 100 nodes.
>
> - integration with an existing schema or index in Accumulo (not sure about
> this one and what benefits it would have over calling out to a search
> service)
>
> - you want to take advantage of other features in Accumulo, e.g. Combining
> iterators to perform some aggregation alongside your document partitioned
> index (again, can't imagine use cases here, but maybe there are some)
>
> - more control over 'messy data', e.g partial duplicates that need merging
> at ingest
>
> Are there others? Be interesting to hear if people use this indexing
> strategy.
>
> Many thanks.
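
For readers new to the idea, here is a rough sketch of what "document partitioned" means in a sorted key-value store, and why control over entry content and sort order matters. It is not Sqrrl's schema and it does not use the Accumulo API: a sorted Python list stands in for a table, and the (row, term, doc id) key layout, the shard count, and every function name are assumptions made for illustration only. Each document's index entries share one partition row, so multi-term intersections happen inside a partition and only the matching document ids are merged across partitions.

```python
import bisect
import zlib

NUM_SHARDS = 4  # assumed partition count, chosen only for illustration


def shard_for(doc_id):
    """Pick a partition row from a stable hash of the document id."""
    return "shard%02d" % (zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS)


def index_entries(doc_id, text):
    """Build (row, term, doc_id) keys; every entry for a document shares its row."""
    row = shard_for(doc_id)
    return [(row, term, doc_id) for term in sorted(set(text.lower().split()))]


def build_table(docs):
    """Stand-in for a sorted key-value table: one globally sorted list of keys."""
    return sorted(
        entry for doc_id, text in docs.items() for entry in index_entries(doc_id, text)
    )


def postings(table, row, term):
    """Range-scan the keys that start with (row, term) and collect doc ids."""
    start = bisect.bisect_left(table, (row, term, ""))
    found = set()
    for r, t, doc_id in table[start:]:
        if (r, t) != (row, term):
            break  # left the (row, term) prefix, so the scan is done
        found.add(doc_id)
    return found


def search_all(table, terms):
    """Intersect postings inside each partition, then union across partitions."""
    hits = set()
    for row in sorted({row for row, _, _ in table}):
        per_term = [postings(table, row, term.lower()) for term in terms]
        if per_term:
            hits |= set.intersection(*per_term)
    return hits
```

Because the key order is (row, term, doc id), all postings for a term within one partition are contiguous, which is what makes the per-partition range scan and intersection cheap. Choosing a different key order or packing extra information into the key is the kind of explicit structuring mentioned in the reply.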