Message-ID: <560BF379.3020208@gmail.com>
Date: Wed, 30 Sep 2015 10:36:41 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: Document Partitioned Indexing

Tom D wrote:
> Hi,
>
> Have been doing a little reading about different distributed (text)
> indexing techniques and picked up on the Document Partitioned Index
> approach on Accumulo.
>
> I am interested in the use-cases people would have for indexing data in
> this way over using a distributed search service (Elastic or SolrCloud).
>
> I can think of a few reasons, but wondered if there's something more
> obvious that I'm missing?
>
> - cell (field level) access controls

If you have this as a requirement, you're in the right place :)

> - scale - I understand Accumulo will scale to thousands of nodes. I
> believe there are some limitations in Elastic / Solr at about 100 nodes.

High-speed ingest and random point lookups are big architectural
features that Accumulo provides. I don't know enough about ES/Solr to
say how they compare, but I can say that these fundamentals will work
well from one to many nodes with Accumulo.

> - integration with an existing schema or index in Accumulo (not sure
> about this one and what benefits it would have over calling out to a
> search service)
>
> - you want to take advantage of other features in Accumulo, e.g.
> Combining iterators to perform some aggregation alongside your document
> partitioned index (again, can't imagine use cases here, but maybe there
> are some)

Being able to leverage some of the "native" filtering aspects that
Accumulo provides (e.g. locality groups/column-family filtering,
server-side filters/iterators and combiners) results in a lightweight
client. The I/O-heavy operations are done by Accumulo, which passes back
a reduced/filtered view of just the data you need. That reduces the CPU
cycles for your client and the amount of data sent over the wire, which
increases the performance of your application.

> - more control over 'messy data', e.g. partial duplicates that need
> merging at ingest

Maybe? Not requiring a fixed schema on each row is definitely a perk of
Accumulo, but data cleansing isn't necessarily solved by Accumulo. You
still need to know what you put into it. However, being able to
aggregate multiple updates to a Cell/Value via Accumulo Combiners can be
a very powerful tool that simplifies your ingest logic.

> Are there others? Be interesting to hear if people use this indexing
> strategy.

It's definitely a common indexing strategy and you've identified a lot
of the perks that Accumulo provides. The specific requirements of your
application will determine exactly how you leverage these features. Let
us know and we can help give some pointers on how to go about this :)

> Many thanks.
>
>
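
For readers new to the layout discussed in this thread, here is a minimal
in-memory sketch of a document-partitioned index. It does not use the
Accumulo API. It only mimics the commonly described key layout (row =
shard, column family = term, column qualifier = document id) with nested
dictionaries. The names NUM_PARTITIONS, partition_for, index_document and
search are hypothetical, chosen for illustration, and the per-shard
intersection stands in for work that a server-side iterator would do in a
real deployment.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical shard count, chosen only for this sketch


def partition_for(doc_id: str) -> str:
    """Pick a shard (the row key) by hashing the document id."""
    return f"shard{crc32(doc_id.encode('utf-8')) % NUM_PARTITIONS:02d}"


def index_document(table, doc_id, text):
    """Store one (row=shard, family=term, qualifier=doc_id) entry per distinct term."""
    row = partition_for(doc_id)
    for term in set(text.lower().split()):
        table[row][term].add(doc_id)


def search(table, terms):
    """Intersect the terms' doc-id sets inside each shard, then merge the hits.

    Every term of a document lives in that document's shard, so the
    intersection never has to cross shards; only matching ids are returned.
    """
    wanted = [t.lower() for t in terms]
    hits = []
    for row in sorted(table):
        postings = [table[row].get(t, set()) for t in wanted]
        if postings:
            hits.extend(set.intersection(*postings))
    return sorted(hits)


# Stand-in for a sorted key/value table: shard -> term -> set of doc ids.
table = defaultdict(lambda: defaultdict(set))
index_document(table, "doc1", "accumulo scales to many nodes")
index_document(table, "doc2", "solr and elastic are search services")
index_document(table, "doc3", "accumulo iterators run on the server nodes")

print(search(table, ["accumulo", "nodes"]))  # ['doc1', 'doc3']
```

Because each shard answers the query from its own rows, the client only
merges small lists of document ids, which is the "reduced/filtered view"
benefit mentioned in the reply above.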