From: andrey prokopenko
To: solr-user@lucene.apache.org
Date: Mon, 10 Nov 2014 09:50:30 +0100
Subject: Re: on regards to Solr and NoSQL storages integration
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1137c3103bb77f05077d41b5

--001a1137c3103bb77f05077d41b5
Content-Type: text/plain; charset=UTF-8

Thanks for the useful information on DataStax internals, very interesting.

My solution stores both the primary key reference links and the data
itself in NoSQL storage, i.e. the index parts for the stored fields are
not created at all. I guess that at the time DataStax created their
solution there was no Solr Codec API available yet, hence their choice of
the update processors API.

Also, I've deliberately chosen not to use an indexing queue or log while
updating the stored fields. This keeps every insert and update to a
segment in sync with the NoSQL data, so that any errors during document
inserts and deletions are immediately propagated to the user application
for possible recovery actions or rollback operations.

So I think my solution can safely be used instead of the paid DataStax
app. According to my tests, performance suffers neither from inserts nor
from delete and merge operations executed on large Solr indexes, because
during a segment merge only the key links, which map the internal
doc_id + segment_id to the user-defined unique primary key, are updated
in NoSQL storage.

The approach I used can be repeated to integrate Solr stored fields with
any database backend, not just Oracle NoSQL.

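The key-link idea described above can be illustrated with a minimal sketch.
It is not the actual implementation and does not use the Lucene or Solr
APIs: the class KeyLinkStore, its method names, and the two in-memory maps
standing in for the NoSQL database are all hypothetical, chosen only to
show that writes fail synchronously and that a merge rewrites key links
but not row data.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the NoSQL side of the integration. */
class KeyLinkStore {
    // "segment/docId" -> user-defined primary key (the "key link")
    private final Map<String, String> keyLinks = new HashMap<>();
    // primary key -> stored field values (the actual document data)
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    private static String link(String segment, int docId) {
        return segment + "/" + docId;
    }

    /** Synchronous write: any failure is thrown to the caller, nothing is queued. */
    void addDocument(String segment, int docId, String primaryKey,
                     Map<String, String> fields) {
        if (primaryKey == null || primaryKey.isEmpty()) {
            throw new IllegalArgumentException("primary key is required");
        }
        rows.put(primaryKey, new HashMap<>(fields));
        keyLinks.put(link(segment, docId), primaryKey);
    }

    /** Synchronous delete of both the key link and the row it points to. */
    void deleteDocument(String segment, int docId) {
        String primaryKey = keyLinks.remove(link(segment, docId));
        if (primaryKey == null) {
            throw new IllegalStateException("no key link for " + link(segment, docId));
        }
        rows.remove(primaryKey);
    }

    /** Resolves stored fields through the key link; null if the doc is unknown. */
    Map<String, String> storedFields(String segment, int docId) {
        String primaryKey = keyLinks.get(link(segment, docId));
        return primaryKey == null ? null : rows.get(primaryKey);
    }

    /** Segment merge: only the key link moves, the row data is not rewritten. */
    void remapOnMerge(String oldSegment, int oldDocId,
                      String newSegment, int newDocId) {
        String primaryKey = keyLinks.remove(link(oldSegment, oldDocId));
        if (primaryKey == null) {
            throw new IllegalStateException("no key link for " + link(oldSegment, oldDocId));
        }
        keyLinks.put(link(newSegment, newDocId), primaryKey);
    }
}

class KeyLinkDemo {
    public static void main(String[] args) {
        KeyLinkStore store = new KeyLinkStore();
        store.addDocument("_0", 7, "user-42", Map.of("title", "hello"));
        store.remapOnMerge("_0", 7, "_1", 0);              // doc moved by a merge
        System.out.println(store.storedFields("_1", 0));   // prints {title=hello}
    }
}
```

Because the merge step touches one small map entry per document, its cost
in this sketch is independent of the size of the stored fields, which is
the property the performance claim above relies on.
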
On Sun, Nov 9, 2014 at 5:09 AM, Jack Krupansky wrote:

> There is no "double storage" of data - the Solr index for DataStax
> Enterprise ignores the "stored" attribute and only stores the primary key
> data to allow the Solr document to reference the Cassandra row, which is
> where the data is stored. The exception would be doc values, where the data
> does need to be kept in the index for efficient operation of Lucene and
> Solr, but that would only be done for fields such as facet fields and is
> under the complete control of the developer.
>
> DataStax Enterprise also utilizes an indexing queue so that Cassandra
> inserts and updates can occur at full speed, with indexing in a background
> thread, maximizing ingestion performance.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: andrey prokopenko
> Sent: Friday, November 7, 2014 5:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: on regards to Solr and NoSQL storages integration
>
> Thanks for the reply. I've considered DataStax, but dropped it, first due
> to the commercial model they're using and second due to the model they
> have chosen for integrating with Cassandra. In their docs (can be found
> here:
> http://www.datastax.com/docs/datastax_enterprise3.1/solutions/dse_search_load_data),
> they do not disclose the architecture and details of their integration
> solution, yet examination of the Solr configuration and handlers from
> their distribution package has revealed that they essentially let the docs
> rest both in the Solr index and in Cassandra storage. To safely propagate
> documents to Cassandra on each Solr index update, they use their own
> update handler + custom update log.
> In my opinion, this is not very efficient, because it doubles docs storage
> and leaves the Solr index as heavy as it is currently. My approach
> delegates stored-field storage entirely to the NoSQL database, using a
> user-defined unique key. This lets users quickly do partial updates of
> stored but non-indexed fields and greatly reduces the time required for
> replication under heavy write load.
>
> On Wed, Nov 5, 2014 at 4:04 PM, Alexandre Rafalovitch wrote:
>
>> On 5 November 2014 08:52, andrey prokopenko wrote:
>> > I assume there might be other developers trying to solve similar
>> > problems, so I'd be interested to hear about similar attempts & issues
>> > encountered while trying to implement such an integration between Solr
>> > and other NoSQL databases.
>>
>> I think DataStax does Solr+Cassandra and Cloudera does Solr+Hadoop
>> with underlying content stored in the databases. Also Neo4J has
>> graph+search integration, but I think it's directly using the Lucene
>> engine, not Solr.
>>
>> Disclaimer: this is a very high-level understanding; hopefully other
>> people can confirm.
>>
>> Regards,
>> Alex.
>>
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

--001a1137c3103bb77f05077d41b5--