Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of andrey4get@gmail.com
 designates 209.85.218.67 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <C956589D6C8C4697A95B480550BE1EED@JackKrupansky14>
References: 
 <CAHTX=rb8=4r7gsC1qeHdTJBT8w+BLwmiw5gRhRUdjkeh7JmtXg@mail.gmail.com>
	<CAEFAe-GWgRSKzb6h=NL06MiOdSvv839XMmQzC0XKwRk+EHcZ-g@mail.gmail.com>
	<CAHTX=rajYe2ueaVJB_-61OiPVXFgAaKxdJ-PtAEsVbRsBJqoJA@mail.gmail.com>
	<C956589D6C8C4697A95B480550BE1EED@JackKrupansky14>
Date: Mon, 10 Nov 2014 09:50:30 +0100
Message-ID: 
 <CAHTX=raWg6xOkE3egdVUQCH2TbKyprKvC8nz2cr5_SQB6VVu5g@mail.gmail.com>
Subject: Re: on regards to Solr and NoSQL storages integration
From: andrey prokopenko <andrey4get@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a1137c3103bb77f05077d41b5

--001a1137c3103bb77f05077d41b5
Content-Type: text/plain; charset=UTF-8

Thanks for the useful information on DataStax internals, very interesting.
My solution stores both the primary key reference links and the data itself in
NoSQL storage, i.e. index parts for the stored fields are not created at
all. I guess that when DataStax created their solution, the Solr Codec API
was not yet available, hence their choice of the update processors
API.
Also, I've deliberately chosen not to use an indexing queue or log while
updating the stored fields. Any insert or update to a segment is therefore
always in sync with the NoSQL data, and any errors during document
inserts and deletions are immediately propagated to the user
application for possible recovery actions or rollback operations. So, I
think my solution can be safely used instead of the paid DataStax app.
According to the tests, performance suffers neither from inserts nor from
delete & merge operations executed on large Solr indexes, because in the
case of a segment merge only the key links, which map the internal
doc_id + segment_id to the user-defined unique primary key, are updated in
NoSQL storage.
The approach I used can be repeated to integrate Solr stored fields with
any database backend, not just Oracle NoSQL.
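
To make the key-link idea concrete, here is a minimal sketch in Python. It
is not the actual implementation: the class names, the in-memory store, and
the (segment, doc_id) key layout are all assumptions made for illustration.
It shows synchronous writes whose errors reach the caller, and a merge that
rewrites only the small link records while the stored documents stay put.

```python
class InMemoryKeyValueStore:
    """Hypothetical stand-in for a NoSQL client (a real one would do network I/O)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]


class ExternalStoredFields:
    """Keeps stored fields outside the index, addressed through key links.

    Assumed layout (illustrative only):
      ("doc", primary_key)      -> stored fields of the document
      ("link", segment, doc_id) -> primary_key
    """

    def __init__(self, store):
        self._store = store

    def add_document(self, segment, doc_id, primary_key, fields):
        # No queue and no log: both writes happen synchronously, so any
        # exception raised by the store propagates straight to the caller.
        # Note: this sketch is not transactional across the two writes.
        self._store.put(("doc", primary_key), dict(fields))
        self._store.put(("link", segment, doc_id), primary_key)

    def delete_document(self, segment, doc_id):
        primary_key = self._store.get(("link", segment, doc_id))
        self._store.delete(("link", segment, doc_id))
        self._store.delete(("doc", primary_key))

    def load(self, segment, doc_id):
        primary_key = self._store.get(("link", segment, doc_id))
        return self._store.get(("doc", primary_key))

    def merge(self, sources, target_segment):
        """Re-point links of live docs to the merged segment.

        sources is a list of (segment, [live doc ids in order]).
        Only link records are rewritten; document payloads are untouched.
        Returns the number of documents in the merged segment.
        """
        next_doc_id = 0
        for segment, live_doc_ids in sources:
            for doc_id in live_doc_ids:
                primary_key = self._store.get(("link", segment, doc_id))
                self._store.put(("link", target_segment, next_doc_id), primary_key)
                self._store.delete(("link", segment, doc_id))
                next_doc_id += 1
        return next_doc_id


if __name__ == "__main__":
    fields = ExternalStoredFields(InMemoryKeyValueStore())
    fields.add_document("_0", 0, "user-1", {"title": "first"})
    fields.add_document("_1", 0, "user-2", {"title": "second"})
    fields.merge([("_0", [0]), ("_1", [0])], "_2")
    print(fields.load("_2", 1))
```

The cost of a merge in this sketch grows with the number of live documents,
not with the size of their stored fields, which is the property described
above.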

On Sun, Nov 9, 2014 at 5:09 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

> There is no "double storage" of data - the Solr index for DataStax
> Enterprise ignores the "stored" attribute and only stores the primary key
> data to allow the Solr document to reference the Cassandra row, which is
> where the data is stored. The exception would be doc values, where the data
> does need to be kept in the index for efficient operation of Lucene and
> Solr, but that would only be done for fields such as facet fields and is
> under the complete control of the developer.
>
> DataStax Enterprise also utilizes an indexing queue so that Cassandra
> inserts and updates can occur at full speed, with indexing in a background
> thread, maximizing ingestion performance.
>
> -- Jack Krupansky
>
> -----Original Message----- From: andrey prokopenko
> Sent: Friday, November 7, 2014 5:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: on regards to Solr and NoSQL storages integration
>
>
> Thanks for the reply. I've considered DataStax, but dropped it first due to
> the commercial model they're using and second due to the integration model
> they have chosen to integrate with Cassandra. In their docs (can be found
> here:
> http://www.datastax.com/docs/datastax_enterprise3.1/solutions/dse_search_load_data),
> they do not disclose the architecture and details of their integration
> solution, yet the examination of the Solr configuration and handlers from
> their distribution package has revealed that they essentially let the docs
> rest both in Solr index and Cassandra storage. To safely propagate
> documents to Cassandra on each Solr index update, they use their own
> update handler + custom update log.
> In my opinion, this is not very efficient, because it doubles docs storage
> and leaves Solr index as heavy as it is currently. My approach completely
> delegates stored fields storage to a NoSQL database, using a user-defined
> unique key. This lets users quickly do partial updates of stored but
> non-indexed fields and greatly reduces the time required for
> replication in case of heavy write/load.
>
> On Wed, Nov 5, 2014 at 4:04 PM, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
>
>  On 5 November 2014 08:52, andrey prokopenko <andrey4get@gmail.com> wrote:
>> > I assume, there might be other developers, trying to solve similar
>> > problems, so I'd be interested to hear about similar attempts & issues
>> > encountered while trying to implement such an integration between Solr and
>> > other NoSQL databases.
>>
>> I think DataStax does Solr+Cassandra and Cloudera does Solr+Hadoop
>> with underlying content stored in the databases. Also Neo4J has
>> graph+search integration, but I think it's directly using Lucene
>> engine, not Solr.
>>
>> Disclaimer: this is very high level understanding, hopefully the other
>> people can confirm.
>>
>> Regards,
>>    Alex.
>>
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>

--001a1137c3103bb77f05077d41b5--