lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "Per Steffensen/Update semantics" by Per Steffensen
Date Tue, 17 Apr 2012 15:05:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "Per Steffensen/Update semantics" page has been changed by Per Steffensen:
http://wiki.apache.org/solr/Per%20Steffensen/Update%20semantics?action=diff&rev1=10&rev2=11

  
  == Motivation ==
  
- Solr is missing advanced features when using it as a NoSQL database and not just as a search
index. When talking about using it as a NoSQL database instead of just as a search index,
I primarily mean cases where you use Solr in a way where you have (potentially many) threads
concurrently inserting, updating and deleting documents. Using Solr as a search index, is
more about first indexing your entire world into Solr using one thread (or many threads, but
without the possibility that they mess with data indexed by another thread), and afterwards
solely using it for quering.
+ Solr is missing advanced features when using it as a NoSQL database and not just as a search
index. When talking about using it as a NoSQL database instead of just as a search index,
I primarily mean cases where you use Solr in a way where you have (potentially many) threads
concurrently inserting, updating and deleting documents. Using Solr as a search index, is
more about first indexing your entire world into Solr using one thread (or many threads, but
without the possibility that they mess with data indexed by another thread), and afterwards
solely using it for searching.
  
  Some of the features missing are:
-  * Insert semantics as we know it from RDBMSs: Do not insert a document if it already exists
in Solr. A document is defined to exist in Solr, if a document with the same value in uniqueKey-field
already exists in Solr. Very much like the following SQL does NOT insert (instead it fails
with a UniqueKeyConstraint error) if there is a unique key constraint on column "id" and a
row with id=1234 already exists: "INSERT INTO docs (id, column2, column3,...) VALUES (1234,
value2, value3,...)"
+  * Insert semantics as we know it from RDBMSs: Do not insert a document if it already exists
in Solr. A document is defined to exist in Solr, if a document with the same value in uniqueKey-field
already exists. Very much like the following SQL does NOT insert (instead it fails with a
UniqueKeyConstraint error) if there is a unique key constraint on column "id" and a row with
id=1234 already exists: "INSERT INTO docs (id, column2, column3,...) VALUES (1234, value2,
value3,...)"
-  * Update semantics as we know if from RDBMSs: Do not update if the document does not already
exist (it might have existed, but have been deleted). Very much like the following SQL does
NOT update anything if a row with id=1234 does not already exist: "UPDATE docs SET column2=value2,
column3=value3, ... WHERE id=1234"
-  * Update semantics with version control (for optimistic locking) as we know it from RDBMSs:
Do not update if the document does not already exist or if it has been changed since it was
loaded for update by the client doing the update. Very much like the following SQL does NOT
update a row with id=1234 if the version of the document in Solr at the time of update is
not (any longer) 5678: "UPDATE docs SET column2=value2, column3=value3, ... WHERE id=1234
AND version=5678". This feature is used by popular O/R-mappers (like Hibernate) to provide
a VersionConflict error if the object (row/document) you loaded for update has changed since
you loaded it when you try to store your updated version.
+  * Update semantics with version control (for optimistic locking) as we know it from RDBMSs:
Do not add the document if it does not already exist, and do not update it if it has been
changed since it was loaded for update by the client doing the update. Very much like the
following SQL does NOT update a row with id=1234 if the version of the document in Solr at
the time of update is not (any longer) 5678: "UPDATE docs SET column2=value2, column3=value3,
... WHERE id=1234 AND version=5678". This feature is used by popular O/R-mappers (like Hibernate)
to raise a VersionConflict error when you try to store an object (row/document) that has
changed since you loaded it for update.
  
- == Implementation ==
+ == Solution ==
  
+ The above features could have been implemented by providing new operations for "updating"
documents in Solr, separate from the "update-add-docs" operation. Instead, "update-add-docs"
remains the only operation you have for inserting/updating documents in Solr, but
now you have a way of controlling the exact semantics Solr applies behind the scenes.
First of all you need to decide which semantics-mode you want to use. You have the following
options:
+  * '''classic''': Solr uses the same update semantics as it has always done, without failing
on "unique key conflict" during create/insert and without failing on "version conflicts" during
update. This is default, so out of the box Solr works as always.
+  * '''consistent''': You are forced to (indirectly) state if your intent is to insert or
update. If your intent is to insert, the "update-add-docs" operation will fail if a document
with the same uniqueKey-value already exists. If your intent is to update, the "update-add-docs"
operation will fail if the document (a document with the same uniqueKey-value) does not already
exist, or if the value of the _version_-field does not match the value in the already existing
document. You state your intent by setting the value of the _version_ field:
+  ** _version_ <= 0 (or not set): Intent is to insert
+  ** _version_ > 0: Intent is to update
+  * '''classic-consistent-hybrid''': A hybrid between '''classic''' and '''consistent'''.
The only difference from '''consistent''' is that you get '''classic''' semantics if you set _version_
to 0 (or don't set it)
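
The rules above for turning a _version_ value into an intent can be sketched as a small pure function. This is an illustration only: the names `SemanticsMode`, `Intent` and `IntentResolver` are hypothetical and not taken from the Solr code, and the treatment of _version_ in classic mode (ignored) is an assumption. Only the decision rules themselves come from the list above.

```java
// Hypothetical names throughout; only the decision rules come from the text above.
enum SemanticsMode { CLASSIC, CONSISTENT, CLASSIC_CONSISTENT_HYBRID }

enum Intent { CLASSIC_ADD, INSERT, UPDATE }

final class IntentResolver {
    private IntentResolver() {}

    /** @param version value of the _version_ field, or null if the field is not set */
    static Intent resolve(SemanticsMode mode, Long version) {
        if (mode == SemanticsMode.CLASSIC) {
            // Assumption: in classic mode the _version_ value plays no role.
            return Intent.CLASSIC_ADD;
        }
        if (version != null && version > 0) {
            // _version_ > 0: intent is to update
            return Intent.UPDATE;
        }
        boolean zeroOrUnset = (version == null || version == 0);
        if (mode == SemanticsMode.CLASSIC_CONSISTENT_HYBRID && zeroOrUnset) {
            // Hybrid mode only: 0 or unset falls back to classic semantics
            return Intent.CLASSIC_ADD;
        }
        // _version_ <= 0 (or not set): intent is to insert
        return Intent.INSERT;
    }
}
```

Under these rules a negative _version_ means "insert" in both consistent and hybrid mode, while 0 (or no value) means "insert" in consistent mode but "classic add" in hybrid mode.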
- The above features could have been implemented by providing you with different ways of "updating"
documents in Solr, than by using the "update-add" operation. But instead the "update-add"
operation is still the only operation you have for inserting/updating documents in Solr, but
now you have a way of controlling the exact semantics you want Solr to do behind the scenes,
when you make an "update-add" request. You can control is by request parameters or by attributes
on your add content - just as you can with e.g. the "overwrite" and "commitWithin" flags.
- 
- If you think about it, the "overwrite" flag, that has been around for a while, is actually
already a way for you to control the inner semantics of Solr when processing your "update-add"
operation. So basically the implementation of the features mentioned on this page, replaces
the "overwrite" flag with a "semantics" flag. The "semantics" flag can take the following
values:
-  * classic-update: This is the default, so if you dont want to, you never have to explicitly
set "semantics=classic-update". You can if you want, though. Setting "semantics=classic-update"
makes Solr provide the same semantics in "update-add" operations, as it has always done when
"overwrite" was not set to "false" (overwrite=true is default)
-  * classic-update-dont-overwrite: The semantics you get with "semantics=classic-update-dont-overwrite"
is the same as what you have always gotten with "overwrite" set to false. Since "overwrite"
flag is still possible for backward compatibility, basically you can get this semantics in
two ways - either by setting "semantics=classic-update-dont-overwrite" or by setting "overwrite=false"
(and not setting "semantics")
-  * db-insert: Setting "semantics=db-insert" provides a new type of semantics behind "update-add"
operations. The document sent in the "update-add" request is added to the shard/core/index
if and only if the document does not already exist. Remember that a document "already exists"
if a document with the same value in uniqueKey-field is already in the core at the time of
the "update-add" operation. If the document does exist this semantics makes the "update-add"
operation result in a "DocumentAlreadyExists" error. This feature is of course "thread-safe"
in the way that, if the core does not contain a document with uniqueKey-field-value "cool_features",
and 10 client-threads "at the same time" tries to do an "update-add" operation with "semantics=db-insert"
and a document with uniqueKey-field-value "cool_features", only one thread will succeed -
9 threads will end up having a "DocumentAlreadyExists" error.
-  * db-update: Setting "semantics=db-update" provides another new type of semantics behind
"update" operation. The document sent in the "update-add" request is added (and the old corresponding
document deleted) if and only if the document already exists. If the document does not exist
(it might have existed, but have been deleted by the time of the "update-add" operation) you
will get a "DocumentDoesNotExist" error. If your schema constains a "_version_" field and
you put a value for the "_version_" field in the document you send for update, you will have
version control (for optimistic locking) does as well. The "update-add" operation will result
in a "VersionConflict" error, if the value of "_version_"-field in the document sent for update
does not match the value of the "_version_"-field of the document in Solr at the time of the
"update-add" operation. This feature is of course also "thread-safe" in the way that, if the
core contains a document with uniqueKey-field-value "versioning_rocks" and "_version_"-field-value
"5678", and 10 client-threads "at the same time" tries to do an "update" operation with "semantics=db-update"
and a document with uniqueKey-field-value "versioning_rocks" and "_version_"-field-value "5678",
only one thread will succeed - 9 threads will end up having a "VersionConflict" error.
  
  == Using it ==
  
- This section describes how to use the feature as a Solr client. Basically "semantics" has
been added as an [[UpdateXmlMessages#Optional_attributes_for_.22add.22|optinal attribute for
add]] and as a potential URL parameter.
+ This section describes how to use the features as a Solr client.
  
  === Requirements ===
  
- To use "semantics=db-insert" or "semantics=db-update" there are a few requirements to your
Solr schema and configuration.
+ You control the semantics-mode by adding a semanticsMode tag inside your DirectUpdateHandler2-based
updateHandler. In solrconfig.xml:
+ {{{#!xml
+   <updateHandler class="solr.DirectUpdateHandler2">
+     ...
+     <semanticsMode>put classic, consistency or classic-consistency-hybrid here</semanticsMode>
+     ...
+   </updateHandler>
+ }}}
+ '''classic''' is the default if you don't add a semanticsMode tag.
+ 
+ To use '''consistency''', or the consistency-features of '''classic-consistency-hybrid''',
there are a few additional requirements to your Solr schema (schema.xml) and configuration
(solrconfig.xml). (There is really no reason to configure '''classic-consistency-hybrid''' if you
never plan to use the consistency-features, i.e. if you always send 0 as _version_.)
  
   * You need to have a uniqueKey-field in your schema. In schema.xml e.g.:
  {{{#!xml
   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <uniqueKey>id</uniqueKey>
  }}}
-  * If you want version control, you need to have a "_version_"-field in your schema. In
schema.xml:
+  * If you want to do consistency updates including version control (sending values for _version_
greater than 0), you need to have a "_version_"-field in your schema. In schema.xml:
  {{{#!xml
   <field name="_version_" type="long" indexed="true" stored="true" />
  }}}
-  * You need to use DirectUpdateHandler2 as your update-handler with updateLog enabled. In
solrconfig.xml:
+  * You need to enable updateLog in your DirectUpdateHandler2-based updateHandler. In solrconfig.xml:
  {{{#!xml
   <updateHandler class="solr.DirectUpdateHandler2">
-   <updateLog>
+     ...
+     <updateLog class="solr.FSUpdateLog">
-    <str name="dir">${solr.data.dir:}</str> 
+       <str name="dir">${solr.data.dir:}</str>
-   </updateLog>
+     </updateLog>
+     ...
   </updateHandler>
  }}}
  
  === Homemade requests ===
  
- ==== Requst level control ====
+ ==== Construct requests - JSON ====
  
+ Send values for the _version_ field in your JSON documents (see more [[UpdateJSON|here]])
to control whether you get insert-, update- or classic-semantics (depending on your semantics-mode
configuration) like this:
- Add "&semantics=XXXX", where XXXX is "semantics=db-insert" or "semantics=db-update"
(or one of the classic semantics), to your HTTP update requests. The semantics provided at
request level is the default which will be used for add operations where semantics not specifically
provided (see sections below). In general it is only meaningful to use add-operation level
control when you send more than one add-operation (and want different update semantics among
those) within the same request.
- 
- ==== Add-operation level control - JSON ====
- 
- And/or add semantics field to your "add" JavaScript Objects (see more [[UpdateJSON|here]])
  {{{#!json
- {
- "add": {
-   "semantics": XXXX
- }
- }
+ [
+  {... set doc fields ..., "_version_" : -1}
+  ... add other docs ...
+  {... set doc fields ..., "_version_" : 1234567890}
+ ]
  }}}
  
- ==== Add-operation level control - XML ====
+ ==== Construct requests - XML ====
  
- And/or add semantics attribute to your add tags (see more [[UpdateXMLMessages|here]])
+ Send values for the _version_ field in your XML documents (see more [[UpdateXmlMessages|here]])
to control whether you get insert-, update- or classic-semantics (depending on your semantics-mode
configuration) like this:
  {{{#!xml
- <add semantics="XXXX">
-   ...
+ <add>
+   <doc>
+     ... set doc fields ...
+     <field name="_version_">-1</field>
+   </doc>
+   ... add other docs ...
+   <doc>
+     ... set doc fields ...
+     <field name="_version_">1234567890</field>
+   </doc>
  </add>
  }}}
  
@@ -87, +101 @@

  
  === SolrJ requests ===
  
- ==== Requst level control ====
+ ==== Construct requests ====
  
  {{{#!java
- SolrServer server = ...
- UpdateRequest request = new UpdateRequest();
- request.add(doc1);
- request.add(doc2);
- ...
- request.add(docN);
- request.setParam(UpdateParams.UPDATE_SEMANTICS, UpdateSemantics.XXXX.toString());
- request.process(server);
+ List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
+ SolrInputDocument docA = new SolrInputDocument();
+ ... set docA fields ...
+ docA.addField(SolrInputDocument.VERSION_FIELD, -1);
+ docs.add(docA);
+ ... setup other docs ...
+ SolrInputDocument docN = new SolrInputDocument();
+ ... set docN fields ...
+ docN.addField(SolrInputDocument.VERSION_FIELD, 1234567890);
+ docs.add(docN);
  }}}
  
- ==== Add-operation level control ====
+ ==== Sending requests ====
  
- Remember, it is only meaningful to use add-operation level control when you send more than
one add-operation (and want different update semantics among those) within the same request.
+ {{{#!java
+ SolrServer server = ... somehow you have a server ...
  
- SolrJ supports sending more UpdateRequests in the same HTTP request when you are using StreamingUpdateSolrServer
as server in the code above. If you repeat the lines creating and processing an UpdateRequest
(of course different document(s) and potentially params) quickly enough, StreamingUpdateSolrServer
might choose to send multiple of the UpdateRequests in the same HTTP request. StreamingUpdateSolrServer
will send params on operation level if necessary, so seen from the SolrJ user the code above
is really all you need to know.
+ UpdateResponse response = server.add(docs, ... your SolrParams ...);
+ }}}
  
  ==== Catching errors ====
  
- TODO - both CommonsHttpSolrServer and StreamingUpdateSolrServer
+ If you send many documents in your request it is possible that the insert/update-operation
will fail for some documents (due to "unique key constraints", "version checking" etc) while
it succeeds for other documents. Therefore you need to deal with partial errors.
+ 
+ {{{#!java
+ UpdateResponse response;
+ try {
+     response = server.add(docs, ... your SolrParams ...);
+ } catch (PartialErrors e) {
+     response = (UpdateResponse)e.getSpecializedResponse();
+     DocumentUpdatePartialError err;
+     err = response.getPartialError(docA);
+     ... if and only if err is not null the insert/update of docA failed ...
+     ... check for errors for other docs ...
+     err = response.getPartialError(docN);
+     ... if and only if err is not null the insert/update of docN failed ...
+ }
+ }}}
+ 
+ The possible exception types (subclasses of DocumentUpdatePartialError) of '''err''' are
+  * org.apache.solr.common.partialerrors.DocumentDoesNotExist: Indicating that the document
you tried to consistency update does not exist (anymore)
+  * org.apache.solr.common.partialerrors.DocumentAlreadyExists: Indicating that the document
you tried to consistency create already exists (or at least a document with the same uniqueKey
value)
+  * org.apache.solr.common.partialerrors.VersionConflict: Indicating that the document you
tried to consistency update has changed since you fetched it for update (version number has
changed)
+  * org.apache.solr.common.partialerrors.WrongUsage: Indicating that you are using the features
in a wrong way - e.g. if you try to do a consistency insert/update but there is no uniqueKey
defined in your Solr schema or no value for the uniqueKey-field of the document sent in the
request.
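
To make the per-document outcomes concrete, here is a self-contained sketch that models the consistency checks against an in-memory map and collects one error per failed document. It does not use SolrJ. `InMemoryConsistentStore`, `Doc` and `ErrorKind` are hypothetical names, with the error kinds named after the partial-error classes listed above. How new version numbers are assigned is also an assumption (here a simple counter).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model, not SolrJ: error kinds mirror the partial-error classes above.
enum ErrorKind { DOCUMENT_ALREADY_EXISTS, DOCUMENT_DOES_NOT_EXIST, VERSION_CONFLICT, WRONG_USAGE }

final class Doc {
    final String id;      // plays the role of the uniqueKey value
    final long version;   // plays the role of _version_ (<= 0: insert, > 0: update)
    final String body;
    Doc(String id, long version, String body) {
        this.id = id;
        this.version = version;
        this.body = body;
    }
}

final class InMemoryConsistentStore {
    private final Map<String, Doc> docs = new ConcurrentHashMap<>();
    private final AtomicLong nextVersion = new AtomicLong(1); // assumption: counter-based versions

    /** Applies each document independently; returns only the documents that failed. */
    Map<Doc, ErrorKind> addAll(List<Doc> batch) {
        Map<Doc, ErrorKind> errors = new LinkedHashMap<>();
        for (Doc d : batch) {
            ErrorKind e = add(d);
            if (e != null) {
                errors.put(d, e);
            }
        }
        return errors;
    }

    /** Returns null on success, otherwise the reason this document was rejected. */
    ErrorKind add(Doc d) {
        if (d.id == null) {
            return ErrorKind.WRONG_USAGE; // no uniqueKey value in the document
        }
        if (d.version <= 0) {
            // Insert intent: atomically succeed only if no document with this id exists.
            Doc stored = new Doc(d.id, nextVersion.getAndIncrement(), d.body);
            return docs.putIfAbsent(d.id, stored) == null ? null : ErrorKind.DOCUMENT_ALREADY_EXISTS;
        }
        // Update intent: the existence check, version check and swap happen atomically per key.
        ErrorKind[] result = new ErrorKind[1];
        docs.compute(d.id, (id, current) -> {
            if (current == null) {
                result[0] = ErrorKind.DOCUMENT_DOES_NOT_EXIST;
                return null; // leave the key absent
            }
            if (current.version != d.version) {
                result[0] = ErrorKind.VERSION_CONFLICT;
                return current; // keep the stored document unchanged
            }
            return new Doc(id, nextVersion.getAndIncrement(), d.body);
        });
        return result[0];
    }

    Doc get(String id) {
        return docs.get(id);
    }
}
```

Because each document is checked on its own, a batch can end up partly applied: the returned map lists only the failed documents, much like `getPartialError` returns null for documents that went through.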
+ 
+ If you only send one document in your request there is no reason to deal with PartialErrors,
so for convenience catching per-document errors is possible like this:
+ {{{#!java
+ try {
+     UpdateResponse response = server.add(... one doc ..., ... your SolrParams ...);
+ } catch (DocumentDoesNotExist e) {
+     ... do something ...
+ } catch (DocumentAlreadyExists e) {
+     ... do something ...
+ } catch (VersionConflict e) {
+     ... do something ...
+ } catch (WrongUsage e) {
+     ... do something ...
+ }
+ }}}
  
  === Realistic example ===
  
- TODO Show example of the server side of a Wiki application, using db-insert to prevent two
user from creating a page with the same name (unique key), and using db-update with version
control to prevent users overwriting each others changes to the text of a particular Wiki
page.
+ TODO Show example of the server side of a Wiki application, using consistency-inserts to
prevent two users from creating a page with the same name (unique key), and using consistency-updates
to prevent users overwriting each other's changes to the text of a particular Wiki page.
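
Until that example is written, the intended flow can be approximated with an in-memory stand-in. The sketch below does not talk to Solr, and all names (`WikiService`, `WikiPage`, `WikiConflictException`) are hypothetical. Page creation behaves like a consistency-insert, and saving an edit behaves like a consistency-update that carries the version the editor originally loaded. Incrementing the version by one on each save is an assumption made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for the Wiki server side; not Solr code.
final class WikiPage {
    final String title;   // plays the role of the uniqueKey
    final long version;   // plays the role of _version_
    final String text;
    WikiPage(String title, long version, String text) {
        this.title = title;
        this.version = version;
        this.text = text;
    }
}

final class WikiConflictException extends RuntimeException {
    WikiConflictException(String message) {
        super(message);
    }
}

final class WikiService {
    private final ConcurrentMap<String, WikiPage> pages = new ConcurrentHashMap<>();

    /** Like a consistency-insert: fails if a page with this title already exists. */
    WikiPage create(String title, String text) {
        WikiPage page = new WikiPage(title, 1, text);
        if (pages.putIfAbsent(title, page) != null) {
            throw new WikiConflictException("page already exists: " + title);
        }
        return page;
    }

    /** Returns the current page, or null if there is none. */
    WikiPage load(String title) {
        return pages.get(title);
    }

    /** Like a consistency-update: fails if the page is gone or changed since it was loaded. */
    WikiPage save(WikiPage loaded, String newText) {
        WikiPage current = pages.get(loaded.title);
        if (current == null) {
            throw new WikiConflictException("page no longer exists: " + loaded.title);
        }
        if (current.version != loaded.version) {
            throw new WikiConflictException("page changed since it was loaded: " + loaded.title);
        }
        WikiPage updated = new WikiPage(loaded.title, loaded.version + 1, newText);
        // Atomic swap: fails if another writer replaced the page after the checks above.
        if (!pages.replace(loaded.title, current, updated)) {
            throw new WikiConflictException("page changed since it was loaded: " + loaded.title);
        }
        return updated;
    }
}
```

If two editors load the same page and both save, the first save wins and the second one fails with a conflict. The application can then reload the page and ask the second editor to merge their changes.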
  
