lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "DataImportHandlerDeltaQueryViaFullImport" by Lukas Smith
Date Wed, 22 Sep 2010 11:38:24 GMT

The "DataImportHandlerDeltaQueryViaFullImport" page has been changed by Lukas Smith.
The comment on this change is: taken from http://pooteeweet.org/blog/1827.
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

--------------------------------------------------

New page:
= Using the ''query'' attribute for both full and delta import =

The standard approach in Solr is to define three queries: one for the initial import, a second
to fetch the IDs of documents that have changed, and a third to fetch the data for those documents.
This isn't very efficient, especially if you expect a large number of changes. Furthermore,
if the initial import and the delta case share the same SELECT list, it's tedious to maintain
three queries where two are almost identical and the third is still very similar. The fundamental
idea is to define only one query for both the full and the delta import by reading the request
parameters on a full-import.

== Example ==

So take the example from the docs:
{{{
<entity name="item" pk="ID"
  query="SELECT * FROM item"
  deltaImportQuery="SELECT * FROM item
    WHERE id = '${dataimporter.delta.id}'"
  deltaQuery="SELECT id FROM item
    WHERE last_modified > '${dataimporter.last_index_time}'">
}}}

This can be rewritten as follows:
{{{
<entity name="item" pk="ID"
  query="SELECT * FROM item
    WHERE '${dataimporter.request.clean}' != 'false'
      OR last_modified > '${dataimporter.last_index_time}'">
}}}

When doing a normal full import, Solr defaults {{{clean}}} to {{{true}}} (watch out: you
can override this default in solrconfig.xml, so you might want to always set it explicitly
to true). As a result the first part of the WHERE condition will be {{{'true' != 'false'}}},
which any decent RDBMS will recognize as always true, so the entire WHERE clause evaluates
to true and the whole table is read:
{{{http://localhost:8983/solr/core0/dataimport?command=full-import&clean=true}}}

For a delta import you do not use the {{{delta-import}}} command. Instead you run a normal
full-import with the {{{clean}}} GET parameter set to {{{false}}}:
{{{http://localhost:8983/solr/core0/dataimport?command=full-import&clean=false}}}

In this case the first part of the WHERE clause will be {{{'false' != 'false'}}}, which is
always false, so any RDBMS should optimize it away and evaluate only the second condition.
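
To make the two cases concrete, here is a sketch of the SQL the database would receive once
the variables have been substituted. The timestamp is a made-up value standing in for whatever
{{{dataimporter.last_index_time}}} holds on your system:
{{{
-- full-import with clean=true: the first condition is constant true,
-- so the whole table is read
SELECT * FROM item
  WHERE 'true' != 'false'
    OR last_modified > '2010-09-21 03:00:00';

-- full-import with clean=false: the first condition is constant false,
-- so only rows changed since the last import are read
SELECT * FROM item
  WHERE 'false' != 'false'
    OR last_modified > '2010-09-21 03:00:00';
}}}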

This also means that if you want to use {{{deletedPkQuery}}}, running the {{{delta-import}}}
command is still necessary.

== Efficiency Aspect ==

There might be situations where separate queries are more efficient. Consider an example
where the query that fetches the document data is a very complex join for which the RDBMS
doesn't come up with a good query plan, while a SELECT without a JOIN can determine the
relatively small number of IDs that change every night. In that case it might be more
efficient to run one query to fetch the IDs and then one query per fetched ID that does the
complex join.
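
As an illustration of that trade-off, a separate-query configuration might look like the sketch
below. The {{{category}}} table, the {{{category_id}}} column and the join itself are hypothetical
and only stand in for the complex join; the attribute names and the {{{dataimporter}}} variables
are the ones from the docs example above:
{{{
<entity name="item" pk="ID"
  query="SELECT i.*, c.name AS category_name
    FROM item i JOIN category c ON c.id = i.category_id"
  deltaImportQuery="SELECT i.*, c.name AS category_name
    FROM item i JOIN category c ON c.id = i.category_id
    WHERE i.id = '${dataimporter.delta.id}'"
  deltaQuery="SELECT id FROM item
    WHERE last_modified > '${dataimporter.last_index_time}'"/>
}}}

Here the cheap {{{deltaQuery}}} touches only the {{{item}}} table, and the join runs once per
changed ID rather than over the whole table.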
