lucene-dev mailing list archives

From "Shawn Heisey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3954) Option to have updateHandler and DIH skip updateLog
Date Tue, 16 Oct 2012 23:09:03 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477445#comment-13477445 ]

Shawn Heisey commented on SOLR-3954:
------------------------------------

Here's a direct comparison on the same hardware.  It might be important to know that when
my import gets kicked off, there are actually four imports running at once.  One of them is
small: during the second test (updateLog off), it imported 687765 rows in 10 minutes and 8 seconds.
I did not check how long it took during the first test.  The other three imports are
nearly 13 million records each.

A du on the completed index directory with 12.9 million records shows 23520900 KB.

I ran the first test with updateLog turned on and grabbed stats after an hour.  Then I killed Solr,
commented out updateLog, started it up again, kicked off the full-import, and again grabbed stats
after an hour.  Comparing the two shows that indexing is about twice as fast with updateLog turned
off: 4167524 documents processed in the hour versus 2052095.

With updateLog turned on:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:1.762</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">2052096</str>
  <str name="Total Documents Processed">2052095</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 14:59:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in
the future.</str>
</response>
{code}

With updateLog turned off:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
</lst>
<lst name="initArgs">
  <lst name="defaults">
    <str name="config">dih-config.xml</str>
  </lst>
</lst>
<str name="status">busy</str>
<str name="importResponse">A command is still running...</str>
<lst name="statusMessages">
  <str name="Time Elapsed">1:0:0.434</str>
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">4167525</str>
  <str name="Total Documents Processed">4167524</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2012-10-16 16:05:01</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to change in
the future.</str>
</response>
{code}
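
For anyone who wants to check the arithmetic, the throughput implied by the two status responses can be worked out from the "Time Elapsed" and "Total Documents Processed" fields.  What follows is a minimal Python sketch and was not part of the original test.  The function names are made up for illustration, and the XML samples are trimmed down to the statusMessages block, with values copied from the two responses above.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two DIH status responses (statusMessages only).
WITH_LOG = """<response><lst name="statusMessages">
  <str name="Time Elapsed">1:0:1.762</str>
  <str name="Total Documents Processed">2052095</str>
</lst></response>"""

WITHOUT_LOG = """<response><lst name="statusMessages">
  <str name="Time Elapsed">1:0:0.434</str>
  <str name="Total Documents Processed">4167524</str>
</lst></response>"""


def parse_elapsed(text):
    """Convert a 'Time Elapsed' value such as '1:0:1.762' (h:m:s) to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def docs_per_second(status_xml):
    """Return documents processed per second for one status response."""
    root = ET.fromstring(status_xml)
    messages = root.find("lst[@name='statusMessages']")
    # Map each <str name="..."> child to its text content.
    fields = {child.get("name"): child.text for child in messages}
    processed = int(fields["Total Documents Processed"])
    return processed / parse_elapsed(fields["Time Elapsed"])


rate_on = docs_per_second(WITH_LOG)       # updateLog turned on
rate_off = docs_per_second(WITHOUT_LOG)   # updateLog turned off
ratio = rate_off / rate_on

print(f"updateLog on:  {rate_on:.1f} docs/sec")
print(f"updateLog off: {rate_off:.1f} docs/sec")
print(f"speedup:       {ratio:.2f}x")
```

With the numbers above, this works out to a speedup of about 2.03x, consistent with "about twice as fast".  Keep in mind that each status response covers only one of the concurrent imports.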

                
> Option to have updateHandler and DIH skip updateLog
> ---------------------------------------------------
>
>                 Key: SOLR-3954
>                 URL: https://issues.apache.org/jira/browse/SOLR-3954
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 4.0
>            Reporter: Shawn Heisey
>             Fix For: 4.1
>
>
> The updateLog feature makes updates take longer, likely because of the I/O time required
> to write the additional information to disk.  It may take as much as three times as long for
> the indexing portion of the process.  I'm not sure whether it affects the time to commit,
> but I would imagine that the difference there is small or zero.  When doing incremental updates/deletes
> on an existing index, the time lag is probably very small and unimportant.
> When doing a full reindex (which may happen via DIH), especially if this is done in a
> build core that is then swapped with a live core, this performance hit is unacceptable.  It
> seems to make the import take about three times as long.
> An option to have an update skip the updateLog would be very useful for these situations.
> It should have a method in SolrJ and be exposed in DIH as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

