lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "onlinespending@gmail.com" <onlinespend...@gmail.com>
Subject Re: keeping data consistent between Database and Solr
Date Mon, 21 Mar 2011 15:52:45 GMT
On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey <solr@elyograg.org> wrote:

> On 3/15/2011 12:54 PM, onlinespending@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr.  And you overwrite
>> this
>> document ID even when you modify an existing document.  Very cool.  I
>> suppose the number can even rotate back to 0, as long as you handle that.
>>
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0.  My build
> system would not be able to handle wrapraound without manual intervention.
>  If we have that problem, I think we'd have to renumber the entire database
> and reindex.


One solution to reduce the rate at which this number grows would be to store
a "batch ID" rather than a "document ID". If you've just added batch #1428
to the Solr index, then any new updated documents in your SQL database would
be assigned #1429. Since you already have a unique tag ID, you may be OK
with a non-unique ID for the sake of keeping track of index updates.


>
>
>  I am thinking of using a timestamp to achieve a similar thing. All
>> documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index.  In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>>
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like.  I hope you know what those words mean. :)
>  It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp.  That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.


Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to things.


>
>
>  I'm curious how you handle the delta-imports. Do you have some routine
>> that
>> periodically checks for updates to your MySQL database via the document
>> ID?
>> Which language do you use for that?
>>
>
> The entire build system is written in Perl, where I am comfortable.  I even
> wrote an object-oriented module that the scripts share.  The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run.  There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards.  These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.


> Thanks,
> Shawn
>

Thanks for the info. That's very helpful!

Ben

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message