Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <4D876762.80400@elyograg.org>
References: <AANLkTi=8DJkUTRFwTBDmcY0XSgOmYyYcXx4H64xtTg7v@mail.gmail.com>
	<4D7F65D0.5070909@elyograg.org>
	<AANLkTim8vTbeZV+TOhTvO-iNbU98+cZyTX6Oinnajt_N@mail.gmail.com>
	<4D876762.80400@elyograg.org>
Date: Mon, 21 Mar 2011 11:52:45 -0400
Message-ID: <AANLkTin=bdyjs180k5Phq9MBYzqNNoJYY4TnWMOLz3jJ@mail.gmail.com>
Subject: Re: keeping data consistent between Database and Solr
From: "onlinespending@gmail.com" <onlinespending@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=90e6ba6e864665662c049f001e68

--90e6ba6e864665662c049f001e68
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey <solr@elyograg.org> wrote:

> On 3/15/2011 12:54 PM, onlinespending@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr.  And you overwrite this
>> document ID even when you modify an existing document.  Very cool.  I
>> suppose the number can even rotate back to 0, as long as you handle that.
>>
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0.  My build
> system would not be able to handle wraparound without manual intervention.
> If we have that problem, I think we'd have to renumber the entire database
> and reindex.


One way to slow the growth of this number would be to store a "batch ID"
rather than a "document ID". If you've just added batch #1428 to the Solr
index, then any new or updated documents in your SQL database would be
assigned #1429. Since you already have a unique tag ID, you may be OK with
a non-unique ID for the sake of keeping track of index updates.


>
>
>> I am thinking of using a timestamp to achieve a similar thing. All documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index.  In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>>
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like.  I hope you know what those words mean. :)
> It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp.  That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.


Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to clock skew and to where the last-used timestamp is
stored.
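
One way I could sidestep the skew problem, sketched below in Python, is to
never read the indexer's own clock at all: the stored watermark is always the
largest timestamp seen in the data itself. The function name select_changed
and the (doc_id, modified_ts) tuple shape are assumptions for illustration,
with timestamps as plain integers. The comparison is >= rather than >, so a
document written late with the same timestamp as the watermark is not missed.
Re-sending a document should be harmless as long as Solr overwrites on the
unique key.

```python
def select_changed(docs, watermark):
    """Pick documents modified at or after the watermark.

    docs is an iterable of (doc_id, modified_ts) tuples with integer
    timestamps. Returns (changed_docs, new_watermark). The new watermark
    comes from the data, never from the local clock, so the same inputs
    always give the same result.
    """
    changed = [d for d in docs if d[1] >= watermark]
    new_watermark = max((d[1] for d in changed), default=watermark)
    return changed, new_watermark
```

Running it twice with the same watermark selects the same documents, which is
the idempotence Shawn is asking for.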


>
>
>> I'm curious how you handle the delta-imports. Do you have some routine that
>> periodically checks for updates to your MySQL database via the document ID?
>> Which language do you use for that?
>>
>
> The entire build system is written in Perl, where I am comfortable.  I even
> wrote an object-oriented module that the scripts share.  The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run.  There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards.  These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.
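
For my own notes, here is how I picture one run of that update script, as a
Python sketch rather than Shawn's actual Perl. Everything named here is
hypothetical: run_update, the start_import and fetch_status callbacks, and the
shape of the status dict ("status" of "busy" or "idle", plus a "failed" flag
that the caller would derive from the /dataimport response). The point is the
ordering: capture the maximum ID before starting, and only move the recorded
ID forward when the import finished without failing.

```python
import time

def run_update(last_id, max_id, start_import, fetch_status,
               poll_seconds=0.0, max_polls=100):
    """Run one delta import and return the document ID to record.

    last_id: ID recorded by the last successful run.
    max_id: highest document ID in the database, read before the import starts.
    start_import(low, high): hypothetical callback that kicks off the import
        of documents with low < id <= high.
    fetch_status(): hypothetical callback returning a dict such as
        {"status": "busy" or "idle", "failed": bool}.
    """
    if max_id <= last_id:
        return last_id  # nothing new since the last successful run
    start_import(last_id, max_id)
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "idle":
            # Import finished: advance only if it did not fail.
            return last_id if status["failed"] else max_id
        time.sleep(poll_seconds)
    return last_id  # still busy after max_polls; try again next run
```

Because max_id is read up front, rows inserted while the import is running
have a higher ID and are simply picked up on the next run.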


> Thanks,
> Shawn
>

Thanks for the info. That's very helpful!

Ben

--90e6ba6e864665662c049f001e68--