Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <4D876762.80400@elyograg.org>
References: <4D7F65D0.5070909@elyograg.org> <4D876762.80400@elyograg.org>
Date: Mon, 21 Mar 2011 11:52:45 -0400
Subject: Re: keeping data consistent between Database and Solr
From: "onlinespending@gmail.com"
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=90e6ba6e864665662c049f001e68

--90e6ba6e864665662c049f001e68
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey wrote:

> On 3/15/2011 12:54 PM, onlinespending@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr. And you overwrite this
>> document ID even when you modify an existing document. Very cool. I
>> suppose the number can even rotate back to 0, as long as you handle that.
>>
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0. My build
> system would not be able to handle wraparound without manual intervention.
> If we have that problem, I think we'd have to renumber the entire database
> and reindex.

One way to reduce the rate at which this number grows would be to store a
"batch ID" rather than a "document ID". If you've just added batch #1428 to
the Solr index, then any new or updated documents in your SQL database would
be assigned #1429. Since you already have a unique tag ID, you may be OK with
a non-unique ID for the sake of keeping track of index updates.
>
>> I am thinking of using a timestamp to achieve a similar thing. All documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index. In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>>
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like. I hope you know what those words mean. :)
> It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp. That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.

Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to things.

>
>> I'm curious how you handle the delta-imports. Do you have some routine that
>> periodically checks for updates to your MySQL database via the document ID?
>> Which language do you use for that?
>>
>
> The entire build system is written in Perl, where I am comfortable. I even
> wrote an object-oriented module that the scripts share. The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run. There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards. These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.
>
> Thanks,
> Shawn
>

Thanks for the info. That's very helpful!

Ben

--90e6ba6e864665662c049f001e68--
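
For reference, the update loop Shawn describes (index everything with a
document ID higher than the one recorded at the last successful run) might
look roughly like the sketch below. It is in Python rather than his Perl, and
it is not his code: read_mark, write_mark, run_update, and the JSON state file
are hypothetical, and fetch_since and index_docs are placeholders for the
database query and the Solr import. The part that matters is that the mark is
written only after indexing succeeds, which keeps a rerun after a failure
idempotent.

```python
import json
import os
import tempfile


def read_mark(path):
    # Return the last successfully indexed document ID, or 0 on the first run.
    try:
        with open(path) as f:
            return json.load(f)["last_id"]
    except FileNotFoundError:
        return 0


def write_mark(path, last_id):
    # Write to a temp file and rename it, so a crash never leaves a partial mark.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump({"last_id": last_id}, f)
    os.replace(tmp, path)


def run_update(path, fetch_since, index_docs):
    last_id = read_mark(path)
    # fetch_since(last_id) must return (doc_id, body) pairs with doc_id > last_id.
    docs = fetch_since(last_id)
    if not docs:
        return last_id
    # index_docs raises on failure, in which case the mark stays unchanged.
    index_docs(docs)
    new_mark = max(doc_id for doc_id, _ in docs)
    write_mark(path, new_mark)
    return new_mark
```

A cron job would call run_update on each tick. Because the mark lives on the
same machine that runs the script, the same structure would also work with a
timestamp in place of the document ID, subject to the clock skew caveat
above.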