lucene-solr-user mailing list archives

From Fuad Efendi <>
Subject Re: Best practice advice needed!
Date Thu, 25 Sep 2008 20:53:31 GMT
About web spiders: I simply use a "last modified timestamp" field in
SOLR and expire items after 30 days. If an item was updated (its timestamp
changed), it won't be deleted. If I delete it from the database, it will
be deleted from SOLR within 30 days. Spiders don't need
'transactional' updates.
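
A minimal sketch of that expiry step, assuming the timestamp is stored in a Solr date field (the name `last_modified` is hypothetical) and that the delete is sent as a standard XML delete-by-query message to the update handler. The helper only builds the message body; posting it and committing are left out.

```python
from xml.sax.saxutils import escape


def build_expiry_delete(field="last_modified", max_age_days=30):
    """Build a Solr XML delete-by-query body for documents older than max_age_days.

    `field` is an assumed name for the "last modified timestamp" field;
    it must be a Solr date field for the date math (NOW-30DAYS) to work.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    # Open-ended range: everything last modified more than N days ago.
    query = f"{field}:[* TO NOW-{max_age_days}DAYS]"
    # Escape XML special characters in case the field name contains any.
    return f"<delete><query>{escape(query)}</query></delete>"


if __name__ == "__main__":
    print(build_expiry_delete())
```

Documents that are re-indexed with a fresh timestamp fall outside the range and survive; anything no longer fed from the database ages out on its own.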

Recently I moved from MySQL to HBase. Its "row::column" structure is a
physically sorted, column-oriented structure. SOLR lazily follows
database updates; it's a very specific case...

Quoting Walter Underwood <>:

> That should be "flag it in a boolean column". --wunder
> On 9/25/08 11:51 AM, "Walter Underwood" <> wrote:
>> This will cause the result counts to be wrong and the "deleted" docs
>> will stay in the search index forever.
>> Some approaches for incremental update:
>> * full sweep garbage collection: fetch every ID in the Solr DB and
>> check whether that exists in the source DB, then delete the ones
>> that don't exist.
>> * mark for deletion: change the DB to leave the record but flag it
>> as deleted in a boolean row, then delete from Solr all deleted
>> items in the source DB. The items marked for deletion can be
>> deleted from the source DB at a later time.
>> * indexer scratchpad DB: a database used by the indexing code which
>> shows all the IDs currently in the index, usually with a last modified
>> time. This is similar to the full sweep, but may be much faster with
>> a dedicated DB. This can get arbitrarily fancy. Web spiders work like this.
>> wunder
>> On 9/25/08 10:08 AM, "Fuad Efendi" <> wrote:
>>> I am guessing your Enterprise system deletes/updates tables in RDBMS,
>>> and your SOLR indexes that data. Additionally to that, you have
>>> front-end interacting with SOLR and with RDBMS. At front-end level, in
>>> case of a search sent to SOLR returning primary keys for data, you may
>>> check your database using primary keys returned by SOLR before
>>> committing output to end users.
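
For reference, two of the approaches quoted above (the full sweep and the front-end check against primary keys) both reduce to comparing ID sets. This is only a sketch: the function names are hypothetical, and fetching the IDs from Solr and from the database is assumed to happen elsewhere.

```python
def stale_ids(index_ids, source_ids):
    """Full sweep: IDs present in the search index but missing from the source DB.

    These are the documents that should be deleted from the index.
    """
    return set(index_ids) - set(source_ids)


def filter_live_hits(hit_ids, source_ids):
    """Front-end check: keep only hits whose primary key still exists in the DB.

    Preserves the ranking order returned by the search engine.
    """
    live = set(source_ids)
    return [doc_id for doc_id in hit_ids if doc_id in live]


if __name__ == "__main__":
    print(stale_ids(["a", "b", "c"], ["a", "c"]))
    print(filter_live_hits(["c", "b", "a"], ["a", "c"]))
```

Note that filtering at the front end hides deleted rows from users but, as pointed out in the thread, leaves result counts wrong and the stale documents in the index until a sweep or expiry removes them.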
