lucene-solr-user mailing list archives

From Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@gmail.com>
Subject Re: DataImportHandler Robustness For Imports That Take A Long Time
Date Sat, 14 Mar 2009 03:58:04 GMT
Alternatively, you can do the commit yourself after marking in the DB:

  Context#getSolrCore().getUpdateHandler().commit()

Or, as you mentioned, you can use autocommit.
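
A minimal sketch of the ordering Chris asks for in point C of the quoted message below: commit a batch first, and only then advance the DB marker, so a rollback can never leave the marker ahead of the index. This is not DIH or Solr code. `Index`, `MarkerStore`, `CheckpointedImporter`, and the in-memory classes are hypothetical stand-ins that only model commit/rollback behaviour, and the batch size of 1000 is taken from the example in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT Solr or DIH APIs: they only model the
// commit/rollback semantics discussed in the thread.
interface Index {
    void add(String doc) throws Exception; // buffers a doc (uncommitted)
    void commit();                          // makes buffered docs durable
    void rollback();                        // discards buffered docs
}

interface MarkerStore {
    int load();          // number of rows known to be committed
    void save(int rows); // persist progress (e.g. a row in the DB)
}

final class CheckpointedImporter {
    private final Index index;
    private final MarkerStore marker;
    private final int batchSize;

    CheckpointedImporter(Index index, MarkerStore marker, int batchSize) {
        this.index = index;
        this.marker = marker;
        this.batchSize = batchSize;
    }

    /** Resumes at the saved marker; returns true if every row was imported. */
    boolean run(List<String> rows) {
        int pending = 0;
        try {
            for (int i = marker.load(); i < rows.size(); i++) {
                index.add(rows.get(i));
                pending++;
                if (pending == batchSize || i == rows.size() - 1) {
                    index.commit();     // 1. make the batch durable
                    marker.save(i + 1); // 2. only then advance the marker
                    pending = 0;
                }
            }
            return true;
        } catch (Exception e) {
            index.rollback(); // drops only the uncommitted tail
            return false;
        }
    }
}

// In-memory implementations used to simulate a failing import.
final class InMemoryIndex implements Index {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    String poison; // a doc that makes add() fail, to simulate an error

    public void add(String doc) throws Exception {
        if (doc.equals(poison)) throw new Exception("cannot index " + doc);
        pending.add(doc);
    }
    public void commit() { committed.addAll(pending); pending.clear(); }
    public void rollback() { pending.clear(); }
}

final class InMemoryMarker implements MarkerStore {
    private int rows;
    public int load() { return rows; }
    public void save(int rows) { this.rows = rows; }
}

class CheckpointDemo {
    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) rows.add("doc" + i);
        InMemoryIndex index = new InMemoryIndex();
        InMemoryMarker marker = new InMemoryMarker();
        CheckpointedImporter importer = new CheckpointedImporter(index, marker, 1000);

        index.poison = "doc1500"; // first run fails in the second batch
        System.out.println(importer.run(rows) + " " + marker.load()
                + " " + index.committed.size()); // false 1000 1000
        index.poison = null;      // "fix" the problem and re-run
        System.out.println(importer.run(rows) + " " + marker.load()
                + " " + index.committed.size()); // true 2500 2500
    }
}
```

With this ordering, the worst case is a crash between the commit and the marker update. The next run then re-imports one batch that is already in the index, which should be harmless if the schema defines a uniqueKey so that re-added documents overwrite the earlier copies.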

On Sat, Mar 14, 2009 at 12:31 AM, Chris Harris <ryguasu@gmail.com> wrote:
> Wouldn't this approach get confused if there was an error that caused
> DIH to do a rollback? For example, suppose this happened:
>
> * 1000 successful document adds
> * The custom transformer saves some marker in the DB to signal that
> the above docs have been successfully indexed
> * The next document add throws an exception
> * DIH, rather than doing a commit, rolls back the 1000 document adds
>
> At this point my database marker says that the 1000 docs have been
> successfully indexed, but the documents themselves are not actually in
> the Solr index. Because by hypothesis my import query is defined in
> terms of my DB marker, I'll never end up getting these docs into the
> Solr index, even if I resolve the issue that causes the exception and
> re-run the data import.
>
> It seems like, to do a safe equivalent of your suggestion, I'd have to
> somehow A) prevent DIH from doing any rollbacks, B) get DIH to do
> auto-commits, and C) make my custom transformer update the DB marker
> only immediately after an auto-commit.
>
> On Mon, Mar 9, 2009 at 9:27 PM, Noble Paul നോബിള്‍  नोब्ळ्
> <noble.paul@gmail.com> wrote:
>> I recommend writing a simple transformer that writes an entry
>> into the DB after every n documents (say 1000), and modifying your query
>> to take that entry into account so that subsequent imports start from there.
>>
>> DIH does not write the last_index_time unless the import completes successfully.
>>
>> On Tue, Mar 10, 2009 at 1:54 AM, Chris Harris <ryguasu@gmail.com> wrote:
>>> I have a dataset (7M-ish docs each of which is maybe 1-100K) that,
>>> with my current indexing process, takes a few days or maybe a week to
>>> put into Solr.  I'm considering maybe switching to indexing with the
>>> DataImportHandler, but I'm concerned about the impact of this on
>>> indexing robustness:
>>>
>>> If I understand DIH properly, then if Solr goes down for whatever
>>> reason during an import, then DIH loses track of what it has and
>>> hasn't yet indexed that round, and will thus probably do a lot of
>>> redundant reimporting the next time you run an import command. (For
>>> example, if DIH successfully imports row id 100, and then Solr dies
>>> before the DIH import finishes, and then I restart Solr and start a
>>> new delta-import, then I think DIH will import row id 100 again.) One
>>> implication for my dataset seems to be that, unless Solr can actually
>>> stay up for several days on end, then DIH will never finish importing
>>> my data, even if I manage to keep Solr at, say, 99% uptime. This would
>>> be fine if a full import took only a few hours. If full import could
>>> take a week, though, this is slightly unnerving. (Sometimes you just
>>> need to restart Solr. Or the machine itself, for that matter.)
>>>
>>> Are there any good ways around this with DIH? One potential option is
>>> to give each row in the database table not only a
>>> ModificationTimestamp column but also a DataImportHandlerTimestamp
>>> column, and try to get DIH to update that column whenever it finishes
>>> indexing a row. Then you'd modify the WHERE clause in the DIH config
>>> so that instead of determining which rows to index with something like
>>>
>>>  WHERE ModificationTimestamp > dataimporter.last_index_time
>>>
>>> you'd use something like
>>>
>>>  WHERE ModificationTimestamp > DataImportHandlerTimestamp
>>>
>>> In this way, hopefully, DIH can always pick up where it left off last time,
>>> rather than trying to redo any work it might have actually managed
>>> to do last round.
>>>
>>> (I'm using something along these lines with my current, non-DIH-based
>>> indexing scheme.)
>>>
>>> Am I making sense here?
>>>
>>> Chris
>>>
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul
