lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Indexer large file and hi performance indexing
Date Thu, 07 Sep 2006 13:31:12 GMT
Here's another approach that I *think* will work...

Remember that opening an IndexReader takes a snapshot of the index, and
later additions aren't visible until you open a new index reader.

So, at some point you have an FSDir index only, and your program starts.

You open your (shared) reader to the FSDir index.
You open writers to BOTH the FSDir and a (new) RAMDir.

Now, when each document comes in, you add it to the FSDir index AND the
RAMDir index.

You search both the FSDir index and the RAMDir index for all of your
searches, opening a new RAMDir reader for each search.

Periodically, you close and re-open your FSDir index reader, discard the
RAMDir index, and start a new RAMDir index that you use as above.

Now, your searches are completely up to date. You don't have to worry about
merging the indexes because you already added the docs to the FSDir index.
You should be able to keep using the old searcher until the
close/optimization is complete. I suspect that you'll have to do something
here about not adding new docs to the RAMDir index while opening/closing
your FSDir index, but that's an exercise for the reader <G>...
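
In code, the skeleton might look something like this (untested, Lucene
2.0-style calls; the index path, analyzer and class name are just
placeholders for whatever you actually use):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DualIndexer {
    private final FSDirectory fsDir;
    private IndexWriter fsWriter;       // writes everything to disk
    private IndexSearcher fsSearcher;   // shared snapshot, reopened periodically
    private RAMDirectory ramDir;        // holds only the docs added since the
    private IndexWriter ramWriter;      // last reopen of fsSearcher

    public DualIndexer(String indexPath) throws Exception {
        fsDir = FSDirectory.getDirectory(indexPath, false);
        fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), false);
        fsSearcher = new IndexSearcher(fsDir);
        ramDir = new RAMDirectory();
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    }

    // Each incoming document goes to BOTH indexes.
    public synchronized void add(Document doc) throws Exception {
        fsWriter.addDocument(doc);
        ramWriter.addDocument(doc);
    }

    // Search both indexes, opening a fresh RAMDir searcher each time so the
    // newest docs show up. Closing/reopening the RAM writer forces its
    // buffered docs into the RAMDirectory; cheap for a small in-memory index.
    public synchronized Searcher openSearcher() throws Exception {
        ramWriter.close();
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), false);
        IndexSearcher ramSearcher = new IndexSearcher(ramDir);
        return new MultiSearcher(new Searchable[] { fsSearcher, ramSearcher });
    }

    // Periodically: flush the FSDir writer, reopen the FSDir searcher, and
    // start a fresh RAMDir. add() is blocked while this runs (synchronized),
    // which is the "exercise for the reader" above: otherwise a doc could
    // land in the new FSDir snapshot AND the old RAMDir, and show up twice.
    public synchronized void rollover() throws Exception {
        fsWriter.close();               // no public flush() in this era
        fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), false);
        IndexSearcher old = fsSearcher;
        fsSearcher = new IndexSearcher(fsDir);  // sees all docs added so far
        old.close();                    // in real code, wait for in-flight
                                        // searches to finish first
        ramWriter.close();
        ramDir = new RAMDirectory();    // discard the RAM index
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    }
}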

Since I haven't actually done this myself, I'd recommend some pretty
thorough tests, particularly to ensure that you don't get duplicate results
<G>.

Best
Erick

P.S. Wrapped around this is the explicit assumption that whether or not the
data is flushed to disk, it is *still* unavailable to an IndexReader until
you open a new reader.

P.P.S. I'd really push back at whoever is insisting that all searches be
real-time. It's often been my experience that "real-time" really means
"every 10 minutes will be fine", in which case your work would be much
easier, but only you can answer that....

On 9/7/06, HODAC, Olivier <olivier.hodac@airbus.com> wrote:
>
>
> Actually, tolerating latency is not possible.
>
> Do you think it is possible to tune the FS writer to flush to the file
> system every N elements (using maxMergeDocs and co.) and then do "for each
> doc of the RAM indexer, addDocument to the FS indexer"? What about
> performance?
>
> -----Original Message-----
> From: Doron Cohen [mailto:DORONC@il.ibm.com]
> Sent: Wednesday, September 6, 2006 9:00 PM
> To: java-user@lucene.apache.org
> Subject: Re: Indexer large file and hi performance indexing
>
>
>
> "HODAC, Olivier" <olivier.hodac@airbus.com> wrote on 06/09/2006 03:04:15:
> >
> > hello,
> >
> > I am designing an application whose bottleneck is the indexing
> > process. Indexing objects blocks the user's actions. Furthermore, I
> > need to index a large amount of documents (30000 per day) and save
> > them on the file system.
> >
> > The first developments were started with Lucene 1.4.3, and I had to
> > create a cachedIndexer which uses a RAMDirectory and an FS
> > directory. For searches, the cached indexer searches both the RAM and
> > FS indexes. For deletion, the delete is done in both the RAM and FS.
> >
> > Every N documents indexed in the cache (RAMDirectory), a thread
> > dumps them to the FSDirectory, with a synchronize.
> >
> > In 1 month, I reached an index .cfs file of 2.5 GB, and the merge
> > process (addIndexes) takes approximately 5 minutes (Solaris 5.9)!!!
> > => synchronize...
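> >
> > Roughly, the dump boils down to this (simplified sketch; the RAM-side
> > writer is closed beforehand so addIndexes can read the RAMDirectory):
> >
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.RAMDirectory;
> >
> > public class DumpThread {
> >     // One big lock around the whole merge; with the 2.5 GB index it is
> >     // held for the ~5 minutes the merge takes.
> >     public static synchronized void dump(IndexWriter fsWriter,
> >                                          RAMDirectory ramDir)
> >             throws Exception {
> >         fsWriter.addIndexes(new Directory[] { ramDir });
> >     }
> > }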
> >
> > Questions:
> > * Is it a good way of doing things?
>
> I am guessing that you implemented "searching in the RAM segments held by
> the writer" to make the search results as fresh as possible - i.e. so
> that every last document added is reflected in search.
>
> If this is not the case - i.e. if you could tolerate a latency of N
> documents (say 1000) and T minutes (say 15), whichever happens first, you
> could define maxBufferedDocs = N, this way letting Lucene manage the index
> on its own, and open the searcher(s) against the FSDirectory, as usual.
> This way you would be able to search while the index gets updated, even
> during merges. You would of course need to create a new searcher after
> flushes, in order for search to reflect recent index modifications. (The
> 'flush after T' logic is not in Lucene, so if desired it would need to be
> implemented above it.)
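>
> A rough, untested sketch of that setup (Lucene 2.0-style calls; N, the
> path and the class name are mine, just for illustration):
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.store.FSDirectory;
>
> public class BatchedIndexer {
>     private static final int N = 1000;      // tolerated latency, in docs
>     private final FSDirectory dir;
>     private final IndexWriter writer;
>     private IndexSearcher searcher;         // shared by all queries
>     private int docsSinceReopen = 0;
>
>     public BatchedIndexer(String indexPath) throws Exception {
>         dir = FSDirectory.getDirectory(indexPath, false);
>         writer = new IndexWriter(dir, new StandardAnalyzer(), false);
>         writer.setMaxBufferedDocs(N);       // Lucene flushes every N docs
>         searcher = new IndexSearcher(dir);
>     }
>
>     public synchronized void add(Document doc) throws Exception {
>         writer.addDocument(doc);
>         if (++docsSinceReopen >= N) {       // a flush just happened, so
>             reopenSearcher();               // make it visible to search
>         }
>     }
>
>     // Also call this from a timer thread for the 'flush after T' logic.
>     // Caveat: there is no public flush() in this era, so forcing buffered
>     // docs out on the timer means closing and reopening the writer first.
>     public synchronized void reopenSearcher() throws Exception {
>         docsSinceReopen = 0;
>         IndexSearcher old = searcher;
>         searcher = new IndexSearcher(dir);  // sees all flushed segments
>         old.close();                        // in real code, wait for
>     }                                       // in-flight searches first
>
>     public synchronized IndexSearcher getSearcher() { return searcher; }
> }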
>
> If such latency is not possible, and search must include every recently
> added document, I would assume that finer-grained synchronization could
> work - for instance, do not sync on the entire merge, but only on the
> deletion of merged segments from the queue (keep the old, to-be-deleted
> segments aside as long as existing searchers still use them). This is
> much more complicated though; it would take more memory, and would most
> likely break as the Lucene source evolves, so better not to take this
> path unless you must.
>
> > * Is maxMergeDocs a solution for making the merge process shorter?
> > * If I call addIndexes on the 2.5 GB index with maxMergeDocs set, the
> > javadoc says that it calls optimize, which gathers the index directory
> > into one file. Is that true whatever the maxMergeDocs value?
>
> maxMergeDocs only affects merge decisions during addDocument(), so it has
> no effect during optimize(), nor during addIndexes(), which calls
> optimize() - there the index is merged into a single segment.
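>
> For instance (untested; RAMDirectory stands in for both directories here
> just to keep the example self-contained):
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class MergeDemo {
>     public static void main(String[] args) throws Exception {
>         Directory mainDir = new RAMDirectory();  // your big FSDir index
>         Directory cacheDir = new RAMDirectory(); // your cached index
>
>         // Put one document into the index that will be merged in.
>         IndexWriter cache =
>             new IndexWriter(cacheDir, new StandardAnalyzer(), true);
>         Document doc = new Document();
>         doc.add(new Field("body", "hello",
>                           Field.Store.YES, Field.Index.TOKENIZED));
>         cache.addDocument(doc);
>         cache.close();
>
>         IndexWriter writer =
>             new IndexWriter(mainDir, new StandardAnalyzer(), true);
>         writer.setMaxMergeDocs(100000);  // honored only by addDocument()
>         writer.addIndexes(new Directory[] { cacheDir });
>         // addIndexes() optimizes: the result is a single segment,
>         // no matter what maxMergeDocs says.
>         writer.close();
>     }
> }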