lucene-java-user mailing list archives

From "Suman Ghosh" <suman.ghos...@gmail.com>
Subject Re: StackOverflowError while calling IndexReader.deleteDocuments(new Term())
Date Tue, 28 Nov 2006 17:34:22 GMT
Mike,

Below is the pseudo-code for the application. A few implementation
points to help understand it:

  - We have a home-grown threadpool class that allows us to index
    multiple documents in parallel. We usually submit 200 jobs to the
    pool (which typically runs 2-3 worker threads) and, once those
    jobs are finished, submit the next set (a rough JDK-equivalent
    sketch follows this list).
  - All metadata for a document comes from an Oracle database. We
    retrieve the metadata in the form of an XML document.
  - The indexing routine is designed with incremental indexing in
    mind. We intend to perform a full index build once and continue
    with incremental indexing from that point onward (on average,
    200-300 documents modified/added each day).
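
For reference, here is roughly what one batch of jobs looks like if
written against JDK 5's java.util.concurrent instead of our home-grown
pool (keys and indexOneDocument are placeholders for our document keys
and the job body described in the pseudo-code below):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(3);  // 2-3 workers
for (int i = 0; i < 200; i++) {                          // one job per document
    final String key = keys[i];                          // placeholder document key
    pool.execute(new Runnable() {
        public void run() {
            indexOneDocument(key);                       // placeholder job body
        }
    });
}
pool.shutdown();                                         // no more jobs this batch
try {
    pool.awaitTermination(1, TimeUnit.HOURS);            // "wait till jobs are finished"
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}

Our real pool is reused across batches rather than shut down each time,
but the submit-and-wait cycle is equivalent.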

Here is the pseudo-code. Please feel free to point out any implementation
issues that might cause the problem.

====================BEGIN===========================
Initialization:
  get database connection
  get threadpool instance

IndexBuilder:
  for (;;) {
      get the next 200 documents (from the database) to be indexed.
      Values returned are a key for the document and the metadata XML

      exit if no more documents are available

      // first remove all the documents (to be updated) from the
      // index in one pass, instead of deleting and re-inserting
      // them one at a time
      get IndexReader instance
      for all these documents {
          use reader.deleteDocuments(new Term("KEY", document key))
      }
      finally close the IndexReader instance

      // Now add these documents to the index
      get IndexWriter instance and
          set MergeFactor = 10
          set MaxMergeDocs = 100000
          set MaxFieldLength = 500000

      for all these documents {
          add a job to the threadpool with the IndexWriter instance
            and the document metadata
      }
      and wait till jobs are finished

      finally close the IndexWriter instance
  } //end for

  get IndexWriter instance and
      optimize index
  finally close the IndexWriter instance

housekeeping:
    finally close threadpool and database connection

Threadpool job:

    read the individual metadata from XML and construct a Lucene
    Document object

    Determine if there is an associated file for the document
    (usually PDF/Word/Excel/PPT). If so, extract the text from that
    file and put it in a field called FULLTEXT so the file contents
    can be searched specifically.

    Use the IndexWriter instance (supplied with the job) to add
    the document to the Lucene index

====================END===========================
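
For concreteness, here is a rough Java sketch of the main loop against
the Lucene 2.0-era API we use (an illustration, not our exact code:
fetchNextBatch, DocRecord, and buildDocument are placeholders for our
database/XML plumbing, and the threadpool is omitted so the sequence
of Lucene calls is easier to follow):

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.getDirectory("/path/to/index", false);

for (;;) {
    List batch = fetchNextBatch(200);   // placeholder: keys + metadata XML
    if (batch.isEmpty()) break;

    // pass 1: delete the stale copies of every document in the batch
    IndexReader reader = IndexReader.open(dir);
    try {
        for (Iterator it = batch.iterator(); it.hasNext();) {
            DocRecord rec = (DocRecord) it.next();  // placeholder record
            reader.deleteDocuments(new Term("KEY", rec.key));
        }
    } finally {
        reader.close();
    }

    // pass 2: re-add the batch with a fresh writer
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    try {
        writer.setMergeFactor(10);
        writer.setMaxMergeDocs(100000);
        writer.setMaxFieldLength(500000);
        for (Iterator it = batch.iterator(); it.hasNext();) {
            DocRecord rec = (DocRecord) it.next();
            writer.addDocument(buildDocument(rec));  // parse XML, extract FULLTEXT
        }
    } finally {
        writer.close();
    }
}

// final pass: optimize the whole index
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
try {
    writer.optimize();
} finally {
    writer.close();
}

Note that the IndexReader and the IndexWriter are each opened and
closed once per 200-document batch.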

Thanks,

Suman


On 11/27/06, Michael McCandless <lucene@mikemccandless.com> wrote:
> Suman Ghosh wrote:
>
> > On 11/27/06, Yonik Seeley <yonik@apache.org> wrote:
> >> On 11/27/06, Suman Ghosh <suman.ghosh.1@gmail.com> wrote:
> >> > Here are the values:
> >> >
> >> > mergeFactor=10
> >> > maxMergeDocs=100000
> >> > minMergeDocs=100
> >> >
> >> > And I see your point. At the time of the crash, I have over 5000
> >> > segments. I'll try some conservative number and try to rebuild the
> >> > index.
> >>
> >> Although I don't see how those settings can produce 5000 segments,
> >> I've developed a non-recursive patch you might want to try:
> >> https://issues.apache.org/jira/browse/LUCENE-729
>
> Suman, I'd really like to understand how you're getting so many
> segments in your index.  Is this (getting 5000 segments) easy to
> reproduce?  Are you closing / reopening your writer every so often (eg
> to delete documents or something)?
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
