lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lichtner, Guglielmo" <>
Subject RE: Indexing very large sets (10 million docs)
Date Mon, 28 Jul 2003 15:59:43 GMT

That's 46 hits/s. That's not bad, actually.

It's an interesting problem. It certainly seems that when index such a large
number of documents the indexing should be parallel. So far I have assumed
Lucene is not able to use multiple threads to speed up the indexing run.

If it did, I guess it would help you assuming that 1) the bottleneck is in
the indexing
and not on the Oracle query (pretty likely) and 2) that your i/o is not the
- if it were it might be possible to spread out the index across multiple

If there is a natural way to partition the documents (there might be, since
they are so many), then you can manage the partition yourself and just
search on 
multiple indexes. It probably also makes the system more robust. In case
accidentally deletes one of the indices. Also, since the each index (I
think) has
one lock file, it should definitely increase concurrency when searching ..

Perhaps Doug Cutting can comment on the parallelization issue?

-----Original Message-----
From: Roger Ford []
Sent: Monday, July 28, 2003 11:23 AM
Subject: Indexing very large sets (10 million docs)

I'm trying to index 10 million small XML-like documents, extracted
from an Oracle database.

Lucene version is 1.2, and I'm using RedHat 7.0 Advanced Server,
on an AMD XP1800+ with 1GB RAM and 46GB+120GB hard disks. The
database is on a separate machine, connected by thin JDBC.

Each document consists of around 10 fields of the form

<field1>some text</field1>
<field2>more text</field2>
<field10> again</field10>

Each document is typically only around 2K in size. Each field is 
free-text indexed, but only the "key" field is stored.

After experimenting, I've set
   Java memory to 750MB
   writer.mergeFactor = 10000
   - and run an optimize every 50,000 documents to try to keep down the
number of open files.

So far, the furthest I've managed to get is 3 million docs,
which used

   22,913 simultaneous open files (the maximum on Windows is 2035!)
   Up to 79 GB of disk space
   Around 18 hours of indexing time

Is what I'm trying to do realistic?  Does anyone else have experience
of indexing collections this large?

My indexing code, which is based on the simple file demo, can be seen

- Many thanks,


To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message