lucene-java-user mailing list archives

From "Will Allen" <>
Subject RE: Acedemic Question About Indexing
Date Thu, 11 Nov 2004 19:21:49 GMT
I have a servlet that instantiates a MultiSearcher over 6 indexes:
(du -h)
7.2G    ./0
7.2G    ./1
7.2G    ./2
7.2G    ./3
7.2G    ./4
7.2G    ./5
43G     .

I recreate the index from scratch each month from a 50 GB zip file containing all 40
million documents.  I wanted to keep indexing time as low as possible without hurting
search performance too much, since each searcher allocates memory proportional to the
number of terms in its index.  Terms overlap heavily across documents, so a single large
index needs less memory than multiple indexes.

Anyway, for indexing, I am able to index ~100 documents per second.  The total indexing process
takes 2.5 days.  I have a powerful machine with 2 hyperthreaded processors (Linux sees 4 processors)
and 1 GB of RAM.  I also have pretty fast SCSI disks.

I perform no updates or deletes on my indexes.

The indexing process divides the work equally among the indexers.  The bottleneck of the
indexing process is not memory or CPU but the disk I/O of the 6 writers.  If I had faster
disks, I could run more indexers.
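
A minimal sketch of that equal split, in plain Java rather than Lucene code. Everything named here is hypothetical: ShardWriter only counts documents (a real job would wrap one Lucene index writer per directory, ./0 through ./5), and round-robin assignment is an assumption about how the work gets divided, not necessarily the scheme the monthly job uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedIndexingSketch {

    /** Hypothetical stand-in for one index writer; it only counts documents. */
    static final class ShardWriter {
        private int count;
        void add(String doc) { count++; } // a real writer would add doc to its own index
        int count() { return count; }
    }

    /** Maps the n-th document to a shard round-robin, so shards stay equal in size. */
    static int shardFor(long docNumber, int shardCount) {
        return (int) (docNumber % shardCount);
    }

    /** Splits docs across shardCount writers, one thread each; returns per-shard counts. */
    static int[] indexAll(List<String> docs, int shardCount) {
        // Assign every document to a batch before any thread starts.
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            batches.add(new ArrayList<>());
        }
        for (int n = 0; n < docs.size(); n++) {
            batches.get(shardFor(n, shardCount)).add(docs.get(n));
        }

        ExecutorService pool = Executors.newFixedThreadPool(shardCount);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < shardCount; i++) {
                final ShardWriter writer = new ShardWriter(); // one writer per thread
                final List<String> batch = batches.get(i);
                results.add(pool.submit(() -> {
                    for (String doc : batch) {
                        writer.add(doc);
                    }
                    return writer.count();
                }));
            }
            int[] counts = new int[shardCount];
            for (int i = 0; i < shardCount; i++) {
                counts[i] = results.get(i).get(); // wait for shard i to finish
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            docs.add("doc-" + i);
        }
        int[] counts = indexAll(docs, 6);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("./" + i + ": " + counts[i] + " documents");
        }
    }
}
```

Each writer is owned by exactly one thread, so the writers need no locking between them. Because disk I/O is the limit, raising shardCount beyond what the disks can absorb would not be expected to help.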

-----Original Message-----
From: Sodel Vazquez-Reyes
Sent: Thursday, November 11, 2004 11:37 AM
To: Lucene Users List
Cc: Will Allen
Subject: Re: Acedemic Question About Indexing

Could you give more details about your architecture?
- whether you update the indexes each time or create new ones
- what data is stored in each index

It is quite interesting, and I would like to test it.


Quoting Luke Shannon <>:

> 40 Million! Wow. Ok this is the kind of answer I was looking for. The site I
> am working on indexes maybe 1000 at any given time. I think I am ok with a
> single index.
> Thanks.
> ----- Original Message -----
> From: "Will Allen" <>
> To: "Lucene Users List" <>
> Sent: Wednesday, November 10, 2004 7:23 PM
> Subject: RE: Acedemic Question About Indexing
> I have an application that I run monthly that indexes 40 million documents
> into 6 indexes, then uses a multisearcher.  The advantage for me is that I
> can have multiple writers indexing 1/6 of that total data reducing the time
> it takes to index by about 5X.
> -----Original Message-----
> From: Luke Shannon []
> Sent: Wednesday, November 10, 2004 2:39 PM
> To: Lucene Users List
> Subject: Re: Acedemic Question About Indexing
> Don't worry, regardless of what I learn in this forum I am telling my
> company to get me a copy of that bad boy when it comes out (which as far as
> I am concerned can't be soon enough). I will pay for grama's myself.
> I think I have reviewed the code you are referring to and have something
> similar working in my own indexer (using the "uid"). All is well.
> My stupid question for the day is why would you ever want multiple indexes
> running if you can build one smart indexer that does everything as
> efficiently as possible? Does the answer to this question move me to
> multithreaded indexing territory?
> Thanks,
> Luke
> ----- Original Message -----
> From: "Otis Gospodnetic" <>
> To: "Lucene Users List" <>
> Sent: Wednesday, November 10, 2004 2:08 PM
> Subject: Re: Acedemic Question About Indexing
>> Uh, I hate to market it, but.... it's in the book.  But you don't have
>> to wait for it, as there already is a Lucene demo that does what you
>> described.  I am not sure if the demo always recreates the index or
>> whether it deletes and re-adds only the new and modified files, but if
>> it's the former, you would only need to modify the demo a little bit to
>> check the timestamps of File objects and compare them to those stored
>> in the index (if they are being stored - if not, you should add a field
>> to hold that data)
>> Otis
>> --- Luke Shannon <> wrote:
>> > I am working on debugging an existing Lucene implementation.
>> >
>> > Before I started, I built a demo to understand Lucene. In my demo I
>> > indexed the entire content hierarchy all at once, then optimized
>> > this index and used it for queries. It was time consuming but very
>> > simple.
>> >
>> > The code I am currently trying to fix indexes the content hierarchy
>> > by folder, creating a separate index for each one. Thus it ends up
>> > with a bunch of indexes. I still don't understand how this works (I
>> > am assuming they get merged somewhere that I haven't tracked down
>> > yet) but I have noticed it doesn't always index the right folder.
>> > This results in the users reporting "inconsistent" behavior in
>> > searching after they make a change to a document.
>> >
>> > To keep things simple I would like to remove all the logic that
>> > figures out which folder to index and just do them all (usually
>> > less than 1000 files) so I end up with one index.
>> >
>> > Would indexing time be the only area I would be losing out in, or is
>> > there something more to the approach of creating multiple indexes
>> > and merging them?
>> >
>> > What is a good approach I can take to indexing a content hierarchy
>> > composed primarily of pdf, xsl, doc and xml where any of these
>> > documents can be changed several times a day?
>> >
>> > Thanks,
>> >
>> > Luke
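
The suggestion quoted above from Otis (compare each file's timestamp with the one stored in the index, then delete and re-add only what changed) might look roughly like this in plain Java. This is a sketch under stated assumptions, not the Lucene demo's code: the two maps are hypothetical stand-ins for "path to timestamp stored in the index" and "path to File.lastModified() on disk", and the DELETE action for files that have disappeared is an addition not mentioned in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class IncrementalIndexPlan {

    /** What to do with a path whose index entry and on-disk state differ. */
    enum Action { ADD, REINDEX, DELETE }

    /**
     * Compares timestamps stored in the index with those currently on disk.
     * Paths whose timestamps match are left out of the result entirely.
     */
    static Map<String, Action> plan(Map<String, Long> indexed, Map<String, Long> onDisk) {
        Map<String, Action> actions = new TreeMap<>();
        for (Map.Entry<String, Long> file : onDisk.entrySet()) {
            Long stored = indexed.get(file.getKey());
            if (stored == null) {
                actions.put(file.getKey(), Action.ADD);      // new file, never indexed
            } else if (!stored.equals(file.getValue())) {
                actions.put(file.getKey(), Action.REINDEX);  // modified: delete, then re-add
            }
        }
        for (String path : indexed.keySet()) {
            if (!onDisk.containsKey(path)) {
                actions.put(path, Action.DELETE);            // assumption: drop vanished files
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; real values would come from the index and
        // from java.io.File.lastModified() for each file in the hierarchy.
        Map<String, Long> indexed = new HashMap<>();
        indexed.put("docs/a.pdf", 100L);
        indexed.put("docs/b.doc", 200L);
        indexed.put("docs/c.xml", 300L);

        Map<String, Long> onDisk = new HashMap<>();
        onDisk.put("docs/a.pdf", 100L); // unchanged
        onDisk.put("docs/b.doc", 250L); // modified since it was indexed
        onDisk.put("docs/d.xml", 400L); // new

        for (Map.Entry<String, Action> e : plan(indexed, onDisk).entrySet()) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

With around 1000 files, building both maps on every run should be cheap, and only the changed files ever reach the indexer.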
