lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nader Henein <...@bayt.net>
Subject Re: incrementally indexing a million documents
Date Tue, 15 Jun 2004 07:47:59 GMT
How are you documents named? is it alphabetical or numerical, Mine where 
numerical so I I creates n directories like so
11 , 12, 13, 14, .... 19, 21 , 22 , 23 .. ........ 99   you get the idea 
and I stored the files into the directories that each belonged to 
depending on the last two numbers in the file name (you could use file 
size to shuffle the files around as well (ie, use the 2 rightmost  
numbers in the file size in bytes)  so at this point you'll have 
shuffled your million docs into 100 directories and then  Lucene can 
spider through each set of directories indexing let's say 5000 files at 
a time and then deleting them or moving them into another location, it 
you get 100 million files simply up the precision on the directory to a 
3 digit setup or a 4 digit setup (once you automate it, sky's the limit) 

Hope this helps

Nader Henein


Michael Wechner wrote:

> I try to index around a million documents. The problem is
> that I run out of memory during sorting by uid when I go through
> the directory recursively.
>
> Well, I could add more memory, but this wouldn't really solve my problem,
> because at some point I will always run out of memory (e.g. 10 million 
> documents).
>
> Is there another approach than sorting by uid?
>
> Thanks
>
> Michi
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message