lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Cadena <robert.cad...@gmail.com>
Subject Re: how to index large number of files?
Date Thu, 21 Oct 2010 19:34:39 GMT
Take a look at this question and associated answer over on stackoverflow:

http://stackoverflow.com/questions/354703/is-there-a-workaround-for-javas-poor-performance-on-walking-huge-directories

It's not all inside java but it might work for you and you might not have to restructure your
files. 



Sent from my iPhone

On Oct 21, 2010, at 12:26 PM, Sahin Buyrukbilen <sahin.buyrukbilen@gmail.com> wrote:

> I dont know why I am getting this error, but it looks normal to me now.
> because when I try to list the contents of the folder  I cannot get a
> response from linux shell. Now I have created a folder with 100.000 files
> and running eclipse with -Xmx2G parameter. it is still indexing for about 15
> minutes now, but I am happy it works.
> 
> After this I will try Toke's method. Create 100.000 filed folders and try to
> index them recursively.
> 
> On Thu, Oct 21, 2010 at 4:57 AM, Toke Eskildsen <te@statsbiblioteket.dk>wrote:
> 
>> On Thu, 2010-10-21 at 05:01 +0200, Sahin Buyrukbilen wrote:
>>> Unfortunately both methods didnt go through. I am getting memory error
>> even
>>> at reading the directory contents.
>> 
>> Then your problem is probably not Lucene related, but the sheer number
>> of files returned by listFiles.
>> 
>> A Java File contains the full path name for the file. Let's say that
>> this is 50 characters, which translates to about (50 * 2 + 45) ~ 150
>> bytes for the Java String. Add an int (4 bytes) plus bookkeeping and
>> we're up to about 200 bytes/File.
>> 
>> 4.5 million Files thus takes up about 1 GB. Not enough to explain the
>> OOM, but if the full path name of your files is 150 characters, the list
>> takes up 2 GB.
>> 
>>> Now, I am thinking this: What if I split 4.5million files into 100.000
>> (or
>>> less depending on java error) files directories, index each of them
>>> separately and merge those indexes(if possible).
>> 
>> You don't need to create separate indexes and merge them. Just split
>> your 4.5 million files into folders of more manageable sizes and perform
>> a recursive descend. Something like
>> 
>> public static void addFolder(IndexWriter writer, File folder) {
>> File[] files = folder.listFiles();
>> for (File file: files) {
>>   if (file.isDirectory()) {
>>     addFolder(writer, file);
>>   } else {
>>     // Create Document from file and add it using the writer
>>   }
>> }
>> }
>> 
>> - Toke
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message