lucene-java-user mailing list archives

From Lokeya <lok...@gmail.com>
Subject Re: Issue while parsing XML files due to control characters, help appreciated.
Date Thu, 22 Mar 2007 05:09:06 GMT

Initially I was writing into the index 700,000 times. I changed the code to
write only 70 times, which means I now collect a lot of data in an array
list, add it all to a doc, and index in one shot. This is where the
improvement came from. To be precise, IndexWriter is now adding a document
70 times vs. 700,000 times.

If this is not clear, please let me know. I haven't pasted the latest code,
where I have fixed the lock issue as well; if required I can do that.

Thanks, everyone, for the quick turnaround; it really helped me a lot.


Doron Cohen wrote:
> 
> Lokeya <lokeya@gmail.com> wrote on 18/03/2007 13:19:45:
> 
>>
>> Yep I did that, and now my code looks as follows.
>> The time taken for indexing one file is now
>> => Elapsed Time in Minutes :: 0.3531
>> which is really great
> 
> I am jumping in late, so apologies if I am missing something.
> However, I don't understand where this improvement came
> from (more below).
> 
>>
>> // Create an index under ./LUCENE
>> IndexWriter writer = new IndexWriter("./LUCENE", new
>>                                      StopStemmingAnalyzer(),
>>                                      false);
>> Document doc = new Document();
>>
>> // Get array list elements and add them as fields to the doc
>> for (int k = 0; k < alist_Title.size(); k++) {
>>   doc.add(new Field("Title",
>>                     alist_Title.get(k).toString(),
>>                     Field.Store.YES,
>>                     Field.Index.UN_TOKENIZED));
>> }
>>
>> for (int k = 0; k < alist_Descr.size(); k++) {
>>   doc.add(new Field("Description",
>>                     alist_Descr.get(k).toString(),
>>                     Field.Store.YES,
>>                     Field.Index.UN_TOKENIZED));
>> }
>>
>> // Add the document created from those fields to
>> // the IndexWriter, which will index it
>> writer.addDocument(doc);
>> writer.optimize();
>> writer.close();
>>
>> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
>> System.out.println("Elapsed Time for  "+
>>                    dirName +" :: " + elapsedTimeMillis);
>> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
>> System.out.println("Elapsed Time in Minutes ::  "+
>>                    elapsedTimeMin);
> 
> The code listed above is still, for each indexed document,
> doing this:
> 
>   1. open index writer
>   2. create a document (and add fields to it)
>   3. add the doc to the index
>   4. optimize the index
>   5. close the index writer
> 
> As already mentioned here, this is very suboptimal,
> so I don't understand where the improvement came from.
> 
> It should be done like this (rough code sketch below):
> 
>   1. open/create the index
>   2. for each valid doc input {
>        a. create a document (add fields)
>        b. add the doc to the index
>      }
>   3. optimize the index (optionally)
>   4. close the index
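> 
> In code, roughly - an untested sketch, where StopStemmingAnalyzer is
> your analyzer and createDocument() is the helper suggested just below:
> 
>   // assumed imports: java.io.IOException, java.util.List,
>   // org.apache.lucene.document.Document, org.apache.lucene.index.IndexWriter
>   void indexAll(List fileNames) throws IOException {
>     // open/create the index writer exactly once
>     IndexWriter writer = new IndexWriter("./LUCENE",
>                                          new StopStemmingAnalyzer(),
>                                          true);
>     for (int i = 0; i < fileNames.size(); i++) {
>       Document doc = createDocument((String) fileNames.get(i));
>       writer.addDocument(doc);   // just add; no optimize/close here
>     }
>     writer.optimize();   // once, at the end (optional)
>     writer.close();      // once
>   }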
> 
> It would probably also help both coding and discussion to capture
> some logic in methods, e.g.
> 
> IndexWriter createIndex();
> IndexWriter openIndex();
> void closeIndex();
> Document createDocument(String oai_citeseer_fileName)
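> 
> For instance, createDocument() could be fleshed out like this - only a
> sketch, with parseTitle()/parseDescr() standing in for your XML parsing:
> 
>   Document createDocument(String oai_citeseer_fileName) {
>     Document doc = new Document();
>     String title = parseTitle(oai_citeseer_fileName);   // your parsing
>     String descr = parseDescr(oai_citeseer_fileName);   // your parsing
>     doc.add(new Field("Title", title,
>                       Field.Store.YES, Field.Index.UN_TOKENIZED));
>     doc.add(new Field("Description", descr,
>                       Field.Store.YES, Field.Index.UN_TOKENIZED));
>     return doc;
>   }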
> 
>> but after processing 4 dump files (which means
>> 40,000 small XMLs), I get:
>>
>> caught a class java.io.IOException
>>  40114  with message: Lock obtain timed out:
>> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>>
>>
>> This is the new issue now. What could be the reason? I am surprised
>> because I am only writing to the index under ./LUCENE/ every time and
>> not doing anything else with the index (of course, to avoid such
>> synchronization issues!)
> 
> If the code listed above is getting this exception, and this runs on
> Windows, then perhaps it is related to issue 665: intensive IO activity,
> with the write lock acquired and released for each and every added
> document. If so, this too would be avoided by opening the index just once.
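> 
> Side note, assuming I remember the Lucene 2.x static helpers right
> (IndexReader.isLocked() and IndexReader.unlock()): if a crash ever
> leaves a stale lock file behind, it can be detected and cleared like
> this, but only when you are sure no writer is running - and with the
> writer opened just once, this should rarely be needed:
> 
>   Directory dir = FSDirectory.getDirectory("./LUCENE");
>   if (IndexReader.isLocked(dir)) {
>     IndexReader.unlock(dir);   // safe only when no IndexWriter is open
>   }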
> 
> Last comment - I did not understand whether a single index is used for all
> 70 metadata files, or each one has its own index, totalling 70 indexes.
> I would think it should be only one index for all, and then you had better
> avoid calling optimize 70 times - that would waste time too. Better to
> organize the code like this (sketched in code below):
> 
> 1. create/open index
> 2. for each meta data file {
>    a. for each "valid element" in meta data file {
>         a.1 create document
>         a.2 add doc to index
>       }
>    }
> 3. optimize
> 4. close index
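> 
> Or, as a rough sketch in code - metadataFiles and parseValidElements()
> are placeholders for your own file list and XML parsing, and here
> createDocument() builds a Document from one parsed element:
> 
>   IndexWriter writer = new IndexWriter("./LUCENE",
>                                        new StopStemmingAnalyzer(),
>                                        true);   // one index, opened once
>   for (int f = 0; f < metadataFiles.length; f++) {
>     List elements = parseValidElements(metadataFiles[f]);
>     for (int k = 0; k < elements.size(); k++) {
>       writer.addDocument(createDocument(elements.get(k)));
>     }
>   }
>   writer.optimize();   // once for the whole index, not 70 times
>   writer.close();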
> 
> Regards,
> Doron


