lucene-java-user mailing list archives

From Ahmet Aksoy <ahme...@axtelsoft.com>
Subject Re: Indexing problems in a dictionary
Date Sat, 03 Sep 2005 21:33:38 GMT
Hi Paul,

I decided to use a minimum number of stop words in my application. I 
hope it will work better.

Following your suggestion, I ran a few trials and found my optimum 
values.

The following values look best in my case:
mergeFactor = 100;
minMergeDocs = 500;
maxMergeDocs = 1000;
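
As a rough illustration of why a larger minMergeDocs helps, here is a small back-of-the-envelope model in plain Java. It does not use Lucene at all. It assumes, as a simplification, that the writer buffers minMergeDocs documents in RAM and then writes one segment to disk, so the number of disk flushes is roughly the document count divided by minMergeDocs. The class and method names are made up for this sketch, and the baseline of 10 is what I believe the default was at the time.

```java
// Simplified model, not Lucene code: counts how many times a writer would
// flush a segment to disk if it buffers a fixed number of documents in RAM.
class FlushModel {

    // Ceiling division: every full buffer is one flush, plus one more
    // flush for a partially filled buffer at the end.
    static long flushCount(long totalDocs, long bufferedDocs) {
        if (totalDocs < 0 || bufferedDocs <= 0) {
            throw new IllegalArgumentException("need totalDocs >= 0 and bufferedDocs > 0");
        }
        return (totalDocs + bufferedDocs - 1) / bufferedDocs;
    }

    public static void main(String[] args) {
        long docs = 107155; // size of the English-English dictionary in this thread

        // Assumed baseline of 10 buffered documents versus the tuned value of 500.
        System.out.println("minMergeDocs=10:  " + flushCount(docs, 10) + " flushes");
        System.out.println("minMergeDocs=500: " + flushCount(docs, 500) + " flushes");
    }
}
```

Under this model the tuned setting writes about fifty times fewer segments. The real speedup also depends on mergeFactor, maxMergeDocs, and disk speed, so the model only indicates the direction of the effect.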

Now, indexing is performed 8 to 30 times faster than before.
Thanks a lot.
Ahmet Aksoy

Paul Elschot wrote:

>Ahmet,
>
>On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:
>
>>Hi,
>>I'm using Lucene in an open source java project at 
>>http://belletmen.dev.java.net .
>>In the project there are several dictionaries with a simple structure. 
>>All items are composed of a "phrase", and a "definition". Both parts 
>>might contain a single word, or have lots of words.
>>Since both parts might contain multiple words, I used the following:
>>    private Document buildDocument(SozlukBirimi birim){
>>        Document doc = new Document();
>>        // "soz" means "word" in Turkish
>>        doc.add(Field.Keyword("soz", birim.getSoz()));
>>        // same text as the keyword field
>>        doc.add(Field.Text("soz1", birim.getSoz()));
>>        // "anlam" means "meaning" in Turkish
>>        doc.add(Field.Text("anlam", birim.getAnlam()));
>>        return doc;
>>    }
>>As you can see, I used the first part both as a keyword field and as a 
>>text field. The reason is that the program will also try to find phrases 
>>or single words in the first part.
>>In the first stages of the application, there was a single 
>>English-Turkish dictionary, and I used an analyzer that includes both 
>>English and Turkish stop words.
>>Here are my questions:
>>1- Do you think the above system is a good solution for a 
>>dictionary?
>
>Keyword is used for searching exactly the same term, and this is normally
>reserved for things like a primary key. When the word (soz) has that role,
>it is ok. One way to determine a primary key is to consider how to
>delete a Document from the index.
>
>In case you anticipate searching on both the words and their meanings,
>you might also consider indexing them concatenated together in another field.
>
>
>>2- I'm now hesitant about using stop words in a dictionary. What 
>>do you think?
>
>For the word field they are needed, because stopwords are words too.
>For the meaning field, you need to determine whether stopwords are
>necessary in the index.
>It is possible to use a different analyzer per field.
>
>
>>3- I have quite a big timing problem. For the 107155 items of an 
>>English-English dictionary, it took 1436 seconds to complete the 
>>indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is that 
>>normal? Or am I going about this completely the wrong way?
>
>Since each entry is probably rather small, it is possible to keep many
>of them in RAM before writing them to disk during index building.
>You could try increasing IndexWriter.minMergeDocs first
>and consider the other merge settings of IndexWriter later.
>
>Regards,
>Paul Elschot

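Paul's point about using a different analyzer per field can be sketched in plain Java without Lucene. The idea is that the word field ("soz1") keeps every token, since a stop word can itself be a dictionary entry, while the meaning field ("anlam") may drop stop words. Everything below is hypothetical and only illustrates the idea: the class name, the tiny stop word list, and the tokenizing rule are assumptions, not Lucene's behaviour. In Lucene itself, to my knowledge, PerFieldAnalyzerWrapper plays this role.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustration only: picks a tokenizing rule based on the field name.
class PerFieldAnalysisSketch {

    // Hypothetical stop word list, far smaller than a real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));

    // Lowercases the text and splits on anything that is not a letter.
    // Every word is kept, stop words included.
    static List<String> keepAll(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Same tokenizing rule, but stop words are removed afterwards.
    static List<String> dropStopWords(String text) {
        List<String> tokens = keepAll(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    // Field name -> analysis rule. Field names follow the thread.
    static final Map<String, Function<String, List<String>>> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("soz1", PerFieldAnalysisSketch::keepAll);
        ANALYZERS.put("anlam", PerFieldAnalysisSketch::dropStopWords);
    }

    // Unknown fields fall back to keeping every token.
    static List<String> analyze(String field, String text) {
        return ANALYZERS.getOrDefault(field, PerFieldAnalysisSketch::keepAll).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("soz1", "The"));
        System.out.println(analyze("anlam", "The definite article of English"));
    }
}
```

With this split, an entry whose headword is "the" stays searchable in the word field, while the meaning field drops the same word as noise.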