lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Indexing problems in a dictionary
Date Sat, 03 Sep 2005 18:46:11 GMT
Ahmet,

On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:
> Hi,
> I'm using Lucene in an open source java project at 
> http://belletmen.dev.java.net .
> In the project there are several dictionaries with a simple structure. 
> All items are composed of a "phrase", and a "definition". Both parts 
> might contain a single word, or have lots of words.
> Since both parts  might contain multiple  words,   I used the following:
>     private Document buildDocument(SozlukBirimi birim){
>         Document doc = new Document();
>         doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word 
> in Turkish
>         doc.add(Field.Text("soz1", birim.getSoz()));//the same as 
> keyword part
>         doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means 
> meaning in Turkish
>         return doc;
>     }
>  As you can see, I used the first part both as a keyword field, and a 
> text field. The reason is that the program will try to find phrases, or 
> single words in the first part also.
> At the first stages of the application, there were a single 
> English-Turkish dictionary, and I had used an analyzer in which both 
> English and Turkish stop words are included.
> And, here my questions:
> 1- Do you think whether the above system is a good solution for a 
> dictionary, or not?

Keyword is used for searching exactly the same term, and this is normally
reserved for things like a primary key. When the word (soz) has that role,
it is ok. One way to determine a primary key is to consider how to
delete a Document from the index.

In case you anticipate searching on both the words and their meanings,
you might also consider to index them concatenated together in another field.

> 2- I'm in hesitation now, about using stop words in a dictionary. What 
> do you think?

For the word field, they are needed, stopwords are words.
For the meaning field, you need to determine whether stopwords are
necessary in the index.
It is possible to use a different analyzer per field.

> 3- I have a quite big timing problem. For a 107155 items of an 
> English-English dictionary, it took 1436 seconds to complete the 
> indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it 
> normal? Or, am I in a completely wrong way?

Since each entry is probably rather small, it is possible to keep many
of them in RAM before writing them to disk during index building.
You could try and increase IndexWriter.minMergeFactor first
and consider the other merge factors of IndexWriter there later.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message