lucene-dev mailing list archives

From "Nicolas Maisonneuve" <nico.maisonne...@free.fr>
Subject NGramSpeller for n field
Date Fri, 08 Oct 2004 15:08:17 GMT
Hi,
I would like to use David Spencer's NGramSpeller for N fields in an index.
With this algorithm, one field means one NGramSpeller index.
So if I have N fields, I must create N NGramSpeller indexes. OK, why not... but in fact the document structure
for a 5-gram speller (for example) is:
"word"
"transposition"
"3gram"
"4gram"
"5gram"
+ the field "freq", which holds the popularity of the word in the field being processed
+ the document is boosted at indexing time
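
To make the structure above concrete, here is a minimal Python sketch. It is not the actual NGramSpeller code (which is a Java class built on Lucene): a plain dict stands in for a Lucene document, the names `ngrams` and `speller_doc` are hypothetical, and the "transposition" field is left out because its exact content is not described in this message.

```python
def ngrams(word, n):
    """Return all contiguous character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def speller_doc(word, freq, min_n=3, max_n=5):
    """Build one speller 'document' as a dict of field name -> value.

    Hypothetical helper: only the field names come from the message above.
    """
    doc = {"word": word}
    for n in range(min_n, max_n + 1):
        doc[f"{n}gram"] = ngrams(word, n)
    # "freq" is the only field that depends on the source field being processed.
    doc["freq"] = freq
    return doc


doc = speller_doc("lucene", freq=42)
```

In this sketch every field except "freq" is computed from the word alone, which is the redundancy discussed next.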

As we can see, from "word" to "5gram" (5 of the 6 fields) the data depends only on the word and
not on the contents of the index being processed. So for N fields, I have the same information N times,
from the "word" field to the "5gram" field, across N indexes. That is not really optimized for N fields.

--- First method ---
In fact, I would like to rename the field "freq" to "freq_nameofField". The document structure
for the field "field1" could be:
"word"
..
"5gram"
"freq_field1" ,freq for the field "field1"

so I have:
- N documents for one word (each document has a freq field for one specific field)
- but only one index. The structure of the index will be:
"word"
..
"5gram"
"freq_field1" 
"freq_field2" 
...
"freq_fieldn"

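A rough Python sketch of the first method, under the same assumptions as before: dicts stand in for Lucene documents, a list stands in for the single index, and `first_method_docs` is a hypothetical name. Each word produces one document per source field, and only the "freq_<field>" entry differs between them.

```python
def ngrams(word, n):
    """Return all contiguous character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def first_method_docs(word, freq_by_field, min_n=3, max_n=5):
    """One document per (word, source field) pair, all stored in one index."""
    docs = []
    for field, freq in freq_by_field.items():
        doc = {"word": word}
        for n in range(min_n, max_n + 1):
            doc[f"{n}gram"] = ngrams(word, n)
        # Only this entry changes from one document to the next.
        doc[f"freq_{field}"] = freq
        docs.append(doc)
    return docs


docs = first_method_docs("lucene", {"field1": 42, "field2": 3})
```

The n-gram fields are still repeated once per source field, so the redundancy moves from N indexes into N documents.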

--- Second method ---
But in the first method, 5/6 of the information in each document is redundant and not useful
(from the "word" field to the "5gram" field), so I would like to create only one document per word, with this structure:
"word"
..
"5gram"
"freq_field1" ,freq for the field "field1"
"freq_field2" ,freq for the field "field2"
"freq_field3" ,freq for the field "field3"

But the problem is the document boost: the boost value depends on the freq, and I
have several different freq values to handle, one per field.
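
The second method and its boost problem can be sketched in Python as follows. This is only an illustration: dicts stand in for Lucene documents, `second_method_doc` is a hypothetical name, and `boost_from_freq` is an invented formula, since this message does not say how the boost is derived from the frequency.

```python
import math


def ngrams(word, n):
    """Return all contiguous character n-grams of `word`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def boost_from_freq(freq):
    """Hypothetical boost formula (assumption, not taken from NGramSpeller)."""
    return 1.0 + math.log(freq)


def second_method_doc(word, freq_by_field, min_n=3, max_n=5):
    """One document per word, carrying one freq entry per source field."""
    doc = {"word": word}
    for n in range(min_n, max_n + 1):
        doc[f"{n}gram"] = ngrams(word, n)
    for field, freq in freq_by_field.items():
        doc[f"freq_{field}"] = freq
    return doc


freqs = {"field1": 42, "field2": 3}
doc = second_method_doc("lucene", freqs)

# Each source field suggests a different boost for the same document.
boosts = {field: boost_from_freq(freq) for field, freq in freqs.items()}
```

The n-gram data is stored once, but `boosts` now holds one candidate value per source field, and it is not obvious which of them the single merged document should be boosted with.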

Does anyone have an idea for avoiding redundant information in the NGramSpeller index for N fields?
