lucene-java-user mailing list archives

From liat oren <oren.l...@gmail.com>
Subject Re: changing term freq in indexing time
Date Tue, 21 Apr 2009 09:40:11 GMT
Ok, I will explain the full 'problem' and then explain how I approach it:

Let's divide it into three steps:

1. I have a 'dictionary' of words - for every word, I have a list of worlds,
which are the ids of the text documents that the word appears in.
So, for example, for the word 'dog', I have '1 1600 36000' in the "worlds"
field (which is tokenized when indexed) - which means that the word dog
appears in worlds 1, 1600 and 36000.

2. This index is used to choose synonyms for the word dog - using the
"worlds" field - I do a search on this index, giving the query "1 1600
36000" as input, and thus get the words that are close to the word "dog".
I take the 10 closest words.

3. These 10 synonyms are then used to expand the query.
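
The three steps above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: the class WorldOverlap, its method names, and the in-memory map that stands in for the "worlds" index are all hypothetical, and "close" is taken to mean "shares the most world ids", which is simpler than Lucene's actual scoring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the "worlds" index; not Lucene code. */
final class WorldOverlap {

    /** Counts the world ids that two words have in common. */
    static int sharedWorlds(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /**
     * Returns up to {@code limit} other words that share at least one world
     * with {@code word}, ordered by shared-world count (ties alphabetical).
     */
    static List<String> closestWords(Map<String, Set<Integer>> dictionary,
                                     String word, int limit) {
        Set<Integer> target = dictionary.get(word);
        List<String> candidates = new ArrayList<>();
        if (target == null) {
            return candidates; // unknown word: nothing to expand with
        }
        for (Map.Entry<String, Set<Integer>> e : dictionary.entrySet()) {
            if (!e.getKey().equals(word) && sharedWorlds(target, e.getValue()) > 0) {
                candidates.add(e.getKey());
            }
        }
        candidates.sort(Comparator
                .comparingInt((String w) -> sharedWorlds(target, dictionary.get(w)))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return candidates.subList(0, Math.min(limit, candidates.size()));
    }
}
```

Calling closestWords(dictionary, "dog", 10) corresponds to step 2, and the returned list is what step 3 would use to expand the query.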

Basically, I have 2 problems in this process:

a. In the process of finding the synonyms, I would like the frequency
of the word in each of the worlds to be taken into account, so that if
'dog' appeared 3 times in world 1, 10 times in world 1600 and 4 times in
world 36000, the scoring reflects those counts.
I wanted to avoid "expanding" the field to be "1 1 1 1600 1600 1600 1600
1600 1600 1600 1600 1600 1600 36000 36000 36000 36000". Accordingly, I wanted
to be able to set the freq myself.

b. In the process of using the synonyms, I wanted to be able to set a
'penalty' factor on the synonyms, together with giving different weights to
different synonyms, according to their score. I looked at an old thread -
Search for synonyms - implemenetation for review:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39B0FB508E5D7540ACA5AD57225E150D39203D@xmail.me.corp.entopia.com%3e

I don't know if it's part of Lucene now. I didn't quite understand how to use
it.
Is there a better way to approach it?
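
To make problems (a) and (b) concrete, here is a minimal plain-Java sketch, not a Lucene implementation. The class WeightedExpansion and its methods are hypothetical. For (a) it assumes the per-world frequencies are kept as a map from world id to count and compares two words by the dot product of those counts. For (b) it assumes the expanded query is handed to Lucene's query parser as text, where term^boost raises or lowers a term's weight, and that synonym scores are already normalized to the range 0..1.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helpers for frequency-aware matching and penalized synonym boosts. */
final class WeightedExpansion {

    /**
     * Problem (a): dot product of two world-id -> frequency maps, so a shared
     * world counts in proportion to how often each word occurs in it.
     */
    static long weightedOverlap(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        long sum = 0;
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                sum += (long) e.getValue() * other;
            }
        }
        return sum;
    }

    /**
     * Problem (b): builds query text such as "dog puppy^0.40", where each
     * synonym's boost is its score multiplied by a global penalty.
     * Assumption: the terms contain no query-syntax characters that need escaping.
     */
    static String expandQuery(String word, Map<String, Double> synonymScores,
                              double penalty) {
        StringBuilder query = new StringBuilder(word);
        for (Map.Entry<String, Double> e : synonymScores.entrySet()) {
            double boost = penalty * e.getValue();
            query.append(' ').append(e.getKey())
                 .append('^').append(String.format(Locale.ROOT, "%.2f", boost));
        }
        return query.toString();
    }
}
```

With the counts from the 'dog' example, weightedOverlap rewards a candidate word that is frequent in world 1600 more than one that is frequent only in world 1, without repeating tokens in the field. Passing an insertion-ordered map (for example a LinkedHashMap) to expandQuery keeps the synonyms in score order in the generated text.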

I hope I explained it well.
Thanks,
Liat



2009/4/21 Doron Cohen <cdoronc@gmail.com>

> Depending on the problem you are trying to solve there may be other
> solutions to it, not requiring setting wrong (?) values for term
> frequencies.
> If you can explain what you are trying to solve, people on the list may
> be able to suggest such alternatives.
> - Doron
>
> On Sun, Apr 19, 2009 at 2:39 PM, liat oren <oren.liat@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to be able to set the term freq to different values at index
> > time, or at search time.
> >
> > So if a document has the following text: 1 2, the freq of 1 will get 100
> > and the freq of 2 will get 200. I want to avoid expanding it by writing
> > 1 100 times.
> >
> > I looked at the Similarity class and wanted to override it, but the tf
> > function gets only freq, so I don't know which term this freq relates
> > to, thus I can't change the value.
> >
> > Thanks,
> > Liat
> >
>
