lucene-java-user mailing list archives

From Doron Cohen <cdor...@gmail.com>
Subject Re: changing term freq in indexing time
Date Tue, 21 Apr 2009 12:08:46 GMT
Hi Liat, there are two packages under Lucene's contrib that deal with
synonyms - contrib/memory and contrib/wordnet - which you
may find useful. I have never used either, but they seem relevant to what
you are trying to achieve.

Anyhow, it seems you take the synonyms of a word w to be the words
that appear in the same set of documents ('worlds') as w, and you find
this set by (a) indexing an inverse of the collection (docs become words
and words become docs) and (b) using docs(w) as a query to find syns(w).
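
To make steps (a) and (b) concrete, here is a minimal sketch in plain Java, with no Lucene involved. It assumes the inverted collection fits in a map from each word to its set of world ids, and it ranks candidates by the raw number of shared worlds. That ranking is a simplification: a real Lucene search over the "worlds" field would also factor in idf and length normalization. The class and method names are hypothetical.

```java
import java.util.*;

final class WorldOverlapSynonyms {
    /**
     * Ranks every other word by how many worlds it shares with the given word.
     * worldsByWord plays the role of the inverted collection: word -> world ids.
     */
    static List<String> candidates(Map<String, Set<Integer>> worldsByWord,
                                   String word, int limit) {
        Set<Integer> target = worldsByWord.getOrDefault(word, Collections.<Integer>emptySet());
        final Map<String, Integer> overlap = new HashMap<>();
        for (Map.Entry<String, Set<Integer>> e : worldsByWord.entrySet()) {
            if (e.getKey().equals(word)) {
                continue; // a word is not its own synonym
            }
            int shared = 0;
            for (Integer world : e.getValue()) {
                if (target.contains(world)) {
                    shared++;
                }
            }
            if (shared > 0) {
                overlap.put(e.getKey(), shared);
            }
        }
        List<String> ranked = new ArrayList<>(overlap.keySet());
        // Most shared worlds first; ties broken alphabetically for stable output.
        ranked.sort((a, b) -> {
            int byOverlap = Integer.compare(overlap.get(b), overlap.get(a));
            return byOverlap != 0 ? byOverlap : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

Calling candidates(index, "dog", 10) corresponds to "take the 10 closest words" in the quoted description below.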

I assume that your 'worlds' are small, each containing only a
few related words; otherwise I would have two
concerns with this approach: (a) scalability, and (b) in a large doc (world) this
approach ignores the vicinity of words, which seems to me important
to their likelihood of being synonyms.

Assuming you are okay here, and going back to the original question of
altering the term frequency, perhaps taking the (search) scores of the
returned synonyms (which you find by search) is better than just
using their frequency? If you find this approach valid, then at least for
some queries you should be able to use query boosts. For example,
create a BooleanQuery, add to it a TermQuery for each synonym,
and set the boost of each TermQuery according to the synonym's score.
This is also where you could "punish" synonyms relative to the
original word. This will only help with queries whose construction API
takes (sub)queries as input (so it will not help with a PhraseQuery).
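
As a rough illustration of the boosting idea, the sketch below builds a query string in the classic query parser syntax (term^boost) instead of calling the Lucene API directly, so it runs without Lucene on the classpath. Assumptions that are mine rather than from this thread: synonym scores are normalized by the highest score, a single penalty factor below 1 keeps every synonym under the original word's implicit boost of 1.0, the field name is supplied by the caller, and terms are plain words that need no escaping. The class name is hypothetical.

```java
import java.util.*;

final class BoostedSynonymQuery {
    /**
     * Builds "field:original field:syn1^b1 field:syn2^b2 ...", where each boost is
     * penalty * score / maxScore. The original word keeps the default boost of 1.0.
     */
    static String build(String field, String original,
                        Map<String, Float> synonymScores, float penalty) {
        float max = 0f;
        for (float score : synonymScores.values()) {
            max = Math.max(max, score);
        }
        StringBuilder query = new StringBuilder(field).append(':').append(original);
        for (Map.Entry<String, Float> e : synonymScores.entrySet()) {
            float boost = max > 0f ? penalty * e.getValue() / max : 0f;
            if (boost <= 0f) {
                continue; // a zero boost contributes nothing, so skip the term
            }
            query.append(' ').append(field).append(':').append(e.getKey())
                 .append('^').append(String.format(Locale.ROOT, "%.2f", boost));
        }
        return query.toString();
    }
}
```

With the default OR operator, parsing such a string should give the same shape as described above: a BooleanQuery of optional TermQuery clauses, each carrying its own boost. Passing a LinkedHashMap keeps the synonyms in the order they were ranked.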

- Doron

On Tue, Apr 21, 2009 at 12:40 PM, liat oren <oren.liat@gmail.com> wrote:

> Ok, I will explain the full 'problem' and then explain how I approach it:
>
> Let's divide it into three steps:
>
> 1. I have a 'dictionary' of words - for every word, I have a list of
> worlds, which are ids of text documents that the word appears in.
> So, for example, for the word 'dog', I have '1 1600 36000' in the "worlds"
> field (which is tokenized when indexed) - which means that the word dog
> appears in worlds 1, 1600 and 36000.
>
> 2. This index is used to choose synonyms for the word dog - using the
> "worlds" field - I do a search on this index, giving the query "1 1600
> 36000" as input, and thus get the words that are close to the word "dog".
> I take the 10 closest words.
>
> 3. These 10 synonyms are then used to expand the query.
>
> Basically, I have 2 problems in this process:
>
> a. In the process of finding the synonyms, I would like the frequency
> of the word in each of the worlds to be taken into account, so that if
> 'dog' appeared 3 times in world 1, 10 times in world 1600 and 4 times in
> world 36000, these counts would affect the result.
> I wanted to avoid "expanding" the field to be "1 1 1 1600 1600 1600 1600
> 1600 1600 1600 1600 1600 1600 36000 36000 36000 36000". Accordingly I
> wanted to be able to set the freq by myself.
>
> b. In the process of using the synonyms, I wanted to be able to set a
> 'penalty' factor for the synonyms, together with giving different weights to
> different synonyms, according to their score. I looked at an old thread -
> Search for synonyms - implemenetation for review:
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39B0FB508E5D7540ACA5AD57225E150D39203D@xmail.me.corp.entopia.com%3e
>
> I don't know if it's part of Lucene now. I didn't quite understand how to
> use it.
> Is there a better way to approach it?
>
> I hope I explained it well.
> Thanks,
> Liat
>
>
>
> 2009/4/21 Doron Cohen <cdoronc@gmail.com>
>
> > Depending on the problem you are trying to solve there may be other
> > solutions to it, not requiring setting wrong (?) values for term
> > frequencies.
> > If you can explain what you are trying to solve, people on the list may
> > be able to suggest such alternatives.
> > - Doron
> >
> > On Sun, Apr 19, 2009 at 2:39 PM, liat oren <oren.liat@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I would like to be able to set the term freq to different values at
> > > index time, or at search time.
> > >
> > > So if a document has the following text: 1 2, the freq of 1 will get
> > > 100 and the freq of 2 will get 200. I want to avoid expanding it by
> > > writing 1 100 times.
> > >
> > > I looked at the Similarity class and wanted to override it, but the tf
> > > function gets only freq, so I don't know which term this freq relates to,
> > > thus I can't change the value.
> > >
> > > Thanks,
> > > Liat
> > >
> >
>
