lucene-java-user mailing list archives

From Eran Sevi <erans...@gmail.com>
Subject Re: changing term freq in indexing time
Date Wed, 22 Apr 2009 12:52:31 GMT
Hi,
I'm no expert on the subject, but it seems like you're searching for a single
term whose text is "3 3 2 1" (why do you write "3" twice anyway?).
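To illustrate, here is a toy index in plain Java. It is not Lucene code: ToyIndex and its methods are made-up names, and whitespace splitting stands in for the real tokenizer. Once the field is tokenized, each world id is its own term, so an exact lookup of the whole string "3 3 2 1" finds nothing, while a lookup of "3" does.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (NOT Lucene): maps each whitespace-separated token
// to the ids of the documents that contain it.
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split the field text on whitespace and record every token for this document.
    void add(String docId, String fieldText) {
        for (String token : fieldText.trim().split("\\s+")) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact term lookup: the query text is not tokenized, as with a plain term query.
    Set<String> lookup(String term) {
        return postings.getOrDefault(term, Collections.<String>emptySet());
    }

    // Hypothetical sample mirroring the quoted test data:
    // one document per word, whose tokens are the world ids it appears in.
    static ToyIndex sample() {
        ToyIndex index = new ToyIndex();
        index.add("1", "1 2 3 4");
        index.add("2", "3 5 6");
        index.add("3", "3 4 8");
        return index;
    }

    public static void main(String[] args) {
        ToyIndex index = sample();
        System.out.println(index.lookup("3 3 2 1")); // prints []
        System.out.println(index.lookup("3"));       // prints [1, 2, 3]
    }
}
```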
I think you should try a regular boolean query where each sub-query is a
BoostingTermQuery on one term only. These queries should be added with
Occur.MUST if you want the word to be in all these worlds.
Maybe you should search the archives on the proper use of Boosting*Query.
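As a rough sketch of that idea, again in plain Java rather than Lucene's API: every query term is required, and a matching document collects the stored frequency of each term. ToyPayloadIndex, searchAll and the summed-weight score are assumptions made for illustration. The real score of a BoostingTermQuery also involves the usual tf/idf factors, so the numbers here only show the mechanism.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (NOT Lucene): term -> (document id -> stored weight, standing in for a payload).
class ToyPayloadIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    void add(String docId, String term, int weight) {
        postings.computeIfAbsent(term, k -> new HashMap<>()).put(docId, weight);
    }

    private Map<String, Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.<String, Integer>emptyMap());
    }

    // Every term is required (the Occur.MUST case); a match scores the sum of its weights.
    Map<String, Integer> searchAll(List<String> terms) {
        Map<String, Integer> scores = new TreeMap<>();
        if (terms.isEmpty()) {
            return scores;
        }
        // Only documents containing the first term can match all terms.
        for (String docId : docsFor(terms.get(0)).keySet()) {
            int score = 0;
            boolean matchesAll = true;
            for (String term : terms) {
                Integer weight = docsFor(term).get(docId);
                if (weight == null) {
                    matchesAll = false;
                    break;
                }
                score += weight;
            }
            if (matchesAll) {
                scores.put(docId, score);
            }
        }
        return scores;
    }

    // Hypothetical sample mirroring the quoted wordMap.insert(word, world, freq) calls:
    // the document is the word, the term is the world, the weight is the frequency.
    static ToyPayloadIndex sample() {
        ToyPayloadIndex index = new ToyPayloadIndex();
        index.add("1", "1", 5);
        index.add("1", "2", 2);
        index.add("1", "3", 7);
        index.add("1", "4", 1);
        index.add("2", "3", 1);
        index.add("2", "5", 1);
        index.add("2", "6", 1);
        index.add("3", "3", 1);
        index.add("3", "4", 1);
        index.add("3", "8", 1);
        return index;
    }

    public static void main(String[] args) {
        // Only word "1" appears in worlds 3, 2 and 1: 7 + 2 + 5 = 14.
        System.out.println(sample().searchAll(Arrays.asList("3", "2", "1"))); // prints {1=14}
    }
}
```

With Occur.SHOULD instead, documents matching only some of the terms would also be returned, with lower scores.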
Regarding the synonyms - it looks quite OK to me. Maybe you should try to
use only Occur.MUST for all TermQuery instances. Some simple debugging should
also give you a clue about what the problem is.
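On the error itself, one thing that stands out in the quoted snippet is that it calls bq.add(bq, ...), so the BooleanQuery is added to itself instead of tq. This has not been verified against Lucene, but a toy composite query in plain Java (ToyQuery is a made-up stand-in, not Lucene's BooleanQuery) shows how a query that contains itself recurses in rewrite until the stack overflows. That would be consistent with the repeated BooleanQuery.rewrite frames in the trace.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a composite query (NOT Lucene's BooleanQuery).
class ToyQuery {
    private final String term;                              // non-null only for a leaf query
    private final List<ToyQuery> clauses = new ArrayList<>();

    ToyQuery(String term) {
        this.term = term;
    }

    ToyQuery() {
        this(null);
    }

    void add(ToyQuery clause) {
        clauses.add(clause);
    }

    // Rewrites by recursing into every clause, with no check for cycles.
    String rewrite() {
        if (term != null) {
            return term;
        }
        StringBuilder sb = new StringBuilder("(");
        for (ToyQuery clause : clauses) {
            if (sb.length() > 1) {
                sb.append(' ');
            }
            sb.append(clause.rewrite());
        }
        return sb.append(')').toString();
    }

    // Returns true if rewriting this query exhausts the stack.
    static boolean overflows(ToyQuery query) {
        try {
            query.rewrite();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToyQuery bad = new ToyQuery();
        bad.add(bad);                        // the query is added to itself
        System.out.println(overflows(bad));  // prints true

        ToyQuery good = new ToyQuery();
        good.add(new ToyQuery("3"));         // the leaf queries are added instead
        good.add(new ToyQuery("2"));
        System.out.println(good.rewrite());  // prints (3 2)
    }
}
```

If that is the cause, passing tq to both bq.add(...) calls should make the error go away.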
Good luck, Eran.

On Wed, Apr 22, 2009 at 1:52 PM, liat oren <oren.liat@gmail.com> wrote:

> Thanks Eran, I tried it, adding the classes I copied below and tried to run
> the following
> code:
>
> [Also I have below a question about the usage of synonyms and
> BooleanQuery.]
>
>  DoubleMap wordMap = new DoubleMap();
>  wordMap.insert("1", 1, 5); // for word "1" we have the world 1, 5 times
>  wordMap.insert("1", 2, 2);// for word "1" we have the world 2, 2 times
>  wordMap.insert("1", 3, 7);
>  wordMap.insert("1", 4, 1);
>  wordMap.insert("2", 3, 1); // for word "2" we have the world 3, 1 time
>  wordMap.insert("2", 5, 1);
>  wordMap.insert("2", 6, 1);
>  wordMap.insert("3", 3, 1);
>  wordMap.insert("3", 4, 1);
>  wordMap.insert("3", 8, 1);
>  ioManager io = new ioManager();
>  io.index(wordMap, "TestSearchIndex", "", "1");
>
>  IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
>  searcher.setSimilarity(new WordsSimilarity()); // WordsSimilarity is written below
>  Query btq = new BoostingTermQuery(new Term(WordIndex.FIELD_WORLDS, "3 3 2 1"));
>  Hits wordsHits = searcher.search(btq);
>
> For some reason the hits size is 0 and none of the methods overridden in
> WordsSimilarity is called (I put a breakpoint and it didn't get there during
> search time).
>
> public class WordsAnalyzer extends Analyzer
> {
>  public Map<String, Map<String, Integer>> wordsWorldsFreq = new HashMap<String, Map<String, Integer>>();
>  public Map<String, Integer> worldsFreq = new HashMap<String, Integer>();
>  public WordsAnalyzer()
>  {
>  }
>  public WordsAnalyzer(Map<String, Integer> worldsFreq) throws IOException
>  {
>  this.worldsFreq = worldsFreq;
>  }
>  public TokenStream tokenStream(String fieldName, Reader reader)
>  {
>  return new WordsFilter(new StandardTokenizer(reader), worldsFreq);
>  }
> }
>
> public class WordsFilter extends TokenFilter
> {
>  public Map<String, Integer> worldsFreq;
>  public WordsFilter(TokenStream in, Map<String, Integer> worldsFreq)
>  {
>  super(in);
>  this.worldsFreq = worldsFreq;
>  }
>  public final Token next(Token result) throws IOException
>  {
>  byte payLoad = 1; // default payload when no frequency is known for the term
>  result = input.next(result);
>  if(result == null)
>  {
>   return null; // end of stream
>  }
>  String word = new String(result.termBuffer(), 0, result.termLength());
>  try
>  {
>   Integer freq = worldsFreq.get(word);
>   if(freq != null)
>   {
>    // Byte.parseByte only accepts -128..127; larger frequencies end up in the catch block
>    payLoad = Byte.parseByte(freq.toString());
>   }
>   result.setPayload(new Payload(new byte[] { payLoad }));
>  }
>  catch(Exception e)
>  {
>   // Log the term text rather than the char[] reference, and keep the token
>   // so that one bad frequency does not end the whole stream
>   e.printStackTrace();
>   System.out.println(word + " " + payLoad);
>   FileUtil.writeToFile("IndexProblems.txt", "WordsFilter problem for " + word + " " + payLoad + " : " + e);
>  }
>  return result;
>  }
> }
> *****
> public class WordsSimilarity extends DefaultSimilarity
> {
>  public WordsSimilarity()
>  {
>  }
>  public float tf(float freq)
>  {
>  return super.tf(freq); // just wanted to check whether it is called
>  }
>  public float scorePayload(byte[] payload, int offset, int length)
>  {
>  //  if(length == 1)
>  //  {
>  return payload[offset];
>  //  }
>  }
> }
>
> *****
> For the synonyms with the weights, I tried the following code:
>   BooleanQuery bq = new BooleanQuery();
>   TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
>   tq.setBoost((float) 1.0);
>   bq.add(bq, BooleanClause.Occur.MUST);
>   tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "2"));
>   tq.setBoost((float) 0.5);
>   bq.add(bq, BooleanClause.Occur.SHOULD);
>   IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
>   Hits hits1 = searcher1.search(bq);
>
> And got the error: any idea what is the problem?
>
> at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
>  at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
> Process exited.
>
> Thanks,
> Liat
>
>
> 2009/4/21 Eran Sevi <eransevi@gmail.com>
>
> > Hi,
> >
> > You might want to take a look at Payloads. If you know the frequency of the
> > words in each world in advance, then during tokenization for each world you
> > could save the frequency as the payload.
> >
> > During searches you could use BoostingTermQuery to take the frequency into
> > account.
> >
> > Eran.
> >  On Tue, Apr 21, 2009 at 4:44 PM, liat oren <oren.liat@gmail.com> wrote:
> >
> > > Hi Doron,
> > >
> > > Thank you very much for the elaborated answer!
> > >
> > > About the synonyms, I can't use WordNet as I have my own list of synonyms.
> > > I will look at contrib/memory and see what it does.
> > >
> > > You understood the process of using the inverse doc correctly. About the
> > > two problems you mentioned, scalability and ignoring the vicinity of words:
> > >
> > > Scalability - this is the reason I wanted to set the frequencies of the
> > > terms. The frequencies will be used at this stage, not at the stage of
> > > using the synonyms. When I use the synonyms, I want to use the score as
> > > you suggested below.
> > > Here, I have, for every word, the worlds in which it appears. Currently
> > > every world appears once in a word. However, I would like it to appear as
> > > many times as the frequency of the word in the world. In order to avoid
> > > writing the world several times in the world field, I would like to be
> > > able to set the freq of the specific world according to the freq of the
> > > word in this world without actually writing it x times (for scalability,
> > > index size and performance reasons).
> > > So if dog appears 10 times in world 1 and 5 times in world 2, and cat
> > > appears 5 times in world 1, then I want these frequencies to be taken into
> > > account when computing how close the words dog and cat are. BUT I don't
> > > want to write world 1 10 times in word dog and 5 times in word cat, but
> > > only once, and to update the termVector so that the frequency will be 10
> > > and 5 respectively.
> > > So the *generation* of the synonyms will take the frequencies into account.
> > >
> > > The vicinity of words - is there any better way to take it into account?
> > >
> > > About the suggestion of using term boosting that will use the score of the
> > > synonyms - if I want to query "big white dogs" and I have the following
> > > synonyms:
> > > big - big (1.0), large (0.9), huge (0.6)
> > > white - white (1.0), color (0.5), offwhite (0.8)
> > > dog - dog (1.0)
> > > So is this the way to do it?
> > >
> > >   BooleanQuery bq = new BooleanQuery();
> > >   // "big" and its synonyms, each boosted by its synonym score
> > >   TermQuery tq = new TermQuery(new Term("text", "big"));
> > >   tq.setBoost(1.0f);
> > >   bq.add(tq, BooleanClause.Occur.SHOULD); // add tq, not bq itself
> > >   tq = new TermQuery(new Term("text", "large"));
> > >   tq.setBoost(0.9f);
> > >   bq.add(tq, BooleanClause.Occur.SHOULD);
> > >   tq = new TermQuery(new Term("text", "huge"));
> > >   tq.setBoost(0.6f);
> > >   bq.add(tq, BooleanClause.Occur.SHOULD);
> > >
> > >   // "white" and its synonyms
> > >   tq = new TermQuery(new Term("text", "white"));
> > >   tq.setBoost(1.0f);
> > >   bq.add(tq, BooleanClause.Occur.SHOULD);
> > >   tq = new TermQuery(new Term("text", "color"));
> > >   tq.setBoost(0.5f);
> > >   bq.add(tq, BooleanClause.Occur.SHOULD);
> > >   // etc
> > >   IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> > >   Hits hits = searcher.search(bq);
> > >
> > >
> > > How will the use of BooleanQuery also take the position of the words into
> > > account? I remember reading about a score that also takes into account the
> > > position of the term, but I didn't see this factor in the score formula.
> > > Thanks again, it is very helpful,
> > >  Liat
> > > 2009/4/21 Doron Cohen <cdoronc@gmail.com>
> > >
> > > > Hi Liat, there are two packages under Lucene's contrib that deal with
> > > > synonyms - contrib/memory and contrib/wordnet - which you
> > > > may find useful. I never used these two but they seem relevant to what
> > > > you are trying to achieve.
> > > >
> > > > Anyhow, it seems you compute the synonyms for word w as those words
> > > > that appear in the same set of documents ('worlds') as w, and you find
> > > > this set by (a) indexing an inverse of the collection (docs become words
> > > > and words become docs) and (b) using docs(w) as a query to find syns(w).
> > > >
> > > > I assume that your 'worlds' are small, each containing only a small
> > > > set of a few related words, otherwise I would have two
> > > > concerns with this approach: (a) scalability, and (b) in a large doc
> > > > (world) this approach ignores the vicinity of words, which seems to me
> > > > important to their likelihood as synonyms.
> > > >
> > > > Assuming you are okay here, and going back to the original question of
> > > > altering the term frequency, perhaps taking the (search) scores of the
> > > > returned synonyms (which you find by search) is better than just
> > > > using their frequency? If you find this approach valid, then at least for
> > > > some queries you should be able to use query boosts. For example,
> > > > create a BooleanQuery, add to it a TermQuery for each synonym,
> > > > but set the boost of the TermQuery according to the synonym score.
> > > > This is also where you could "punish" synonyms compared to the
> > > > original word. This will only help with queries whose construction API
> > > > takes (sub) queries as input (so it will not help with a PhraseQuery).
> > > >
> > > > - Doron
> > > >
> > > > On Tue, Apr 21, 2009 at 12:40 PM, liat oren <oren.liat@gmail.com> wrote:
> > > >
> > > > > Ok, I will explain the full 'problem' and then explain how I approach it:
> > > > >
> > > > > Let's divide it into three steps:
> > > > >
> > > > > 1. I have a 'dictionary' of words - for every word, I have a list of
> > > > > worlds, which are ids of text documents that the word appears in.
> > > > > So, for example, for the word 'dog', I have '1 1600 36000' in the
> > > > > "worlds" field (which is tokenized when indexed) - which means that the
> > > > > word dog appears in worlds 1, 1600 and 36000.
> > > > >
> > > > > 2. This index is used to choose synonyms for the word dog - using the
> > > > > "worlds" field - I do a search on this index, giving the query "1 1600
> > > > > 36000" as input and thus get the words that are close to the word "dog".
> > > > > I take the 10 closest words.
> > > > >
> > > > > 3. These 10 synonyms are then used to expand the query.
> > > > >
> > > > > Basically, I have 2 problems in this process:
> > > > >
> > > > > a. In the process of finding the synonyms, I would like the frequency
> > > > > of the word in each of the worlds to be taken into account, so that if
> > > > > 'dog' appeared 3 times in world 1, 10 times in world 1600 and 4 times in
> > > > > world 36000, then it will be taken into account.
> > > > > I wanted to avoid "expanding" the field to be "1 1 1 1600 1600 1600 1600
> > > > > 1600 1600 1600 1600 1600 1600 36000 36000 36000 36000". Accordingly, I
> > > > > wanted to be able to set the freq myself.
> > > > >
> > > > > b. In the process of using the synonyms, I wanted to be able to set a
> > > > > 'penalty' factor for the synonyms, together with giving different weights
> > > > > to different synonyms, according to their score. I looked at an old
> > > > > thread - Search for synonyms - implementation for review:
> > > > >
> > > > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39B0FB508E5D7540ACA5AD57225E150D39203D@xmail.me.corp.entopia.com%3e
> > > > >
> > > > > I don't know if it's part of Lucene now. I didn't quite understand how to
> > > > > use it.
> > > > > Is there a better way to approach it?
> > > > >
> > > > > I hope I explained it well.
> > > > > Thanks,
> > > > > Liat
> > > > >
> > > > >
> > > > >
> > > > > 2009/4/21 Doron Cohen <cdoronc@gmail.com>
> > > > >
> > > > > > Depending on the problem you are trying to solve there may be other
> > > > > > solutions to it, not requiring setting wrong (?) values for term
> > > > > > frequencies.
> > > > > > If you can explain what you are trying to solve, people on the list may
> > > > > > be able to suggest such alternatives.
> > > > > > - Doron
> > > > > >
> > > > > > On Sun, Apr 19, 2009 at 2:39 PM, liat oren <oren.liat@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to be able to set the term freq to different values at
> > > > > > > index time, or at search time.
> > > > > > >
> > > > > > > So if a document has the following text: 1 2, the freq of 1 will get
> > > > > > > 100 and the freq of 2 will get 200. I want to avoid expanding it by
> > > > > > > writing 1 100 times.
> > > > > > >
> > > > > > > I looked at the Similarity class and wanted to override it, but the tf
> > > > > > > function gets only freq, so I don't know which term this freq relates
> > > > > > > to, and thus I can't change the value.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Liat
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
