Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of oren.liat@gmail.com
 designates 209.85.219.179 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=slfilcu6Pd8eGzgqsjlWv6vF/970CrQM2araKecHjU70hpO+eqiZqrOryCXfYg45Jm
         nlKQG5Lz82UAToUIC+JaE8Ot5PY7wrxcDTMN653U/vYkX9CxcC/AsSaIxl9pigmOYPS4
         9/s1oS56miucJGk406nvcyZLZhyAX3NfVQc+k=
MIME-Version: 1.0
In-Reply-To: <74f928500904220552g5189e12ahe19581fa739447e8@mail.gmail.com>
References: <d36c28c30904190439g477f57e9s52c48ecffca5881e@mail.gmail.com>
	 <e05f0fd10904210216q3c0ed85egb097dde96a0e5ab3@mail.gmail.com>
	 <d36c28c30904210240p1271d5aahf39081570c4f7376@mail.gmail.com>
	 <e05f0fd10904210508q59bd86b1j479aac3b1be41f43@mail.gmail.com>
	 <d36c28c30904210644x2ba3c6f6mbd5c306def48078b@mail.gmail.com>
	 <74f928500904210740s76e3d63evf6b1629e979858e1@mail.gmail.com>
	 <d36c28c30904220352o57066e8egd3928a2390905f90@mail.gmail.com>
	 <74f928500904220552g5189e12ahe19581fa739447e8@mail.gmail.com>
Date: Wed, 22 Apr 2009 16:37:11 +0300
Message-ID: <d36c28c30904220637r102429c1m1cdb45924005879f@mail.gmail.com>
Subject: Re: changing term freq in indexing time
From: liat oren <oren.liat@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0015174be18a556d38046824dc4a

--0015174be18a556d38046824dc4a
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

The reason I am searching "3 3 2 1" and not "3 2 1" is the reason I asked
the question - it is important to include also these frequencies into
account when generating these scores.
Look at it as if - if a word appears more frequently in a text, is it more
important.

I managed to make the boosting work, but it seems like it has a very minor
factor within the scoring formula.

I tried this one:
   BooleanQuery bq = new BooleanQuery();
   TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
   tq.setBoost((float) 1.0);
   bq.add(tq, BooleanClause.Occur.MUST);
   tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "6"));
   tq.setBoost((float) 5);
   bq.add(tq, BooleanClause.Occur.SHOULD);

and then with boost 0.5 for "6".
The change in the score was very minor

********************************************************
**** 0.03954939 3 ***** - the score (normalized)
0.73506767 = (MATCH) sum of:
  0.035917655 = (MATCH) weight(worlds:3 in 1), product of:
    0.10084726 = queryWeight(worlds:3), product of:
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.14157619 = queryNorm
    0.35615897 = (MATCH) fieldWeight(worlds:3 in 1), product of:
      1.0 = tf(termFreq(worlds:3)=1)
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)
  0.69915 = (MATCH) weight(worlds:6^5.0 in 1), product of:
  *  0.99490196 = queryWeight(worlds:6^5.0), product of:
      5.0 = boost
*      1.4054651 = idf(docFreq=1, numDocs=3)
      0.14157619 = queryNorm
    0.70273256 = (MATCH) fieldWeight(worlds:6 in 1), product of:
      1.0 = tf(termFreq(worlds:6)=1)
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)

compares to
********************************************************
**** 0.03954939 3 ***** - the score
0.74707216 = (MATCH) sum of:
  0.25354254 = (MATCH) weight(worlds:3 in 1), product of:
    0.71188027 = queryWeight(worlds:3), product of:
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.9993856 = queryNorm
    0.35615897 = (MATCH) fieldWeight(worlds:3 in 1), product of:
      1.0 = tf(termFreq(worlds:3)=1)
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)
  0.49352962 = (MATCH) weight(worlds:6^0.5 in 1), product of:
   * 0.7023008 = queryWeight(worlds:6^0.5), product of:
      0.5 = boost
*      1.4054651 = idf(docFreq=1, numDocs=3)
      0.9993856 = queryNorm
    0.70273256 = (MATCH) fieldWeight(worlds:6 in 1), product of:
      1.0 = tf(termFreq(worlds:6)=1)
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)
Any idea why is it?
It is not possible to set the frequencies during index time (this will give
a much bigger affect)?

Thanks,
Liat
2009/4/22 Eran Sevi <eransevi@gmail.com>

> Hi,
> I'm no expert on the subject but it seems like you're searching for one
> term
> that should be "3 3 2 1" (why do you write "3" two times anyway?).
> I think you should try a regulalr boolean query where each sub-query is a
> BoostingTermQuery on one term only. These queries should be used with
> Occur.MUST if you want the word to be in all these worlds.
> Maybe you should search the archives on the proper use of Boosting*Query.
> Regarding the synonyms - it looks quite OK to me. Maybe you should try to
> use ony Occur.MUST for all TermQuery instances. A simple debugging should
> also give you some clue about what is the problem.
> Good luck, Eran.
>
> On Wed, Apr 22, 2009 at 1:52 PM, liat oren <oren.liat@gmail.com> wrote:
>
> > Thanks Eran, I tried it, adding the classes I copied below and tried to
> run
> > the following
> > code:
> >
> > [Also I have below a question about the usage of synonyms and
> > BooleanQuery.]
> >
> >  DoubleMap wordMap = new DoubleMap();
> >  wordMap.insert("1", 1, 5); // for word "1" we have the world 1, 5 times
> >  wordMap.insert("1", 2, 2);// for word "1" we have the world 2, 2 times
> >  wordMap.insert("1", 3, 7);
> >  wordMap.insert("1", 4, 1);
> >  wordMap.insert("2", 3, 1); // for word "2" we have the world 3, 1 time
> >  wordMap.insert("2", 5, 1);
> >  wordMap.insert("2", 6, 1);
> >  wordMap.insert("3", 3, 1);
> >  wordMap.insert("3", 4, 1);
> >  wordMap.insert("3", 8, 1);
> >  ioManager io = new ioManager();
> >  io.index(wordMap, "TestSearchIndex", "", "1");
> >
> >  IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> >  searcher.setSimilarity(new WordsSimilarity()); // WordsSimilarity is
> > written below
> >  Query btq = new BoostingTermQuery(new Term(WordIndex.FIELD_WORLDS, "3 3
> 2
> > 1"));
> >  Hits wordsHits = searcher.search(btq);
> >
> > From some reason the hits size is 0 and none of the methods overriden in
> > WordsSimilarity is called (I put a breakpoint and it didn;t get there
> > during
> > search time)
> >
> > public class *WordsAnalyzer* extends Analyzer
> > {
> >  public Map<String, Map<String, Integer>> wordsWorldsFreq = new
> > HashMap<String, Map<String, Integer>>();
> >  public Map<String, Integer> worldsFreq = new HashMap<String, Integer>();
> >  public WordsAnalyzer()
> >  {
> >  }
> >  public WordsAnalyzer(Map<String, Integer> worldsFreq) throws IOException
> >  {
> >  this.worldsFreq = worldsFreq;
> >  }
> >  public TokenStream tokenStream(String fieldName, Reader reader)
> >  {
> >  return new WordsFilter(new StandardTokenizer(reader), worldsFreq);
> >  }
> > }
> >
> > public class *WordsFilter* extends TokenFilter
> > {
> >  public Map<String, Integer> worldsFreq;
> >  public WordsFilter(TokenStream in, Map<String, Integer> worldsFreq)
> >  {
> >  super(in);
> >  this.worldsFreq = worldsFreq;
> >  }
> >  public final Token next(Token result) throws IOException
> >  {
> >  byte payLoad = 1;
> >  try
> >  {
> >   result = input.next(result);
> >   if(result != null)
> >   {
> >    String word = String.copyValueOf(result.termBuffer(), 0,
> > result.termLength());
> >    payLoad = Byte.parseByte(worldsFreq.get(word).toString());
> >    result.setPayload(new Payload(new byte[] { Byte.valueOf(payLoad) }));
> >    return result;
> >   }
> >   else
> >   {
> >    return null;
> >   }
> >  }
> >  catch(Exception e)
> >  {
> >   e.printStackTrace();
> >   System.out.println(result.termBuffer() + " " + payLoad);
> >   FileUtil.writeToFile("IndexProblems.txt", "WordsFilter problem for " +
> > result.termBuffer() + " " + payLoad + " : " + e.getStackTrace());
> >   return null;
> >  }
> >  }
> > }
> > *****
> > public class *WordsSimilarity* extends DefaultSimilarity
> > {
> >  public WordsSimilarity()
> >  {
> >  }
> >  public float tf(float freq)
> >  {
> >  return super.tf(freq); // just wanted to check whether it is called
> >  }
> >  public float scorePayload(byte[] payload, int offset, int length)
> >  {
> >  //  if(length == 1)
> >  //  {
> >  return payload[offset];
> >  //  }
> >  }
> > }
> >
> > **
> > *******
> > ************
> > For the synonyms with the weights, I tried the following code:
> >   BooleanQuery bq = new BooleanQuery();
> >   TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
> >   tq.setBoost((float) 1.0);
> >   bq.add(bq, BooleanClause.Occur.MUST);
> >   tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "2"));
> >   tq.setBoost((float) 0.5);
> >   bq.add(bq, BooleanClause.Occur.SHOULD);
> >   IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
> >   Hits hits1 = searcher1.search(bq);
> >
> > And got the error: any idea what is the problem?
> >
> > at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
> >  at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
> > Process exited.
> >
> > Thanks,
> > Liat
> >
> >
> > 2009/4/21 Eran Sevi <eransevi@gmail.com>
> >
> > > Hi,
> > >
> > > You might want to take a look at Payloads. If you know the frequency of
> > the
> > > words in each world in advance than during tokenization for each world
> > you
> > > could save the frequency as the payload.
> > >
> > > During searches you could use BoostingTermQuery to take the frequency
> > into
> > > account.
> > >
> > > Eran.
> > >  On Tue, Apr 21, 2009 at 4:44 PM, liat oren <oren.liat@gmail.com>
> wrote:
> > >
> > > > Hi Doron,
> > > >
> > > > Thank you very much for the elaborated answer!
> > > >
> > > > About the Synonyms, I can't use Wordnet as I have my own list of
> > > synonyms.
> > > > I
> > > > will look at contrib/memory and see what it does.
> > > >
> > > > You understood correctly the process of using the inverse doc. About
> > the
> > > > two
> > > > problems you mentioned: scalability and ignoring the vicinity of
> words
> > -
> > > > scalability - this is the reason I wanted to set the frequencies of
> the
> > > > terms. The use of the frequencies will be used at this stage, not at
> > the
> > > > stage of using the synonyms. When I use the sysnonyms, I want to use
> > the
> > > > score as you suggested below.
> > > > Here, I have for every word, in which worlds they appear. Currently
> > every
> > > > world appears once in a word. However, I would like it to appear the
> > > number
> > > > if times as the frequency of the word in the world. In order to avoid
> > > > writing the world several times in the world field, I would like to
> be
> > > able
> > > > to set the freq of the specific world accordng to the freq of the
> word
> > at
> > > > this world without actually writing it x times (for scalability and
> > index
> > > > size and performance issues)
> > > > So if dog appears 10 times in world 1 and 5 times in world 2, and cat
> > > > appears 5 times in world 1, then I want these frequencies to be taken
> > > into
> > > > account when computing how the word dog and cat are close. BUT I
> don't
> > > want
> > > > to write world 1 10 times in word dog and 5 times in word cat, but
> only
> > > > once
> > > > and to update the termVector so that the frequency will get 10 and 5
> > > > respectively.
> > > > So the *generation* of the synonyms will take into account the
> > > frequencies
> > > >
> > > > The vicinity of words - is there any better way to take it in
> account?
> > > >
> > > > About the suggestion of using term boosting that will use the score
> of
> > > the
> > > > synonyms - if I want to query "big white dogs" and I have the
> following
> > > > synonyms:
> > > > big - big (1.0), large (0.9), huge (0.6)
> > > > white - white (1.0), color (0.5) offwhite (0.8)
> > > > dog - dog (1.0)
> > > > So this is the way to do it? :
> > > >
> > > >   BooleanQuery bq = new  BooleanQuery();
> > > >          TermQuery tq = new TermQuery(new Term("text", "big"));
> > > >          tq.setBoost((float)1.0);
> > > >                   bq.add(bq, false, false);
> > > >   tq = new TermQuery(new Term(("text", "large"));
> > > >             tq.setBoost((float)0.9);
> > > >                             bq.add(bq, false);
> > > >  tq = new TermQuery(new Term(("text", "huge"));
> > > >             tq.setBoost((float)0.6);
> > > >                             bq.add(bq, false);
> > > >
> > > >  tq = new TermQuery(new Term(("text", "white"));
> > > >             tq.setBoost((float)1.0);
> > > >                             bq.add(bq, false);
> > > >  tq = new TermQuery(new Term(("text", "color"));
> > > >             tq.setBoost((float)0.5);
> > > >                             bq.add(bq, false);
> > > > // etc
> > > >   IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> > > >                   Hits hits = searcher.search(bq);
> > > >
> > > >
> > > > how the use of booleanQuery will also look at the position of the
> > words?
> > > I
> > > > remember I read about the score that takes into account also  the
> > > position
> > > > of the term, but I didn't see this factor in the score formula
> > > > Thanks again, it is very helpful,
> > > >  Liat
> > > > 2009/4/21 Doron Cohen <cdoronc@gmail.com>
> > > >
> > > > > Hi Liat, there are two packages under Lucene's contrib that deals
> > with
> > > > > Synonyms - that is contrib/memory and contrib/wordnet - which you
> > > > > may find useful. I never used these two but they seem relevant to
> > what
> > > > > you are trying to achieve.
> > > > >
> > > > > Anyhow, it seems you compute the synonyms for word w are those
> > > > > that appear in the same set of documents ('worlds') as w, and you
> > find
> > > > > this set by (a) indexing an inverse of the collection (docs become
> > > words
> > > > > and words become docs) and (b) using docs(w) as query do find
> > syns(w).
> > > > >
> > > > > I assume that your 'worlds' are small, each containing only a small
> > > > > set of a few related words, otherwise I would have two
> > > > > concerns with this approach: (a) scalability (b) in a large doc
> > (world)
> > > > > this
> > > > > approach ignores the vicinity of words which seems to me important
> > > > > to their likelihood as synonyms
> > > > >
> > > > > Assuming you are okay here, and going back to original question of
> > > > > altering the term frequency, perhaps taking the (search) scores of
> > the
> > > > > returned synonyms (which you find by search) is better than just
> > > > > using their frequency? If you find this approach valid, then at
> least
> > > for
> > > > > some queries you should be able to use queries boosts. For example
> > > > > create a BooleanQuery, add to it a TermQuery for each synonym,
> > > > > but set the boost of the TermQuery according to the synonnym score.
> > > > > This is also where you could "punish" synnonyms comparing to the
> > > > > original word. This will only help with queries with contruction
> API
> > > > > that takes (sub) queries as input (so it will not help with a
> > > > PhraseQuery).
> > > > >
> > > > > - Doron
> > > > >
> > > > > On Tue, Apr 21, 2009 at 12:40 PM, liat oren <oren.liat@gmail.com>
> > > wrote:
> > > > >
> > > > > > Ok, I will explain the full 'problem' and then explain how I
> > approach
> > > > it:
> > > > > >
> > > > > > Lets divide it into three steps:
> > > > > >
> > > > > > 1. I have a 'dictionary' of words - for every word, I have a list
> > of
> > > > > > worlds,
> > > > > > which are ids of text documents that the word appears in.
> > > > > > So, for example, for the word 'dog', I have '1 1600 36000' in the
> > > > > "worlds"
> > > > > > field (which are tokenized whin indexed) - which means that the
> > word
> > > > dog
> > > > > > appears in worlds 1, 1600 and 36000.
> > > > > >
> > > > > > 2. This index is used to choose synonyms for the word dog - using
> > the
> > > > > > "worlds" field - I do a search on this index, giving the query
> "'1
> > > 1600
> > > > > > 36000" as in input and thus get the words that are close to the
> > word
> > > > > "dog".
> > > > > > I take the 10 closest words.
> > > > > >
> > > > > > 3. These 10 synonyms are then used to expand the query.
> > > > > >
> > > > > > Basically, I have 2 problems in this process:
> > > > > >
> > > > > > a. In the process of finding the synonyms, I would like that the
> > > > > frequency
> > > > > > of the word in each of the worlds will be taken into account. so
> > that
> > > > if
> > > > > > 'dog' appeared 3 times in world 1, 10 times in world 1600 and 4
> > times
> > > > in
> > > > > > world 36000, then it will be taken into account.
> > > > > > I wanted to avoid "expanding" the field to be "1 1 1 1600 1600
> 1600
> > > > 1600
> > > > > > 1600 1600 1600 1600 1600 1600 36000 36000 36000 36000".
> Accordingly
> > I
> > > > > > wanted
> > > > > > to be able to set the freq by myself.
> > > > > >
> > > > > > b. In the process of using the synonyms, I wanted to be able to
> set
> > a
> > > > > > 'penalty' factor to the synonyms, together with giving differnt
> > > weight
> > > > to
> > > > > > differnt synonyms, according to theur score. I looked at an old
> > > thread
> > > > -
> > > > > > Search for synonyms - implemenetation for review :
> > > > > > .
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39B0FB508E5D7540ACA5AD57225E150D39203D@xmail.me.corp.entopia.com%3e
> > > > > >
> > > > > > I don;t know if its part of lucene now. I didn't quite understand
> > how
> > > > to
> > > > > > use
> > > > > > it.
> > > > > > Is there a better way to approach it?
> > > > > >
> > > > > > I hope I explained it well.
> > > > > > Thanks,
> > > > > > Liat
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2009/4/21 Doron Cohen <cdoronc@gmail.com>
> > > > > >
> > > > > > > Depending on the problem you are trying to solve there may be
> > other
> > > > > > > solutions to it, not requiring setting wrong (?) values for
> term
> > > > > > > frequencies.
> > > > > > > If you can explain what you are trying to solve, people on the
> > list
> > > > may
> > > > > > > be able to suggest such alternatives.
> > > > > > > - Doron
> > > > > > >
> > > > > > > On Sun, Apr 19, 2009 at 2:39 PM, liat oren <
> oren.liat@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I would like to be able to set the term freq to differnt
> values
> > > at
> > > > > > index
> > > > > > > > time, or at search time.
> > > > > > > >
> > > > > > > > So if a document has the following text: 1 2, the freq of 1
> > will
> > > > get
> > > > > > 100
> > > > > > > > and
> > > > > > > > the freq of 2 will get 200. I want to avoid expanding it by
> > > writing
> > > > 1
> > > > > > 100
> > > > > > > > times.
> > > > > > > >
> > > > > > > > I looked at Similarity class and wanted to override it, but
> the
> > > tf
> > > > > > > function
> > > > > > > > gets only freq, so I don't know for which term this freq
> > relates
> > > > to,
> > > > > > thus
> > > > > > > I
> > > > > > > > can't change the value.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Liat
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

--0015174be18a556d38046824dc4a--