From: liat oren
To: java-user@lucene.apache.org
Subject: Re: changing term freq in indexing time
Date: Wed, 22 Apr 2009 16:37:11 +0300
In-Reply-To: <74f928500904220552g5189e12ahe19581fa739447e8@mail.gmail.com>
References: <74f928500904210740s76e3d63evf6b1629e979858e1@mail.gmail.com> <74f928500904220552g5189e12ahe19581fa739447e8@mail.gmail.com>
Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0015174be18a556d38046824dc4a

--0015174be18a556d38046824dc4a
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

The reason I am searching "3 3 2 1" and not "3 2 1" is the reason I asked
the question: it is important to take these frequencies into account as well
when generating the scores. Look at it this way: if a word appears more
frequently in a text, it is more important.

I managed to make the boosting work, but it seems to have only a very minor
effect within the scoring formula.

I tried this one:

BooleanQuery bq = new BooleanQuery();
TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
tq.setBoost((float) 1.0);
bq.add(tq, BooleanClause.Occur.MUST);
tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "6"));
tq.setBoost((float) 5);
bq.add(tq, BooleanClause.Occur.SHOULD);

and then with boost 0.5 for "6".
The change in the score was very minor:

********************************************************
**** 0.03954939 3 ***** - the score (normalized)
0.73506767 = (MATCH) sum of:
  0.035917655 = (MATCH) weight(worlds:3 in 1), product of:
    0.10084726 = queryWeight(worlds:3), product of:
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.14157619 = queryNorm
    0.35615897 = (MATCH) fieldWeight(worlds:3 in 1), product of:
      1.0 = tf(termFreq(worlds:3)=1)
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)
  0.69915 = (MATCH) weight(worlds:6^5.0 in 1), product of:
    0.99490196 = queryWeight(worlds:6^5.0), product of:
      5.0 = boost
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.14157619 = queryNorm
    0.70273256 = (MATCH) fieldWeight(worlds:6 in 1), product of:
      1.0 = tf(termFreq(worlds:6)=1)
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)

compared to

********************************************************
**** 0.03954939 3 ***** - the score
0.74707216 = (MATCH) sum of:
  0.25354254 = (MATCH) weight(worlds:3 in 1), product of:
    0.71188027 = queryWeight(worlds:3), product of:
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.9993856 = queryNorm
    0.35615897 = (MATCH) fieldWeight(worlds:3 in 1), product of:
      1.0 = tf(termFreq(worlds:3)=1)
      0.71231794 = idf(docFreq=3, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)
  0.49352962 = (MATCH) weight(worlds:6^0.5 in 1), product of:
    0.7023008 = queryWeight(worlds:6^0.5), product of:
      0.5 = boost
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.9993856 = queryNorm
    0.70273256 = (MATCH) fieldWeight(worlds:6 in 1), product of:
      1.0 = tf(termFreq(worlds:6)=1)
      1.4054651 = idf(docFreq=1, numDocs=3)
      0.5 = fieldNorm(field=worlds, doc=1)

Any idea why that is? Is it not possible to set the frequencies at index
time (which would have a much bigger effect)?

Thanks,
Liat

2009/4/22 Eran Sevi

> Hi,
> I'm no expert on the subject but it seems like you're searching for one term
> that should be "3 3 2 1" (why do you write "3" two times anyway?).
> I think you should try a regular boolean query where each sub-query is a
> BoostingTermQuery on one term only. These queries should be used with
> Occur.MUST if you want the word to be in all these worlds.
> Maybe you should search the archives on the proper use of Boosting*Query.
> Regarding the synonyms - it looks quite OK to me. Maybe you should try to
> use only Occur.MUST for all TermQuery instances. Some simple debugging should
> also give you some clue about what the problem is.
> Good luck, Eran.
>
> On Wed, Apr 22, 2009 at 1:52 PM, liat oren wrote:
>
> > Thanks Eran, I tried it, adding the classes I copied below, and tried to
> > run the following code:
> >
> > [Also I have below a question about the usage of synonyms and
> > BooleanQuery.]
> >
> > DoubleMap wordMap = new DoubleMap();
> > wordMap.insert("1", 1, 5); // for word "1" we have the world 1, 5 times
> > wordMap.insert("1", 2, 2); // for word "1" we have the world 2, 2 times
> > wordMap.insert("1", 3, 7);
> > wordMap.insert("1", 4, 1);
> > wordMap.insert("2", 3, 1); // for word "2" we have the world 3, 1 time
> > wordMap.insert("2", 5, 1);
> > wordMap.insert("2", 6, 1);
> > wordMap.insert("3", 3, 1);
> > wordMap.insert("3", 4, 1);
> > wordMap.insert("3", 8, 1);
> > ioManager io = new ioManager();
> > io.index(wordMap, "TestSearchIndex", "", "1");
> >
> > IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> > searcher.setSimilarity(new WordsSimilarity()); // WordsSimilarity is written below
> > Query btq = new BoostingTermQuery(new Term(WordIndex.FIELD_WORLDS, "3 3 2 1"));
> > Hits wordsHits = searcher.search(btq);
> >
> > For some reason the hits size is 0 and none of the methods overridden in
> > WordsSimilarity is called (I put a breakpoint and it didn't get there
> > during search time).
> >
> > public class WordsAnalyzer extends Analyzer
> > {
> >   public Map> wordsWorldsFreq = new HashMap>();
> >   public Map worldsFreq = new HashMap();
> >   public WordsAnalyzer()
> >   {
> >   }
> >   public WordsAnalyzer(Map worldsFreq) throws IOException
> >   {
> >     this.worldsFreq = worldsFreq;
> >   }
> >   public TokenStream tokenStream(String fieldName, Reader reader)
> >   {
> >     return new WordsFilter(new StandardTokenizer(reader), worldsFreq);
> >   }
> > }
> >
> > public class WordsFilter extends TokenFilter
> > {
> >   public Map worldsFreq;
> >   public WordsFilter(TokenStream in, Map worldsFreq)
> >   {
> >     super(in);
> >     this.worldsFreq = worldsFreq;
> >   }
> >   public final Token next(Token result) throws IOException
> >   {
> >     byte payLoad = 1;
> >     try
> >     {
> >       result = input.next(result);
> >       if(result != null)
> >       {
> >         String word = String.copyValueOf(result.termBuffer(), 0, result.termLength());
> >         payLoad = Byte.parseByte(worldsFreq.get(word).toString());
> >         result.setPayload(new Payload(new byte[] { Byte.valueOf(payLoad) }));
> >         return result;
> >       }
> >       else
> >       {
> >         return null;
> >       }
> >     }
> >     catch(Exception e)
> >     {
> >       e.printStackTrace();
> >       System.out.println(result.termBuffer() + " " + payLoad);
> >       FileUtil.writeToFile("IndexProblems.txt", "WordsFilter problem for " +
> >         result.termBuffer() + " " + payLoad + " : " + e.getStackTrace());
> >       return null;
> >     }
> >   }
> > }
> >
> > public class WordsSimilarity extends DefaultSimilarity
> > {
> >   public WordsSimilarity()
> >   {
> >   }
> >   public float tf(float freq)
> >   {
> >     return super.tf(freq); // just wanted to check whether it is called
> >   }
> >   public float scorePayload(byte[] payload, int offset, int length)
> >   {
> >     // if(length == 1)
> >     // {
> >     return payload[offset];
> >     // }
> >   }
> > }
> >
> > For the synonyms with the weights, I tried the following code:
> >
> > BooleanQuery bq = new BooleanQuery();
> > TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
> > tq.setBoost((float) 1.0);
> > bq.add(bq, BooleanClause.Occur.MUST);
> > tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "2"));
> > tq.setBoost((float) 0.5);
> > bq.add(bq, BooleanClause.Occur.SHOULD);
> > IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
> > Hits hits1 = searcher1.search(bq);
> >
> > And got the error below. Any idea what the problem is?
> >
> > at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
> > at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
> > Process exited.
> >
> > Thanks,
> > Liat
> >
> > 2009/4/21 Eran Sevi
> >
> > > Hi,
> > >
> > > You might want to take a look at Payloads. If you know the frequency of
> > > the words in each world in advance, then during tokenization for each
> > > world you could save the frequency as the payload.
> > >
> > > During searches you could use BoostingTermQuery to take the frequency
> > > into account.
> > >
> > > Eran.
> > >
> > > On Tue, Apr 21, 2009 at 4:44 PM, liat oren wrote:
> > >
> > > > Hi Doron,
> > > >
> > > > Thank you very much for the detailed answer!
> > > >
> > > > About the synonyms, I can't use WordNet as I have my own list of
> > > > synonyms. I will look at contrib/memory and see what it does.
> > > >
> > > > You understood the process of using the inverse doc correctly. About
> > > > the two problems you mentioned, scalability and ignoring the vicinity
> > > > of words:
> > > > Scalability - this is the reason I wanted to set the frequencies of the
> > > > terms. The frequencies will be used at this stage, not at the stage of
> > > > using the synonyms. When I use the synonyms, I want to use the score as
> > > > you suggested below.
> > > > Here, I have for every word the worlds in which it appears. Currently
> > > > every world appears once in a word. However, I would like it to appear
> > > > as many times as the frequency of the word in the world.
> > > > In order to avoid writing the world several times in the world field,
> > > > I would like to be able to set the freq of the specific world according
> > > > to the freq of the word in this world without actually writing it x
> > > > times (for scalability, index size and performance reasons).
> > > > So if dog appears 10 times in world 1 and 5 times in world 2, and cat
> > > > appears 5 times in world 1, then I want these frequencies to be taken
> > > > into account when computing how close the words dog and cat are. BUT I
> > > > don't want to write world 1 10 times in word dog and 5 times in word
> > > > cat, but only once, and to update the termVector so that the frequency
> > > > will get 10 and 5 respectively.
> > > > So the *generation* of the synonyms will take the frequencies into
> > > > account.
> > > >
> > > > The vicinity of words - is there any better way to take it into account?
> > > >
> > > > About the suggestion of using term boosting that will use the score of
> > > > the synonyms - if I want to query "big white dogs" and I have the
> > > > following synonyms:
> > > > big - big (1.0), large (0.9), huge (0.6)
> > > > white - white (1.0), color (0.5), offwhite (0.8)
> > > > dog - dog (1.0)
> > > > Is this the way to do it?
> > > >
> > > > BooleanQuery bq = new BooleanQuery();
> > > > TermQuery tq = new TermQuery(new Term("text", "big"));
> > > > tq.setBoost((float)1.0);
> > > > bq.add(bq, false, false);
> > > > tq = new TermQuery(new Term(("text", "large"));
> > > > tq.setBoost((float)0.9);
> > > > bq.add(bq, false);
> > > > tq = new TermQuery(new Term(("text", "huge"));
> > > > tq.setBoost((float)0.6);
> > > > bq.add(bq, false);
> > > >
> > > > tq = new TermQuery(new Term(("text", "white"));
> > > > tq.setBoost((float)1.0);
> > > > bq.add(bq, false);
> > > > tq = new TermQuery(new Term(("text", "color"));
> > > > tq.setBoost((float)0.5);
> > > > bq.add(bq, false);
> > > > // etc
> > > > IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> > > > Hits hits = searcher.search(bq);
> > > >
> > > > How will the use of BooleanQuery also look at the position of the
> > > > words? I remember reading about a score that also takes the position
> > > > of the term into account, but I didn't see this factor in the score
> > > > formula.
> > > > Thanks again, it is very helpful,
> > > > Liat
> > > >
> > > > 2009/4/21 Doron Cohen
> > > >
> > > > > Hi Liat, there are two packages under Lucene's contrib that deal
> > > > > with synonyms - contrib/memory and contrib/wordnet - which you
> > > > > may find useful. I never used these two but they seem relevant to
> > > > > what you are trying to achieve.
> > > > >
> > > > > Anyhow, it seems you compute the synonyms for word w as those
> > > > > that appear in the same set of documents ('worlds') as w, and you
> > > > > find this set by (a) indexing an inverse of the collection (docs
> > > > > become words and words become docs) and (b) using docs(w) as a
> > > > > query to find syns(w).
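
[Archive note] The inverse-index lookup Doron summarizes here (use docs(w) as a query to find syns(w)) can be sketched without Lucene. This is only an illustration under assumptions: the class and method names are hypothetical, the toy data mirrors the wordMap example quoted earlier in this message, and the ranking is a plain count of shared worlds, not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SharedWorlds {
    // Hypothetical toy index: word -> ids of the worlds it appears in.
    static Map<String, Set<Integer>> index() {
        Map<String, Set<Integer>> m = new LinkedHashMap<>();
        m.put("1", new TreeSet<>(Arrays.asList(1, 2, 3, 4)));
        m.put("2", new TreeSet<>(Arrays.asList(3, 5, 6)));
        m.put("3", new TreeSet<>(Arrays.asList(3, 4, 8)));
        return m;
    }

    // Number of worlds two words have in common.
    static int shared(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Rank every other word by how many worlds it shares with w (most first).
    static List<String> candidates(Map<String, Set<Integer>> idx, String w) {
        Set<Integer> docs = idx.get(w); // docs(w), used as the "query"
        List<String> out = new ArrayList<>();
        for (String other : idx.keySet()) {
            if (!other.equals(w) && shared(docs, idx.get(other)) > 0) {
                out.add(other);
            }
        }
        out.sort((x, y) -> Integer.compare(shared(docs, idx.get(y)),
                                           shared(docs, idx.get(x))));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(candidates(index(), "1")); // prints [3, 2]
    }
}
```

Word "3" ranks ahead of word "2" because it shares two worlds (3 and 4) with word "1" while "2" shares only one. Frequencies are deliberately ignored here, which is exactly the limitation the thread goes on to discuss.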
> > > > >
> > > > > I assume that your 'worlds' are small, each containing only a small
> > > > > set of a few related words, otherwise I would have two
> > > > > concerns with this approach: (a) scalability, and (b) in a large doc
> > > > > (world) this approach ignores the vicinity of words, which seems to
> > > > > me important to their likelihood as synonyms.
> > > > >
> > > > > Assuming you are okay here, and going back to the original question
> > > > > of altering the term frequency, perhaps taking the (search) scores of
> > > > > the returned synonyms (which you find by search) is better than just
> > > > > using their frequency? If you find this approach valid, then at least
> > > > > for some queries you should be able to use query boosts. For example
> > > > > create a BooleanQuery, add to it a TermQuery for each synonym,
> > > > > but set the boost of the TermQuery according to the synonym score.
> > > > > This is also where you could "punish" synonyms compared to the
> > > > > original word. This will only help with queries whose construction
> > > > > API takes (sub) queries as input (so it will not help with a
> > > > > PhraseQuery).
> > > > >
> > > > > - Doron
> > > > >
> > > > > On Tue, Apr 21, 2009 at 12:40 PM, liat oren wrote:
> > > > >
> > > > > > Ok, I will explain the full 'problem' and then explain how I
> > > > > > approach it:
> > > > > >
> > > > > > Let's divide it into three steps:
> > > > > >
> > > > > > 1. I have a 'dictionary' of words - for every word, I have a list
> > > > > > of worlds, which are ids of text documents that the word appears in.
> > > > > > So, for example, for the word 'dog', I have '1 1600 36000' in the
> > > > > > "worlds" field (which is tokenized when indexed) - which means that
> > > > > > the word dog appears in worlds 1, 1600 and 36000.
> > > > > >
> > > > > > 2. This index is used to choose synonyms for the word dog - using
> > > > > > the "worlds" field - I do a search on this index, giving the query
> > > > > > "1 1600 36000" as input, and thus get the words that are close to
> > > > > > the word "dog". I take the 10 closest words.
> > > > > >
> > > > > > 3. These 10 synonyms are then used to expand the query.
> > > > > >
> > > > > > Basically, I have 2 problems in this process:
> > > > > >
> > > > > > a. In the process of finding the synonyms, I would like the
> > > > > > frequency of the word in each of the worlds to be taken into
> > > > > > account, so that if 'dog' appeared 3 times in world 1, 10 times in
> > > > > > world 1600 and 4 times in world 36000, then it will be taken into
> > > > > > account.
> > > > > > I wanted to avoid "expanding" the field to be "1 1 1 1600 1600 1600
> > > > > > 1600 1600 1600 1600 1600 1600 1600 36000 36000 36000 36000".
> > > > > > Accordingly I wanted to be able to set the freq by myself.
> > > > > >
> > > > > > b. In the process of using the synonyms, I wanted to be able to set
> > > > > > a 'penalty' factor for the synonyms, together with giving different
> > > > > > weights to different synonyms, according to their score. I looked at
> > > > > > an old thread - Search for synonyms - implementation for review:
> > > > > >
> > > > > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39B0FB508E5D7540ACA5AD57225E150D39203D@xmail.me.corp.entopia.com%3e
> > > > > >
> > > > > > I don't know if it's part of Lucene now. I didn't quite understand
> > > > > > how to use it.
> > > > > > Is there a better way to approach it?
> > > > > >
> > > > > > I hope I explained it well.
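
[Archive note] Problem (a) above asks for per-world frequencies without repeating each world id in the field. As a rough, Lucene-free sketch of why a compact (world, frequency) map carries the same information as the expanded field: the class and method names below are hypothetical, the second word's frequencies are invented for illustration, and the weighted overlap is a plain sum of frequency products rather than Lucene's similarity.

```java
import java.util.Map;
import java.util.TreeMap;

public class WorldFrequencies {
    // Raw term frequencies of a whitespace-separated "worlds" field,
    // i.e. what the expanded field would yield token by token.
    static Map<String, Integer> countTokens(String field) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String tok : field.trim().split("\\s+")) {
            freq.merge(tok, 1, Integer::sum);
        }
        return freq;
    }

    // Frequency-weighted overlap: sum over shared worlds of freqA * freqB.
    static long weightedOverlap(Map<String, Integer> a, Map<String, Integer> b) {
        long sum = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                sum += (long) e.getValue() * other;
            }
        }
        return sum;
    }

    // Compact form of the 'dog' example: world id -> frequency.
    static Map<String, Integer> dog() {
        Map<String, Integer> m = new TreeMap<>();
        m.put("1", 3);
        m.put("1600", 10);
        m.put("36000", 4);
        return m;
    }

    public static void main(String[] args) {
        String expanded = "1 1 1 "
            + "1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 "
            + "36000 36000 36000 36000";
        // The compact map and the expanded field agree on every frequency.
        System.out.println(countTokens(expanded).equals(dog())); // prints true

        // Hypothetical second word: 2 times in world 1, once in world 36000.
        Map<String, Integer> other = new TreeMap<>();
        other.put("1", 2);
        other.put("36000", 1);
        System.out.println(weightedOverlap(dog(), other)); // prints 10
    }
}
```

In Lucene itself the per-term number would have to be carried some other way, for example in a payload as suggested elsewhere in this thread; the sketch only shows that nothing is lost by storing each world once with its count.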
> > > > > > Thanks,
> > > > > > Liat
> > > > > >
> > > > > > 2009/4/21 Doron Cohen
> > > > > >
> > > > > > > Depending on the problem you are trying to solve there may be
> > > > > > > other solutions to it, not requiring setting wrong (?) values for
> > > > > > > term frequencies.
> > > > > > > If you can explain what you are trying to solve, people on the
> > > > > > > list may be able to suggest such alternatives.
> > > > > > > - Doron
> > > > > > >
> > > > > > > On Sun, Apr 19, 2009 at 2:39 PM, liat oren <oren.liat@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I would like to be able to set the term freq to different
> > > > > > > > values at index time, or at search time.
> > > > > > > >
> > > > > > > > So if a document has the following text: 1 2, the freq of 1
> > > > > > > > will get 100 and the freq of 2 will get 200. I want to avoid
> > > > > > > > expanding it by writing 1 100 times.
> > > > > > > >
> > > > > > > > I looked at the Similarity class and wanted to override it, but
> > > > > > > > the tf function gets only freq, so I don't know which term this
> > > > > > > > freq relates to, thus I can't change the value.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Liat

--0015174be18a556d38046824dc4a--
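
[Archive note] On the question at the top of this message (why changing the boost from 5 to 0.5 barely moves the total score), the two explain outputs can be reproduced by hand. The sketch below assumes the default similarity, where queryNorm = 1 / sqrt(sum of (idf * boost)^2 over the clauses); the class and method names are hypothetical, and the idf, tf and fieldNorm values are copied from the explain output.

```java
public class QueryNormDemo {
    // queryNorm = 1 / sqrt(sum over clauses of (idf * boost)^2).
    static double queryNorm(double[] idf, double[] boost) {
        double sumSq = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sumSq += w * w;
        }
        return 1.0 / Math.sqrt(sumSq);
    }

    // score = sum over clauses of queryWeight * fieldWeight, where
    // queryWeight = boost * idf * queryNorm and
    // fieldWeight = tf * idf * fieldNorm (both clauses match, so coord = 1).
    static double score(double[] idf, double[] boost, double tf, double fieldNorm) {
        double qn = queryNorm(idf, boost);
        double total = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double queryWeight = boost[i] * idf[i] * qn;
            double fieldWeight = tf * idf[i] * fieldNorm;
            total += queryWeight * fieldWeight;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] idf = {0.71231794, 1.4054651}; // worlds:3, worlds:6
        double[] boostHigh = {1.0, 5.0};
        double[] boostLow = {1.0, 0.5};
        System.out.println(queryNorm(idf, boostHigh));       // about 0.1416
        System.out.println(score(idf, boostHigh, 1.0, 0.5)); // about 0.7351
        System.out.println(queryNorm(idf, boostLow));        // about 0.9994
        System.out.println(score(idf, boostLow, 1.0, 0.5));  // about 0.7471
    }
}
```

Raising the boost inflates the sum of squared weights, so queryNorm shrinks by almost the same factor and the absolute score stays in the same range. What the boost does change is the share each clause contributes: with boost 5 the "6" clause supplies roughly 95% of the total, with boost 0.5 roughly 66%. Since queryNorm is the same for every document under one query, that share is what affects ranking, so the effect is better judged by comparing several documents than by one document's absolute score.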