From: liat oren <oren.liat@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Boosting query - debuging
Date: Mon, 4 May 2009 09:43:03 +0300

I looked into the output again and saw that the explain method explains a
different result than the document I thought it did.

Within the loop over the results, I replaced

    int docId = hits[j].doc;
    Document curDoc = searcher.doc(docId);

with

    Document curDoc = searcher.doc(j);

So I got the right explain for the document. The strange things I got are:

1. The explain is much shorter, as you can see below.
2. The score of finlin (1.6479614) is different from the one in the explain
   (0.3433253).
3. I think it is because of the fieldNorm. Why is it different from the one
   of TTD?

finlin, score: 1.6479614

0.3433253 = (MATCH) fieldWeight(worlds:666666 in 0), product of:
  0.70710677 = (MATCH) btq, product of:
    0.70710677 = tf(phraseFreq=0.5)
    1.0 = scorePayload(...)
  0.7768564 = idf(worlds: 666666=4)
  0.625 = fieldNorm(field=worlds, doc=0)

TTD, score: 1.6479614

1.6479613 = (MATCH) fieldWeight(worlds:666666 in 1), product of:
  2.1213202 = (MATCH) btq, product of:
    0.70710677 = tf(phraseFreq=0.5)
    3.0 = scorePayload(...)
  0.7768564 = idf(worlds: 666666=4)
  1.0 = fieldNorm(field=worlds, doc=1)

Thanks again,
Liat

2009/5/3 liat oren

> Hi,
>
> I am trying to debug a boosting query.
> Is there a way to see the term boost in the documents?
> I see them in spans in BoostingTermQuery, yet from there I can't see which
> document I am in. If I want to copy some of the documents in an index that
> saves the boosting - how can it be done?
>
> The problem I am facing is that I get unexpected results. If for word "a"
> I have the worlds "1111" (boosting 3) and "2222", and for word "b" I have
> the world "1111", then when I search for "1111" (boosting 5), word "a" gets
> better results.
>
> When I debugged it, I saw that the boosting is always three, but since the
> index has a lot of documents, I tried to do the same on a smaller index.
>
> I put only two words, as you can see in the code below (I included all the
> methods and classes needed to run it).
>
> The problem I saw here is the scorePayload in the explain output - it took
> a different value from the one I indexed.
> You can see below the output: for TTD, 1.0 = scorePayload(...),
> and for finlin, 3.0 = scorePayload(...),
> while the boosting I used was the opposite - for TTD I used 3, and for
> finlin I used 1.
>
> The scorePayload should be the factor I put in when I indexed, right?
>
> Thanks a lot,
> Liat
>
> TTD, score: 1.2611988
>
> 0.26274973 = (MATCH) weight(worlds:666666 in 0), product of:
>   0.99999994 = queryWeight(worlds:666666), product of:
>     0.5945349 = idf(worlds: 666666=2)
>     1.681987 = queryNorm
>   0.26274976 = (MATCH) fieldWeight(worlds:666666 in 0), product of:
>     0.70710677 = (MATCH) btq, product of:
>       0.70710677 = tf(phraseFreq=0.5)
>       1.0 = scorePayload(...)
>     0.5945349 = idf(worlds: 666666=2)
>     0.625 = fieldNorm(field=worlds, doc=0)
> ********************************************************
> finlin, score: 0.26274976
>
> 1.2611988 = (MATCH) weight(worlds:666666 in 1), product of:
>   0.99999994 = queryWeight(worlds:666666), product of:
>     0.5945349 = idf(worlds: 666666=2)
>     1.681987 = queryNorm
>   1.2611989 = (MATCH) fieldWeight(worlds:666666 in 1), product of:
>     2.1213202 = (MATCH) btq, product of:
>       0.70710677 = tf(phraseFreq=0.5)
>       3.0 = scorePayload(...)
>     0.5945349 = idf(worlds: 666666=2)
>     1.0 = fieldNorm(field=worlds, doc=1)
>
> *The code*
>
> public class Test
> {
>   public Test()
>   {
>   }
>
>   public static void main(String[] args) throws IOException, Exception
>   {
>     Test st = new Test();
>     st.index();
>     st.testRealIndex();
>   }
>
>   public void index() throws IOException
>   {
>     DoubleMap wordMap = new DoubleMap();
>     wordMap.insert("TTD", 666666, 3);
>     wordMap.insert("finlin", 666666, 1);
>     wordMap.insert("finlin", 222222, 2);
>     index(wordMap, "wordIndexTry", "", "0");
>   }
>
>   public synchronized void index(DoubleMap doubleMap, String dirPath,
>       String originalPath, String includeFreq) throws IOException
>   {
>     File f = new File(dirPath);
>     PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>     // append to an existing index, otherwise create a new one
>     IndexWriter writer;
>     if (f.exists())
>     {
>       writer = new IndexWriter(dirPath, panalyzer, false);
>     }
>     else
>     {
>       writer = new IndexWriter(dirPath, panalyzer, true);
>     }
>     Iterator it = doubleMap.getMap().entrySet().iterator();
>     int count = 0;
>     int size = doubleMap.getMap().size();
>     while (it.hasNext())
>     {
>       count++;
>       Map.Entry entry = (Map.Entry) it.next();
>       String word = entry.getKey().toString();
>       Word w = new Word();
>       w.word = word;
>       Date date = new Date();
>       System.out.println(date.toString() + " : Updating word " + word
>           + " (" + count + " out of " + size + ") FROM " + originalPath);
>       Map innerMap = (Map) entry.getValue();
>       Map<String, Integer> scoresMap = processMap(writer, panalyzer,
>           innerMap, entry, w, dirPath,
>           includeFreq);
>       index(writer, panalyzer, innerMap, scoresMap, w, dirPath, includeFreq);
>     }
>     System.out.println("Optimizing " + dirPath + " ...");
>     writer.optimize();
>     writer.close();
>   }
>
>   public synchronized Map<String, Integer> processMap(IndexWriter writer,
>       PayloadAnalyzer panalyzer, Map innerMap, Map.Entry entry, Word w,
>       String dirPath, String includeFreq) throws IOException
>   {
>     Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>     Iterator worldsIter = innerMap.entrySet().iterator();
>     String worlds = "";
>     synchronized (worldsIter)
>     {
>       while (worldsIter.hasNext())
>       {
>         Map.Entry worldsEntry = (Map.Entry) worldsIter.next();
>         String world = worldsEntry.getKey().toString();
>         int freq = (int) Double.parseDouble(worldsEntry.getValue().toString());
>         scoresMap.put(world, freq);
>         worlds += world + " ";
>         FileUtil.writeToFile("Output\\WordWorldsFreq.txt", w.word
>             + Constants.TAB_SEP + world + Constants.TAB_SEP + freq);
>       }
>     }
>     // hand the per-world boosts to the analyzer before the document is added
>     panalyzer.setMapScores(scoresMap);
>     return scoresMap;
>   }
>
>   public synchronized void index(IndexWriter writer, PayloadAnalyzer panalyzer,
>       Map innerMap, Map<String, Integer> scoresMap, Word w, String dirPath,
>       String includeFreq) throws IOException
>   {
>     System.out.println("indexing");
>     w.worldsMap = innerMap;
>     WordIndex wi = new WordIndex(w);
>     wi.createDocument(includeFreq);
>     writer.addDocument(wi.getDocument());
>   }
>
>   public void testRealIndex() throws IOException
>   {
>     String word = "TTD";
>     String worlds = "666666";
>     DoubleMap wordsWorldsFreqMap = new DoubleMap();
>     wordsWorldsFreqMap.insert("TTD", 666666, 1.0);
>     BoostingBooleanQueryParser bbqp = new BoostingBooleanQueryParser();
>     BooleanQuery bq = bbqp.parse(word, worlds, wordsWorldsFreqMap, "worlds");
>     IndexSearcher searcher = new IndexSearcher("wordIndexTry");
>     searcher.setSimilarity(new WordsSimilarity());
>     TopDocCollector collector = new TopDocCollector(30);
>     searcher.search(bq, collector);
>     ScoreDoc[] hits =
>         collector.topDocs().scoreDocs;
>     for (int j = 0; j < Math.min(hits.length, 10); j++)
>     {
>       int docId = hits[j].doc;
>       Document curDoc = searcher.doc(docId);
>       System.out.println(curDoc.getField("word").stringValue() + ", score: "
>           + hits[j].score);
>       // Note: explain() expects a document id, so passing the hit rank j
>       // here can explain a different document than curDoc.
>       Explanation explanation = searcher.explain(bq, j);
>       System.out.println(explanation.toString());
>       String sym = curDoc.getField("word").stringValue();
>     }
>   }
>
>   public abstract class Index
>   {
>     protected Document doc = new Document();
>
>     public Index()
>     {
>     }
>
>     public Document getDocument()
>     {
>       return doc;
>     }
>
>     public void setDocument(Document d)
>     {
>       this.doc = d;
>     }
>   }
>
>   public class WordIndex extends Index
>   {
>     protected Word w;
>     public String FIELD_WORD = "word";
>     public String FIELD_WORLDS = "worlds";
>
>     public WordIndex(Word w)
>     {
>       this.w = w;
>     }
>
>     public void createDocument(String includeFreq)
>         throws java.io.FileNotFoundException
>     {
>       // make a new, empty document
>       doc = new Document();
>       doc.add(new Field(FIELD_WORD, w.word, Field.Store.YES,
>           Field.Index.NOT_ANALYZED));
>       doc.add(new Field(FIELD_WORLDS,
>           String.valueOf(w.getWorldIds(includeFreq)), Field.Store.YES,
>           Field.Index.ANALYZED, Field.TermVector.YES));
>     }
>
>     public Document getDoc(String word, String indexPath) throws IOException
>     {
>       IndexSearcher mapSearcher = new IndexSearcher(indexPath);
>       TermQuery mapQuery = new TermQuery(new Term(FIELD_WORD, word));
>       Hits mapHits = mapSearcher.search(mapQuery);
>       if (mapHits.length() != 0)
>       {
>         return mapHits.doc(0);
>       }
>       return null;
>     }
>   }
>
>   public class Word
>   {
>     public String word;
>     public Map worldsMap = new HashMap();
>
>     public Word()
>     {
>     }
>
>     public String getWorldIds(String includeFreq)
>     {
>       String worlds = "";
>       Iterator iter = worldsMap.entrySet().iterator();
>       while (iter.hasNext())
>       {
>         Map.Entry entry = (Map.Entry) iter.next();
>         if (includeFreq.equals("1"))
>         {
>           int freq = (int) Double.parseDouble(entry.getValue().toString());
>           for (int i = 0; i <
>               freq; i++)
>           {
>             worlds += entry.getKey().toString() + " ";
>           }
>         }
>         else
>         {
>           worlds += entry.getKey().toString() + " ";
>         }
>       }
>       return worlds;
>     }
>   }
>
>   public class DoubleMap
>   {
>     private Map<String, Map<Long, Double>> map;
>     public Map worldsListMap = new HashMap();
>     public List entriesList = new ArrayList();
>
>     public DoubleMap()
>     {
>       map = new HashMap<String, Map<Long, Double>>();
>     }
>
>     public void insert(String word, long worldId, double beta)
>     {
>       if (map.get(word) != null)
>       {
>         Map<Long, Double> innerMap = map.get(word);
>         if (innerMap.get(worldId) != null)
>         {
>           return;
>         }
>         innerMap.put(worldId, beta);
>         map.put(word, innerMap);
>       }
>       else
>       {
>         Map<Long, Double> innerMap = new HashMap<Long, Double>();
>         innerMap.put(worldId, beta);
>         map.put(word, innerMap);
>       }
>     }
>
>     public void insert(String word, long worldId, double beta, int size)
>     {
>       if (map.get(word) != null)
>       {
>         Map<Long, Double> innerMap = map.get(word);
>         if (innerMap.get(worldId) != null)
>         {
>           return;
>         }
>         if (innerMap.size() == size)
>         {
>           long minWorldId = getMinItem(innerMap);
>           innerMap.remove(minWorldId);
>         }
>         innerMap.put(worldId, beta);
>         map.put(word, innerMap);
>       }
>       else
>       {
>         Map<Long, Double> innerMap = new HashMap<Long, Double>();
>         innerMap.put(worldId, beta);
>         map.put(word, innerMap);
>       }
>     }
>
>     private long getMinItem(Map<Long, Double> innerMap)
>     {
>       // Note: this returns the last key iterated,
>       // which is not necessarily the minimum.
>       Iterator it = innerMap.entrySet().iterator();
>       long worldId = -1;
>       while (it.hasNext())
>       {
>         Map.Entry entry = (Map.Entry) it.next();
>         worldId = Long.parseLong(entry.getKey().toString());
>       }
>       return worldId;
>     }
>
>     public Map<String, Map<Long, Double>> getMap()
>     {
>       return map;
>     }
>   }
>
>   public class BoostingBooleanQueryParser
>   {
>     public BoostingBooleanQueryParser()
>     {
>     }
>
>     public BooleanQuery parse(String word, String worlds,
>         DoubleMap wordsWorldsFreqMap, String fieldName) throws IOException
>     {
>       BooleanQuery bq = new BooleanQuery();
>       String[] splitWorlds = worlds.split(" ");
>       for (int i = 0; i <
>           splitWorlds.length; i++)
>       {
>         double freq = wordsWorldsFreqMap.getMap().get(word)
>             .get(Long.parseLong(splitWorlds[i]));
>         BoostingTermQuery tq =
>             new BoostingTermQuery(new Term(fieldName, splitWorlds[i]));
>         tq.setBoost((float) freq);
>         bq.add(tq, BooleanClause.Occur.SHOULD);
>       }
>       return bq;
>     }
>   }
>
>   public class PayloadAnalyzer extends Analyzer
>   {
>     private PayloadTokenStream payToken = null;
>     private int score;
>     private Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>
>     public synchronized void setScore(int s)
>     {
>       score = s;
>     }
>
>     public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>       this.scoresMap = scoresMap;
>     }
>
>     public final TokenStream tokenStream(String field, Reader reader)
>     {
>       payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader));
>       payToken.setScore(score);
>       payToken.setMapScores(scoresMap);
>       return payToken;
>     }
>   }
>
>   public class PayloadTokenStream extends TokenStream
>   {
>     private Tokenizer tok = null;
>     private int score;
>     private Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>
>     public PayloadTokenStream(Tokenizer tokenizer)
>     {
>       tok = tokenizer;
>     }
>
>     public void setScore(int s)
>     {
>       score = s;
>     }
>
>     public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>       this.scoresMap = scoresMap;
>     }
>
>     public Token next(Token t) throws IOException
>     {
>       t = tok.next(t);
>       if (t != null)
>       {
>         String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
>         if (word != null && !word.equals(""))
>         {
>           // store the boost as a one-byte payload, clamped to 127
>           int score = scoresMap.get(word);
>           if (score > 127)
>           {
>             score = 127;
>           }
>           t.setPayload(new Payload(new byte[] { (byte) score }));
>         }
>       }
>       return t;
>     }
>
>     public void reset(Reader input) throws IOException
>     {
>       tok.reset(input);
>     }
>
>     public void close() throws IOException
>     {
>       tok.close();
>     }
>   }
> }
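[Editor's note on the scorePayload question: the payload written by PayloadTokenStream above is a single byte, so a boost is clamped to 127 and whatever byte was stored for that document is exactly what scorePayload reads back; the value is not changed by the encoding itself. Since the loop in testRealIndex passes the hit rank j to explain() instead of hits[j].doc, the explanation printed next to a name can belong to the other document, which would make the boosts appear swapped. A minimal, self-contained sketch of the encode/decode round trip; the class and method names here are illustrative, not Lucene API:]

```java
public class PayloadByteDemo {
    // Encode an integer boost the way PayloadTokenStream does:
    // clamp to 127, then store it as a single byte.
    static byte encodeBoost(int boost) {
        if (boost > 127) {
            boost = 127; // one payload byte cannot hold a larger positive value
        }
        return (byte) boost;
    }

    // A Similarity#scorePayload implementation would read this byte back
    // and return it as the payload factor.
    static int decodeBoost(byte payload) {
        return payload;
    }

    public static void main(String[] args) {
        System.out.println(decodeBoost(encodeBoost(3)));   // 3  (TTD's boost)
        System.out.println(decodeBoost(encodeBoost(1)));   // 1  (finlin's boost)
        System.out.println(decodeBoost(encodeBoost(300))); // 127 (clamped)
    }
}
```

[So a boost of 3 indexed for TTD can only ever come back as 3.0; a 1.0 in its explanation means the explanation belongs to a document whose stored byte was 1.]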
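[Editor's note on the fieldNorm question in the reply above: with the default Similarity, fieldNorm is 1/sqrt(number of terms in the field), multiplied by any field or document boost, and stored in a single byte with limited precision. TTD's worlds field holds one term (666666) and finlin's holds two (666666 and 222222), which accounts for the 1.0 and 0.625 in the explain output. A small sketch, assuming the default lengthNorm formula; the class and method names are illustrative:]

```java
public class LengthNormDemo {
    // Default Similarity lengthNorm: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // TTD's "worlds" field: one term -> norm 1.0
        System.out.println(lengthNorm(1)); // 1.0
        // finlin's "worlds" field: two terms -> ~0.7071 raw; Lucene stores
        // norms in one byte, which rounds this down to the 0.625 printed
        // as fieldNorm(field=worlds, doc=0) in the explain output.
        System.out.println(lengthNorm(2));
    }
}
```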