lucene-java-user mailing list archives

From "Murat Yakici" <Murat.Yak...@cis.strath.ac.uk>
Subject Re: Using Payloads
Date Sun, 26 Apr 2009 11:27:44 GMT

Yes, this is more or less what I had in mind. However, this approach
requires some *prior knowledge* of the vocabulary of the document (or
the collection) to produce that score before the document even gets
analyzed, doesn't it? And this is the paradox I have been thinking
about. If you have that knowledge, that's fine. In addition, applications
that only require a small term window to generate a score (such as a
term-in-context score) can be implemented very easily.
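For the small-window case the score really can be computed in a single
streaming pass, with no collection-level statistics. A minimal standalone
sketch of what I mean (hypothetical names, not Lucene code):

```java
import java.util.List;

// Hypothetical "term in context" score: how often a term repeats
// within a small window of positions around one occurrence.
public class WindowScore {
    static int windowCount(List<String> tokens, int pos, int radius, String term) {
        int count = 0;
        int from = Math.max(0, pos - radius);
        int to = Math.min(tokens.size() - 1, pos + radius);
        for (int i = from; i <= to; i++) {
            // Skip the occurrence itself; count matching neighbours.
            if (i != pos && tokens.get(i).equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "b", "a", "c", "a");
        System.out.println(windowCount(tokens, 2, 2, "a")); // 2
    }
}
```

Such a score needs only the tokens already buffered by the stream, so it
avoids the prior-knowledge problem entirely.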

It is possible to inject the document-dependent boost/score generation
*logic* (an interface would do) into the Tokenizer/TokenStream. However,
I am afraid this may carry an indexing-time penalty. If your window size
is the document itself, you will be doing the same job twice (counting
the number of times a term occurs in doc X, computing index-time weights,
etc.). IndexWriter already does all of this somewhere deep down.
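To make the interface idea concrete, here is a minimal sketch of what
could be injected, decoupled from Lucene. `TermScoreProvider` and
`MapTermScoreProvider` are made-up names, and the provider would be
swapped in before each addDocument(..) call:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface the TokenStream would consult for each term.
interface TermScoreProvider {
    int scoreFor(String term);
}

// Map-backed provider; the caller fills it per document.
class MapTermScoreProvider implements TermScoreProvider {
    private final Map<String, Integer> scores = new HashMap<String, Integer>();

    void put(String term, int score) {
        scores.put(term, score);
    }

    // Default to a neutral score of 1 for unseen terms, so the token
    // stream never needs a null check.
    public int scoreFor(String term) {
        Integer s = scores.get(term);
        return (s != null) ? s : 1;
    }
}
```

The TokenStream would then call scoreFor(..) instead of holding its own
map, which at least isolates the score-generation logic; it does not, of
course, remove the duplicated counting.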


Simply put, I want to attach scores to documents/terms, but I cannot
generate those scores before I have observed the document/terms. If I
did, I would replicate some of the work that IndexWriter is already
doing.

If I remember correctly, there is also some intention to add
document-payload functionality, and I have the same concerns about it.
So I think we need a clear view of the topic: where is the payload work
heading, and how can we generate a score without duplicating work that
IndexWriter is already doing? I believe Michael Busch is working on
document payloads for release 3.0. I would appreciate it if someone
could enlighten us on how that would work and how it could be utilised,
particularly during the analysis phase.
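For the query side of this thread: as far as I understand it,
BoostingTermQuery asks the Similarity's scorePayload(..) to turn the
stored bytes back into a score multiplier, which is presumably what the
WordsSimilarity in the quoted code overrides. A standalone sketch of just
the decoding step (assuming the single-byte payloads used in this thread;
`PayloadDecoder` is a made-up name, not a Similarity subclass):

```java
// Decodes a one-byte payload back into a score multiplier; in a real
// Similarity subclass this logic would live inside scorePayload(..).
public class PayloadDecoder {
    public static float decode(byte[] payload, int offset, int length) {
        // No payload stored at this position: neutral multiplier.
        if (payload == null || length == 0) {
            return 1.0f;
        }
        return (float) payload[offset];
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] { 5 }, 0, 1)); // 5.0
        System.out.println(decode(null, 0, 0));             // 1.0
    }
}
```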


Cheers,
Murat

> Thanks, Murat.
> It was very useful - I also tried to override IndexWriter and
> DocumentsWriter instead, but it didn't work well; DocumentsWriter
> can't be overridden.
>
> So I didn't find a better way to make the changes.
>
> What I need is a different value for each term in each document.
> So, like you set the boost at the document level, I would like to set
> the boost for different terms within different documents.
>
> For that matter, I made some changes to the code you sent:
>
> Below you can find an example of its use.
>
> **********
>  private class PayloadAnalyzer extends Analyzer
>  {
>   private PayloadTokenStream payToken = null;
>   private int score;
>   private Map<String, Integer> scoresMap =
>     new HashMap<String, Integer>(); // changed
>
>   public synchronized void setScore(int s)
>   {
>    score = s;
>   }
>
>   // changed
>   public synchronized void setMapScores(Map<String, Integer> scoresMap)
>   {
>    this.scoresMap = scoresMap;
>   }
>
>   public final TokenStream tokenStream(String field, Reader reader)
>   {
>    payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader));
>    payToken.setScore(score);
>    payToken.setMapScores(scoresMap); // changed
>    return payToken;
>   }
>  }
>
>  private class PayloadTokenStream extends TokenStream
>  {
>   private Tokenizer tok = null;
>   private int score;
>   private Map<String, Integer> scoresMap =
>     new HashMap<String, Integer>(); // changed
>
>   public PayloadTokenStream(Tokenizer tokenizer)
>   {
>    tok = tokenizer;
>   }
>
>   public void setScore(int s)
>   {
>    score = s;
>   }
>
>   // changed
>   public synchronized void setMapScores(Map<String, Integer> scoresMap)
>   {
>    this.scoresMap = scoresMap;
>   }
>
>   public Token next(Token t) throws IOException
>   {
>    t = tok.next(t);
>    if (t != null)
>    {
>     // changed: look up the per-term score and store it as a
>     // single-byte payload, defaulting to 1 for unknown terms
>     String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
>     Integer termScore = scoresMap.get(word);
>     byte payload = (termScore != null) ? termScore.byteValue() : (byte) 1;
>     t.setPayload(new Payload(new byte[] { payload }));
>    }
>    return t;
>   }
>
>   public void reset(Reader input) throws IOException
>   {
>    tok.reset(input);
>   }
>
>   public void close() throws IOException
>   {
>    tok.close();
>   }
>  }
> **********************************
> Example for the use of payloads:
>
>   PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>   File index = new File("TestSearchIndex");
>   IndexWriter iwriter = new IndexWriter(index, panalyzer,
>     IndexWriter.MaxFieldLength.UNLIMITED);
>
>   Document d = new Document();
>   d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>     Field.Index.TOKENIZED, Field.TermVector.YES));
>   d.add(new Field("id", "1^3", Field.Store.YES,
>     Field.Index.UN_TOKENIZED, Field.TermVector.NO));
>   Map<String, Integer> mapScores = new HashMap<String, Integer>();
>   mapScores.put("word1", 3);
>   mapScores.put("word2", 1);
>   mapScores.put("word3", 1);
>   panalyzer.setMapScores(mapScores);
>   iwriter.addDocument(d, panalyzer);
>
>   d = new Document();
>   d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>     Field.Index.TOKENIZED, Field.TermVector.YES));
>   d.add(new Field("id", "2^3", Field.Store.YES,
>     Field.Index.UN_TOKENIZED, Field.TermVector.NO));
>   //We set the per-term scores for the document that will be analyzed.
>   /*I was worried about this part - document-dependent scores
>     which may be utilized*/
>   mapScores = new HashMap<String, Integer>();
>   mapScores.put("word1", 1);
>   mapScores.put("word2", 3);
>   mapScores.put("word3", 1);
>   panalyzer.setMapScores(mapScores);
>   iwriter.addDocument(d, panalyzer);
>   /*-----------------*/
>
>   //iwriter.commit();
>   iwriter.optimize();
>   iwriter.close();
>
>   BooleanQuery bq = new BooleanQuery();
>   BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
>   tq.setBoost(1.0f);
>   bq.add(tq, BooleanClause.Occur.MUST);
>   tq = new BoostingTermQuery(new Term("text", "word2"));
>   tq.setBoost(3.0f);
>   bq.add(tq, BooleanClause.Occur.SHOULD);
>   tq = new BoostingTermQuery(new Term("text", "word3"));
>   tq.setBoost(1.0f);
>   bq.add(tq, BooleanClause.Occur.SHOULD);
>
>   IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
>   searcher1.setSimilarity(new WordsSimilarity());
>   TopDocs topDocs = searcher1.search(bq, null, 3);
>   Hits hits1 = searcher1.search(bq);
>   for (int j = 0; j < hits1.length(); j++)
>   {
>    Explanation explanation = searcher1.explain(bq, hits1.id(j));
>    System.out.println("**** " + hits1.score(j) + " " +
>      hits1.doc(j).getField("id").stringValue() + " *****");
>    System.out.println(explanation.toString());
>    System.out.println("************************************************");
>    System.out.println("Score " + topDocs.scoreDocs[j].score + " doc " +
>      searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
>    System.out.println("************************************************");
>   }
>
> If you try the same query with different boosts, you will get a
> different order for the documents.
>
> Does it look ok?
>
> Thanks again!
> Liat
> 2009/4/25 Murat Yakici <Murat.Yakici@cis.strath.ac.uk>
>
>>
>>
>> Here is what I am doing, not so magical... There are two classes, an
>> analyzer and a TokenStream into which I can inject my
>> document-dependent data to be stored as payload.
>>
>>
>> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>>
>>    private class PayloadAnalyzer extends Analyzer {
>>
>>        private PayloadTokenStream payToken = null;
>>        private int score;
>>
>>        public synchronized void setScore(int s) {
>>            score=s;
>>        }
>>
>>      public final TokenStream tokenStream(String field, Reader reader) {
>>         payToken = new PayloadTokenStream(new LowerCaseTokenizer(reader));
>>         payToken.setScore(score);
>>         return payToken;
>>        }
>>    }
>>
>>    private class PayloadTokenStream extends TokenStream {
>>
>>        private Tokenizer tok = null;
>>        private int score;
>>
>>        public PayloadTokenStream(Tokenizer tokenizer) {
>>            tok = tokenizer;
>>        }
>>
>>        public void setScore(int s) {
>>            score = s;
>>        }
>>
>>        public Token next(Token t) throws IOException {
>>            t = tok.next(t);
>>            if (t != null) {
>>                //t.setTermBuffer("can change");
>>                //Do something with the data
>>                byte[] bytes = ("score:"+ score).getBytes();
>>                t.setPayload(new Payload(bytes));
>>            }
>>            return t;
>>        }
>>
>>        public void reset(Reader input) throws IOException {
>>            tok.reset(input);
>>        }
>>
>>        public void close() throws IOException {
>>            tok.close();
>>        }
>>    }
>>
>>
>>    public void doIndex() {
>>        try {
>>            File index = new File("./TestPayloadIndex");
>>            IndexWriter iwriter = new IndexWriter(index, panalyzer,
>>                     IndexWriter.MaxFieldLength.UNLIMITED);
>>
>>            Document d = new Document();
>>            d.add(new Field("content",
>>               "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
>>                Field.Index.ANALYZED, Field.TermVector.YES));
>>            //We set the score for the terms of the document that
>>            //will be analyzed.
>>            /*I was worried about this part - document dependent score
>> which may be utilized*/
>>            panalyzer.setScore(5);
>>            iwriter.addDocument(d, panalyzer);
>>            /*-----------------*/
>>            ...
>>            iwriter.commit();
>>            iwriter.optimize();
>>            iwriter.close();
>>
>>            //Now read the index
>>            IndexReader ireader = IndexReader.open(index);
>>            TermPositions tpos = ireader.termPositions(
>>                  new Term("content", "myterm")); //note LowerCaseTokenizer
>>            while (tpos.next()) {
>>                int pos;
>>                for(int i=0;i<tpos.freq();i++){
>>                    pos=tpos.nextPosition();
>>                    if (tpos.isPayloadAvailable()) {
>>                        byte[] data = new byte[tpos.getPayloadLength()];
>>                        tpos.getPayload(data, 0);
>>                       //Utilise payloads;
>>                    }
>>                }
>>            }
>>
>>            tpos.close();
>>        } catch (CorruptIndexException ex) {
>>           //
>>        } catch (LockObtainFailedException ex) {
>>            //
>>        } catch (IOException ex) {
>>            //
>>        }
>>    }
>>
>> I wish it was designed better... Please let me know if you guys have a
>> better idea.
>>
>> Cheers,
>> Murat
>>
>> > Dear Murat,
>> >
>> > I saw your question and wondered how you implemented these changes.
>> > The requirements below are the same ones I am trying to code now.
>> > Did you modify the source code itself, or did you only use Lucene's
>> > jar and override code?
>> >
>> > I would very much appreciate it if you could give me a short
>> > explanation of how it was done.
>> >
>> > Thanks a lot,
>> > Liat
>> >
>> > 2009/4/21 Murat Yakici <murat.yakici@cis.strath.ac.uk>
>> >
>> >> Hi,
>> >> I started playing with the experimental payload functionality. I have
>> >> written an analyzer which adds a payload (some sort of a score/boost)
>> >> for each term occurrence. The payload/score for each term depends on
>> >> the document that the term comes from (I guess this is the typical
>> >> use case). So, say, term t1 may have a payload of 5 in doc1 and 34 in
>> >> doc5. The parameter for calculating the payload changes after each
>> >> indexWriter.addDocument(..) call in a while loop. I am assuming that
>> >> the indexWriter.addDocument(..) methods are thread safe. Can I
>> >> confirm this?
>> >>
>> >> Cheers,
>> >>
>> >> --
>> >> Murat Yakici
>> >> Department of Computer & Information Sciences
>> >> University of Strathclyde
>> >> Glasgow, UK
>> >> -------------------------------------------
>> >> The University of Strathclyde is a charitable body, registered in
>> >> Scotland,
>> >> with registration number SC015263.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>>
>>
>>
>

