lucene-java-user mailing list archives

From liat oren <oren.l...@gmail.com>
Subject Re: Using Payloads
Date Sun, 26 Apr 2009 09:06:02 GMT
Thanks, Murat.
It was very useful. I also tried to override IndexWriter and
DocumentsWriter instead, but that didn't work well; DocumentsWriter can't be
overridden.

So I didn't find a better way to make the changes.

What I need is for each term to carry a different value in different
documents: just as you set the boost at the document level, I would like to
set the boost for individual terms within each document.

For that matter, I made some changes to the code you sent; the additions are
everything involving scoresMap.

Below you can find an example of its use.

**********
 private class PayloadAnalyzer extends Analyzer
 {
  private int score;
  // CHANGED: per-term scores, keyed by term text
  private Map<String, Integer> scoresMap = new HashMap<String, Integer>();

  public synchronized void setScore(int s)
  {
   score = s;
  }

  // CHANGED: set the whole term->score map before adding each document
  public synchronized void setMapScores(Map<String, Integer> scoresMap)
  {
   this.scoresMap = scoresMap;
  }

  public final TokenStream tokenStream(String field, Reader reader)
  {
   // was: new LowerCaseTokenizer(reader)
   PayloadTokenStream payToken =
       new PayloadTokenStream(new WhitespaceTokenizer(reader));
   payToken.setScore(score);
   payToken.setMapScores(scoresMap);
   return payToken;
  }
 }

 private class PayloadTokenStream extends TokenStream
 {
  private Tokenizer tok = null;
  private int score;
  // CHANGED: per-term scores
  private Map<String, Integer> scoresMap = new HashMap<String, Integer>();

  public PayloadTokenStream(Tokenizer tokenizer)
  {
   tok = tokenizer;
  }

  public void setScore(int s)
  {
   score = s;
  }

  // CHANGED
  public synchronized void setMapScores(Map<String, Integer> scoresMap)
  {
   this.scoresMap = scoresMap;
  }

  public Token next(Token t) throws IOException
  {
   t = tok.next(t);
   if (t != null)
   {
    // CHANGED: look up this term's score and store it as a one-byte payload;
    // fall back to the document-level score if the term is not in the map
    String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
    Integer termScore = scoresMap.get(word);
    byte payload = (termScore != null) ? termScore.byteValue() : (byte) score;
    t.setPayload(new Payload(new byte[] { payload }));
   }
   return t;
  }

  public void reset(Reader input) throws IOException
  {
   tok.reset(input);
  }

  public void close() throws IOException
  {
   tok.close();
  }
 }
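One thing worth noting about the change above: the payload is a single signed
byte, so a term score only survives the round trip if it fits in
Byte.MIN_VALUE..Byte.MAX_VALUE. A plain-Java illustration of that round trip
(no Lucene needed; class and method names are mine, just for the sketch):

```java
// Plain-Java illustration: the payload is one signed byte, so scores
// must fit in Byte.MIN_VALUE..Byte.MAX_VALUE (-128..127).
public class PayloadByteRoundTrip {

    // Index side: pack an int score into a one-byte payload.
    static byte[] encode(int score) {
        if (score < Byte.MIN_VALUE || score > Byte.MAX_VALUE) {
            throw new IllegalArgumentException(
                "score does not fit in one byte: " + score);
        }
        return new byte[] { (byte) score };
    }

    // Search side: read the score back out of the payload bytes.
    static int decode(byte[] payload, int offset) {
        return payload[offset];
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(3), 0));  // prints 3
        System.out.println(decode(encode(-5), 0)); // prints -5
    }
}
```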
**********************************
Example for the use of payloads:

  PayloadAnalyzer panalyzer = new PayloadAnalyzer();
  File index = new File("TestSearchIndex");
  IndexWriter iwriter = new IndexWriter(index, panalyzer,
      IndexWriter.MaxFieldLength.UNLIMITED);

  Document d = new Document();
  d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
      Field.Index.TOKENIZED, Field.TermVector.YES));
  d.add(new Field("id", "1^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
      Field.TermVector.NO));
  // Set the per-term scores for the document that is about to be analyzed
  Map<String, Integer> mapScores = new HashMap<String, Integer>();
  mapScores.put("word1", 3);
  mapScores.put("word2", 1);
  mapScores.put("word3", 1);
  panalyzer.setMapScores(mapScores);
  iwriter.addDocument(d, panalyzer);

  d = new Document();
  d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
      Field.Index.TOKENIZED, Field.TermVector.YES));
  d.add(new Field("id", "2^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
      Field.TermVector.NO));
  // Same terms, different scores for this document - this is the
  // document-dependent part that may be utilized
  mapScores = new HashMap<String, Integer>();
  mapScores.put("word1", 1);
  mapScores.put("word2", 3);
  mapScores.put("word3", 1);
  panalyzer.setMapScores(mapScores);
  iwriter.addDocument(d, panalyzer);
  /*-----------------*/
  //  iwriter.commit();
  iwriter.optimize();
  iwriter.close();

  BooleanQuery bq = new BooleanQuery();
  BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
  tq.setBoost(1.0f);
  bq.add(tq, BooleanClause.Occur.MUST);
  tq = new BoostingTermQuery(new Term("text", "word2"));
  tq.setBoost(3.0f);
  bq.add(tq, BooleanClause.Occur.SHOULD);
  tq = new BoostingTermQuery(new Term("text", "word3"));
  tq.setBoost(1.0f);
  bq.add(tq, BooleanClause.Occur.SHOULD);

  IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
  searcher1.setSimilarity(new WordsSimilarity());
  TopDocs topDocs = searcher1.search(bq, null, 3);
  Hits hits1 = searcher1.search(bq);
  for (int j = 0; j < hits1.length(); j++)
  {
   // explain() takes a document id, not the hit index
   Explanation explanation = searcher1.explain(bq, hits1.id(j));
   System.out.println("**** " + hits1.score(j) + " "
       + hits1.doc(j).getField("id").stringValue() + " *****");
   System.out.println(explanation.toString());
   System.out.println("********************************************************");
   System.out.println("Score " + topDocs.scoreDocs[j].score + " doc "
       + searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
   System.out.println("********************************************************");
  }
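For completeness: BoostingTermQuery only uses the payload if the Similarity
set on the searcher overrides scorePayload. WordsSimilarity isn't shown in
this thread, but for the one-byte payloads written above its decode step
presumably reduces to something like this (a plain-Java sketch under that
assumption; the class and method names are mine):

```java
// Sketch: what a scorePayload override would compute for the one-byte
// payloads written at index time. Hypothetical helper, not Lucene API.
public class PayloadScoreSketch {

    // Return the stored byte as the boost factor; 1.0f (neutral) when the
    // term has no payload, so it scores like a plain TermQuery.
    static float scoreOneBytePayload(byte[] payload, int offset, int length) {
        if (payload == null || length == 0) {
            return 1.0f;
        }
        return payload[offset];
    }

    public static void main(String[] args) {
        System.out.println(scoreOneBytePayload(new byte[] { 3 }, 0, 1)); // prints 3.0
        System.out.println(scoreOneBytePayload(null, 0, 0));             // prints 1.0
    }
}
```

In an actual Similarity subclass this logic would live in the scorePayload
override (in this Lucene version, something like scorePayload(String
fieldName, byte[] payload, int offset, int length)), and the value it returns
is multiplied into the term's score.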

If you try the same query with different boosts, you will get a different
order for the documents.

Does it look ok?

Thanks again!
Liat
2009/4/25 Murat Yakici <Murat.Yakici@cis.strath.ac.uk>

>
>
> Here is what I am doing, not so magical... There are two classes, an
> analyzer and a TokenStream into which I can inject my document-dependent
> data to be stored as a payload.
>
>
> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>
>    private class PayloadAnalyzer extends Analyzer {
>
>        private PayloadTokenStream payToken = null;
>        private int score;
>
>        public synchronized void setScore(int s) {
>            score=s;
>        }
>
>      public final TokenStream tokenStream(String field, Reader reader) {
>         payToken = new PayloadTokenStream(new LowerCaseTokenizer(reader));
>         payToken.setScore(score);
>         return payToken;
>        }
>    }
>
>    private class PayloadTokenStream extends TokenStream {
>
>        private Tokenizer tok = null;
>        private int score;
>
>        public PayloadTokenStream(Tokenizer tokenizer) {
>            tok = tokenizer;
>        }
>
>        public void setScore(int s) {
>            score = s;
>        }
>
>        public Token next(Token t) throws IOException {
>            t = tok.next(t);
>            if (t != null) {
>                //t.setTermBuffer("can change");
>                //Do something with the data
>                byte[] bytes = ("score:"+ score).getBytes();
>                t.setPayload(new Payload(bytes));
>            }
>            return t;
>        }
>
>        public void reset(Reader input) throws IOException {
>            tok.reset(input);
>        }
>
>        public void close() throws IOException {
>            tok.close();
>        }
>    }
>
>
>    public void doIndex() {
>        try {
>            File index = new File("./TestPayloadIndex");
>            IndexWriter iwriter = new IndexWriter(index,
>                     panalyzer,
>                     IndexWriter.MaxFieldLength.UNLIMITED);
>
>            Document d = new Document();
>            d.add(new Field("content",
>               "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
>                Field.Index.ANALYZED, Field.TermVector.YES));
>            //We set the score for the term of the document that will be
> analyzed.
>            /*I was worried about this part - document dependent score
> which may be utilized*/
>            panalyzer.setScore(5);
>            iwriter.addDocument(d, panalyzer);
>            /*-----------------*/
>            ...
>            iwriter.commit();
>            iwriter.optimize();
>            iwriter.close();
>
>            //Now read the index
>            IndexReader ireader = IndexReader.open(index);
>            TermPositions tpos = ireader.termPositions(
>                    new Term("content", "myterm")); // note: LowerCaseTokenizer
>            while (tpos.next()) {
>                int pos;
>                for(int i=0;i<tpos.freq();i++){
>                    pos=tpos.nextPosition();
>                    if (tpos.isPayloadAvailable()) {
>                        byte[] data = new byte[tpos.getPayloadLength()];
>                        tpos.getPayload(data, 0);
>                       //Utilise payloads;
>                    }
>                }
>            }
>
>            tpos.close();
>        } catch (CorruptIndexException ex) {
>           //
>        } catch (LockObtainFailedException ex) {
>            //
>        } catch (IOException ex) {
>            //
>        }
>    }
>
> I wish it was designed better... Please let me know if you guys have a
> better idea.
>
> Cheers,
> Murat
>
> > Dear Murat,
> >
> > I saw your question and wondered how you implemented these changes.
> > The requirements below are the same ones I am trying to code now.
> > Did you modify the source code itself, or did you only use Lucene's jar and
> > override code?
> >
> > I would very much appreciate it if you could give me a short explanation of
> > how it was done.
> >
> > Thanks a lot,
> > Liat
> >
> > 2009/4/21 Murat Yakici <murat.yakici@cis.strath.ac.uk>
> >
> >> Hi,
> >> I started playing with the experimental payload functionality. I have
> >> written an analyzer which adds a payload (some sort of score/boost) for
> >> each term occurrence. The payload/score for each term depends on the
> >> document that the term comes from (I guess this is the typical use case).
> >> So, say, term t1 may have a payload of 5 in doc1 and 34 in doc5. The
> >> parameter for calculating the payload changes after each
> >> indexWriter.addDocument(..) call in a while loop. I am assuming that the
> >> indexWriter.addDocument(..) methods are thread safe. Can I confirm this?
> >>
> >> Cheers,
> >>
> >> --
> >> Murat Yakici
> >> Department of Computer & Information Sciences
> >> University of Strathclyde
> >> Glasgow, UK
> >> -------------------------------------------
> >> The University of Strathclyde is a charitable body, registered in
> >> Scotland,
> >> with registration number SC015263.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
> Murat Yakici
> Department of Computer & Information Sciences
> University of Strathclyde
> Glasgow, UK
