lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From liat oren <oren.l...@gmail.com>
Subject Re: Using Payloads
Date Mon, 27 Apr 2009 09:12:27 GMT
Yes, I agree with you - I also tried this approach in the past and it was
terribely slow - looping on the term vectors.

What I have done - is dividing indexes into steps - which of course, if can
be avoided, it will be more than great!!

As for my problem -  it was a code problems, I sloved it, thanks.
Is it recommended to use multi-thread in order to index?

Best,
Liat
2009/4/26 Murat Yakici <Murat.Yakici@cis.strath.ac.uk>

>
>
> See my comments:
>
> > Yes, for this specific part, I have this prior knowledge which is based
> on
> > a
> > training set.
> > About the things you raise here, there are two things you might mean, I
> am
> > not sure:
> >
> > 1. If you don't have that "prior" knowledge, then all it means you need
> to
> > modify the formula of the score, no? to give more weight to factors you
> > think to be more significant.
> > The TermFreqVector will have the term frequencies and you will be able to
> > edit the score formula so it will fit your needs.
> >
> Although I like the TermFreqVector approach, the API and everything, it is
> slower than using TermEnum and TermDocs. I don't have solid statistics,
> but I confirmed the fact during my development work on a 80+ core SG
> server. I wish TermFreqVector approach could load abit faster.
>
>
>
> > Or
> >
> > 2. Enable us to add statistics factors while indexing
>
> This is what I am talking about, *Indexing time*.
>
> >
> > Question - why do you want to edit these while indexing and not using
> them
> > at "search" time in the way you desire? At this stage you already have
> all
> > the statistics - frequencies of all terms within and outside documents
>
> First of all you don't have all the statistics. Because, there are some
> statistics that you simple can't calculate (or more correctly you
> shouldn't be due to performance) during the query scoring time. It will be
> an overkill.
>
> >
> >
> > As for my solution:
> > I tried to add documents to the index and for every document I have a
> > differnt map of factors for every term.
> > However, I get an exception: Exception in thread "main"
> > java.util.ConcurrentModificationException
> > It seems like two threads - one reads the map, one edits it (probably for
> > the next document), bump into each other.
> >
>
> Are you using a single Thread to do all the indexing or multiple threads
> adding document to the index? You need to explain the situation a bit
> more.
>
> Murat
>
> > I tried to put the read part and write part in two different synchronied
> > methods, but it still shows the same exception.
> >
> > Any idea how can it be solved?
> >
> > Best,
> > Liat
> >
> >
> > 2009/4/26 Murat Yakici <Murat.Yakici@cis.strath.ac.uk>
> >
> >>
> >> Yes, this is more or less what I had in mind. However, for this approach
> >> one requires some *prior knowledge* of the vocabulary of the document
> >> (or
> >> the collection) to produce that score before even it gets analyzed,
> >> isn't
> >> it? And this is the paradox that I have been thinking. If you have that
> >> knowledge, that's fine. In addition, for applications that only require
> >> a
> >> small term window to generate a score (such as term in context score)
> >> this
> >> can be implemented very easy.
> >>
> >> It is possible to inject the document dependent boost/score generation
> >> *logic* (an interface would do) to the Tokenizer/TokenStream. However, I
> >> am afraid this may have an indexing time penalty. If your window size is
> >> the document itself, you will be doing the same job twice (calculating
> >> the
> >> num of times a term occurs in doc X, index time weights etc.).
> >> IndexWriter
> >> already does these somewhere down deep.
> >>
> >>
> >> Simply put, I want to add some scores to documents/terms, but I can't
> >> generate that score before I observe the document/terms. If I do that I
> >> would replicate some of the work that is being already done by
> >> IndexWriter.
> >>
> >> If I remember it correctly, there is also some intention to add document
> >> payloads functionality. I have the same concerns on this. So I think we
> >> need a clear view on the topic. Where is the payload work moving? How we
> >> can generate a score without duplicating some of the work that
> >> IndexWriter
> >> is doing?   I guess Michael Busch is working on document payloads for
> >> release 3.0. I would appreciate if someone can enlighten us on how that
> >> would work and can be utilised, in particularly during the analysis
> >> phase?
> >>
> >>
> >> Cheers,
> >> Murat
> >>
> >> > Thanks, Murat.
> >> > It was very useful - I also tried to override IndexWriter and
> >> > DocumentsWriter instead, but it didn't work well. DocumentsWriter
> >> can't
> >> > be
> >> > overriden.
> >> >
> >> > So, I didn't find a better way to make the changes.
> >> >
> >> > My needs are having for every term in different documents different
> >> > values.
> >> > So, like you set the boost at the document level, I would like to set
> >> the
> >> > boost for different terms within differnt documents.
> >> >
> >> > For that matter, I made some changes in the code you sent - (I
> >> coloured
> >> > the
> >> > changes in red):
> >> >
> >> > Below you can find an example for the use of it
> >> >
> >> > **********
> >> >  private class PayloadAnalyzer extends Analyzer
> >> >  {
> >> >   private PayloadTokenStream payToken = null;
> >> >   private int score;
> >> >   *private Map<String, Integer> scoresMap = new HashMap<String,
> >> > Integer>();*
> >> >   public synchronized void setScore(int s)
> >> >   {
> >> >    score = s;
> >> >   }
> >> > *  public synchronized void setMapScores(Map<String, Integer>
> >> scoresMap)
> >> >   {
> >> >    this.scoresMap = scoresMap;
> >> >   }*
> >> >   public final TokenStream tokenStream(String field, Reader reader)
> >> >   {
> >> >    payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader));
> >> > //new
> >> > LowerCaseTokenizer(reader));
> >> >    payToken.setScore(score);
> >> >    payToken.setMapScores(scoresMap);
> >> >    return payToken;
> >> >   }
> >> >  }
> >> >  private class PayloadTokenStream extends TokenStream
> >> >  {
> >> >   private Tokenizer tok = null;
> >> >   private int score;
> >> >   *private Map<String, Integer> scoresMap = new HashMap<String,
> >> > Integer>();*
> >> >   public PayloadTokenStream(Tokenizer tokenizer)
> >> >   {
> >> >    tok = tokenizer;
> >> >   }
> >> >   public void setScore(int s)
> >> >   {
> >> >    score = s;
> >> >   }
> >> >  * public synchronized void setMapScores(Map<String, Integer>
> >> scoresMap)
> >> >   {
> >> >    this.scoresMap = scoresMap;
> >> >   }*
> >> >   public Token next(Token t) throws IOException
> >> >   {
> >> >    t = tok.next(t);
> >> >    if(t != null)
> >> >    {
> >> >     //t.setTermBuffer("can change");
> >> >     //Do something with the data
> >> >     byte[] bytes = ("score:" + score).getBytes();
> >> >     //    t.setPayload(new Payload(bytes));
> >> > *    String word = String.copyValueOf(t.termBuffer(), 0,
> >> t.termLength());
> >> >     int score = scoresMap.get(word);
> >> >     byte payLoad = Byte.parseByte(String.valueOf(score));
> >> >     t.setPayload(new Payload(new byte[] { Byte.valueOf(payLoad) }));*
> >> >    }
> >> >    return t;
> >> >   }
> >> >   public void reset(Reader input) throws IOException
> >> >   {
> >> >    tok.reset(input);
> >> >   }
> >> >   public void close() throws IOException
> >> >   {
> >> >    tok.close();
> >> >   }
> >> >  }
> >> > **********************************
> >> > *Example for the use of payloads:*
> >> >
> >> >   PayloadAnalyzer panalyzer = new PayloadAnalyzer();
> >> >   File index = new File("" + "TestSearchIndex");
> >> >   IndexWriter iwriter = new IndexWriter(index, panalyzer);
> >> >   Document d = new Document();
> >> >   d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
> >> > Field.Index.TOKENIZED, Field.TermVector.YES));
> >> >   d.add(new Field("id", "1^3", Field.Store.YES,
> >> Field.Index.UN_TOKENIZED,
> >> > Field.TermVector.NO <http://field.termvector.no/> <
> http://field.termvector.no/>));
> >> >   Map<String, Integer> mapScores = new HashMap<String, Integer>();
> >> >   mapScores.put("word1", 3);
> >> >   mapScores.put("word2", 1);
> >> >   mapScores.put("word3", 1);
> >> >   panalyzer.setMapScores(mapScores);
> >> >   iwriter.addDocument(d, panalyzer);
> >> >   d = new Document();
> >> >   d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
> >> > Field.Index.TOKENIZED, Field.TermVector.YES));
> >> >   d.add(new Field("id", "2^3", Field.Store.YES,
> >> Field.Index.UN_TOKENIZED,
> >> > Field.TermVector.NO <http://field.termvector.no/> <
> http://field.termvector.no/>));
>  >> >   //We set the score for the term of the document that will be
> >> > analyzed.
> >> >   /*I was worried about this part - document dependent score
> >> >   which may be utilized*/
> >> >   mapScores = new HashMap<String, Integer>();
> >> >   mapScores.put("word1", 1);
> >> >   mapScores.put("word2", 3);
> >> >   mapScores.put("word3", 1);
> >> >   panalyzer.setMapScores(mapScores);
> >> >   iwriter.addDocument(d, panalyzer);
> >> >   /*-----------------*/
> >> >   //  iwriter.commit();
> >> >   iwriter.optimize();
> >> >   iwriter.close();
> >> >   BooleanQuery bq = new BooleanQuery();
> >> >   BoostingTermQuery tq = new BoostingTermQuery(new Term("text",
> >> "word1"));
> >> >   tq.setBoost((float) 1.0);
> >> >   bq.add(tq, BooleanClause.Occur.MUST);
> >> >   tq = new BoostingTermQuery(new Term("text", "word2"));
> >> >   tq.setBoost((float) 3);
> >> >   bq.add(tq, BooleanClause.Occur.SHOULD);
> >> >   tq = new BoostingTermQuery(new Term("text", "word3"));
> >> >   tq.setBoost((float) 1);
> >> >   bq.add(tq, BooleanClause.Occur.SHOULD);
> >> >   IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
> >> >   searcher1.setSimilarity(new WordsSimilarity());
> >> >   TopDocs topDocs = searcher1.search(bq, null, 3);
> >> >   Hits hits1 = searcher1.search(bq);
> >> >   for(int j = 0; j < hits1.length(); j++)
> >> >   {
> >> >    Explanation explanation = searcher1.explain(bq, j);
> >> >    System.out.println("**** " + hits1.score(j) + " " +
> >> > hits1.doc(j).getField("id").stringValue() + " *****");
> >> >    System.out.println(explanation.toString());
> >> >    explanation.getValue();
> >> >
> >>
>  System.out.println("********************************************************");
> >> >    System.out.println("Score " + topDocs.scoreDocs[j].score + " doc "
> >> +
> >> > searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
> >> >
> >>
>  System.out.println("********************************************************");
> >> >   }
> >> >
> >> > If you try the same query with differnt boosting, you will get a
> >> different
> >> > order for the documents.
> >> >
> >> > Does it look ok?
> >> >
> >> > Thanks again!
> >> > Liat
> >> > 2009/4/25 Murat Yakici <Murat.Yakici@cis.strath.ac.uk>
> >> >
> >> >>
> >> >>
> >> >> Here is what I am doing, not so magical... There are two classes, an
> >> >> analyzer and an a TokenStream in which I can inject my document
> >> >> dependent
> >> >> data to be stored as payload.
> >> >>
> >> >>
> >> >> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
> >> >>
> >> >>    private class PayloadAnalyzer extends Analyzer {
> >> >>
> >> >>        private PayloadTokenStream payToken = null;
> >> >>        private int score;
> >> >>
> >> >>        public synchronized void setScore(int s) {
> >> >>            score=s;
> >> >>        }
> >> >>
> >> >>      public final TokenStream tokenStream(String field, Reader
> >> reader) {
> >> >>         payToken = new PayloadTokenStream(new
> >> >> LowerCaseTokenizer(reader));
> >> >>         payToken.setScore(score);
> >> >>         return payToken;
> >> >>        }
> >> >>    }
> >> >>
> >> >>    private class PayloadTokenStream extends TokenStream {
> >> >>
> >> >>        private Tokenizer tok = null;
> >> >>        private int score;
> >> >>
> >> >>        public PayloadTokenStream(Tokenizer tokenizer) {
> >> >>            tok = tokenizer;
> >> >>        }
> >> >>
> >> >>        public void setScore(int s) {
> >> >>            score = s;
> >> >>        }
> >> >>
> >> >>        public Token next(Token t) throws IOException {
> >> >>            t = tok.next(t);
> >> >>            if (t != null) {
> >> >>                //t.setTermBuffer("can change");
> >> >>                //Do something with the data
> >> >>                byte[] bytes = ("score:"+ score).getBytes();
> >> >>                t.setPayload(new Payload(bytes));
> >> >>            }
> >> >>            return t;
> >> >>        }
> >> >>
> >> >>        public void reset(Reader input) throws IOException {
> >> >>            tok.reset(input);
> >> >>        }
> >> >>
> >> >>        public void close() throws IOException {
> >> >>            tok.close();
> >> >>        }
> >> >>    }
> >> >>
> >> >>
> >> >>    public void doIndex() {
> >> >>        try {
> >> >>            File index = new File("./TestPayloadIndex");
> >> >>            IndexWriter iwriter = new IndexWriter(index,
> >> >>                     panalyzer,
> >> >>                     IndexWriter.MaxFieldLength.UNLIMITED);
> >> >>
> >> >>            Document d = new Document();
> >> >>            d.add(new Field("content",
> >> >>               "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
> >> >>                Field.Index.ANALYZED, Field.TermVector.YES));
> >> >>            //We set the score for the term of the document that will
> >> be
> >> >> analyzed.
> >> >>            /*I was worried about this part - document dependent score
> >> >> which may be utilized*/
> >> >>            panalyzer.setScore(5);
> >> >>            iwriter.addDocument(d, panalyzer);
> >> >>            /*-----------------*/
> >> >>            ...
> >> >>            iwriter.commit();
> >> >>            iwriter.optimize();
> >> >>            iwriter.close();
> >> >>
> >> >>            //Now read the index
> >> >>            IndexReader ireader = IndexReader.open(index);
> >> >>            TermPositions tpos = ireader.termPositions(
> >> >>                                  new Term("content","myterm"));//Note
> >> >> LowercaseTokenizer
> >> >>            while (tpos.next()) {
> >> >>                int pos;
> >> >>                for(int i=0;i<tpos.freq();i++){
> >> >>                    pos=tpos.nextPosition();
> >> >>                    if (tpos.isPayloadAvailable()) {
> >> >>                        byte[] data = new
> >> byte[tpos.getPayloadLength()];
> >> >>                        tpos.getPayload(data, 0);
> >> >>                       //Utilise payloads;
> >> >>                    }
> >> >>                }
> >> >>            }
> >> >>
> >> >>            tpos.close();
> >> >>        } catch (CorruptIndexException ex) {
> >> >>           //
> >> >>        } catch (LockObtainFailedException ex) {
> >> >>            //
> >> >>        } catch (IOException ex) {
> >> >>            //
> >> >>        }
> >> >>    }
> >> >>
> >> >> I wish it was designed better... Please let me know if you guys have
> >> a
> >> >> better idea.
> >> >>
> >> >> Cheers,
> >> >> Murat
> >> >>
> >> >> > Dear Murat,
> >> >> >
> >> >> > I saw your question and wondered how did you implement these
> >> changes?
> >> >> > The requirement below are the same ones as I am trying to code
now.
> >> >> > Did you modify the source code itself or only used Lucene's jar
and
> >> >> just
> >> >> > override code?
> >> >> >
> >> >> > I would very much apprecicate if you could give me a short
> >> explanation
> >> >> on
> >> >> > how was it done.
> >> >> >
> >> >> > Thanks a lot,
> >> >> > Liat
> >> >> >
> >> >> > 2009/4/21 Murat Yakici <murat.yakici@cis.strath.ac.uk>
> >> >> >
> >> >> >> Hi,
> >> >> >> I started playing with the experimental payload functionality.
I
> >> have
> >> >> >> written an analyzer which adds a payload (some sort of a
> >> score/boost)
> >> >> >> for
> >> >> >> each term occurance. The payload/score for each term is dependent
> >> on
> >> >> the
> >> >> >> document that the term comes from (I guess this is the typoical
> >> use
> >> >> >> case).
> >> >> >> So say term t1 may have a payload of 5 in doc1 and 34 in doc5.
The
> >> >> >> parameter
> >> >> >> for calculating the payload changes after each
> >> >> >> indexWriter.addDocument(..)
> >> >> >> method call in a while loop. I am assuming that the
> >> >> >> indexWriter.addDocument(..) methods are thread safe. Can I
confirm
> >> >> this?
> >> >> >>
> >> >> >> Cheers,
> >> >> >>
> >> >> >> --
> >> >> >> Murat Yakici
> >> >> >> Department of Computer & Information Sciences
> >> >> >> University of Strathclyde
> >> >> >> Glasgow, UK
> >> >> >> -------------------------------------------
> >> >> >> The University of Strathclyde is a charitable body, registered
in
> >> >> >> Scotland,
> >> >> >> with registration number SC015263.
> >> >> >>
> >> >> >>
> >> >> >>
> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >> Murat Yakici
> >> >> Department of Computer & Information Sciences
> >> >> University of Strathclyde
> >> >> Glasgow, UK
> >> >> -------------------------------------------
> >> >> The University of Strathclyde is a charitable body, registered in
> >> >> Scotland,
> >> >> with registration number SC015263.
> >> >>
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >>
> >> >>
> >> >
> >>
> >>
> >> Murat Yakici
> >> Department of Computer & Information Sciences
> >> University of Strathclyde
> >> Glasgow, UK
> >> -------------------------------------------
> >> The University of Strathclyde is a charitable body, registered in
> >> Scotland,
> >> with registration number SC015263.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
> Murat Yakici
> Department of Computer & Information Sciences
> University of Strathclyde
> Glasgow, UK
> -------------------------------------------
> The University of Strathclyde is a charitable body, registered in Scotland,
> with registration number SC015263.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message