lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Reading Payloads
Date Tue, 23 Apr 2013 11:21:30 GMT
Actually, term vectors can store payloads now (LUCENE-1888), so if that
field was indexed with FieldType.setStoreTermVectorPayloads(true), they
should be there.

But I suspect the TokenSources.getTokenStream API (which, I think, un-inverts
the term vectors to recreate the token stream, and so is probably very slow)
was never updated to carry the payloads through as well.
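
To illustrate what un-inverting means here, a minimal sketch in plain Java
follows. It is a toy model, not Lucene code: the Occurrence and Token classes,
the uninvert method, and the carryPayloads flag are all hypothetical names,
and a Map stands in for the per-document term vector. A term vector is keyed
by term, so recreating the token stream means flattening every occurrence and
sorting by position. If that step does not copy the payloads, the consumer
sees null even though the payloads are stored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy model only: plain Java collections standing in for a per-document
// term vector. None of these types are Lucene classes.
class UninvertSketch {

    /** One occurrence of a term inside a single document. */
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;
        final byte[] payload; // null if no payload was stored

        Occurrence(int position, int startOffset, int endOffset, byte[] payload) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.payload = payload;
        }
    }

    /** One token of the recreated (un-inverted) stream. */
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;
        final byte[] payload;

        Token(String term, Occurrence o, byte[] payload) {
            this.term = term;
            this.position = o.position;
            this.startOffset = o.startOffset;
            this.endOffset = o.endOffset;
            this.payload = payload;
        }
    }

    /**
     * Flattens a term -> occurrences map back into document order.
     * carryPayloads is a hypothetical switch: false models an un-inverter
     * that copies terms and offsets but was never taught about payloads.
     */
    static List<Token> uninvert(Map<String, List<Occurrence>> termVector,
                                boolean carryPayloads) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, List<Occurrence>> entry : termVector.entrySet()) {
            for (Occurrence o : entry.getValue()) {
                byte[] payload = carryPayloads ? o.payload : null;
                tokens.add(new Token(entry.getKey(), o, payload));
            }
        }
        // The term vector is ordered by term, so token order has to be
        // restored by sorting on position.
        tokens.sort(Comparator.comparingInt((Token t) -> t.position));
        return tokens;
    }
}
```

With carryPayloads set to false, every recreated token has a null payload
while terms and offsets are intact, which matches the symptom in the original
question. Whether TokenSources actually behaves this way is only the suspicion
voiced above.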

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 23, 2013 at 7:10 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> TermVectors are per-document and do not contain payloads. You are reading
> the per-document TermVectors, which are a "small index" *stored* for each
> document as a binary blob. This blob only contains the terms of this
> document with their positions/offsets, but no payloads (offsets are used e.g.
> for highlighting).
>
> To retrieve payloads, you have to use the main TermsEnum and main posting
> lists, but this does *not* work per document. In general you would execute
> a query and then retrieve the payload for each hit while iterating the
> scorer (e.g. function queries can do this).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Carsten Schnober [mailto:schnober@ids-mannheim.de]
> > Sent: Tuesday, April 23, 2013 1:04 PM
> > To: java-user
> > Subject: Reading Payloads
> >
> > Hi,
> > I'm trying to extract payloads from an index for specific tokens the
> > following way (inserting sample document number and term):
> >
> > Terms terms = reader.getTermVector(16504, "term");
> > TokenStream tokenstream = TokenSources.getTokenStream(terms);
> > while (tokenstream.incrementToken()) {
> >   OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
> >   int start = offset.startOffset();
> >   int end = offset.endOffset();
> >   String token = tokenstream.getAttribute(CharTermAttribute.class).toString();
> >
> >   PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
> >   BytesRef payloadBytes = payloadAttr.getPayload();
> >
> >   ...
> > }
> >
> > This works fine for the OffsetAttribute and the CharTermAttribute, but
> > payloadAttr.getPayload() always returns null for all documents and all
> > tokens, unfortunately. However, I know that the payloads are stored in
> > the index, as I can retrieve them through a SpanQuery with
> > Spans.getPayload(). I actually expect every token to carry a payload, as
> > my custom tokenizer implementation has the following lines:
> >
> > public class KoraTokenizer extends Tokenizer {
> >   ...
> >   private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
> >   ...
> >   public boolean incrementToken() {
> >     ...
> >     payloadAttr.setPayload(new BytesRef(payloadString));
> >     ...
> >   }
> >   ...
> > }
> >
> > I've asserted that the payloadString variable is never an empty String
> > and, as I said above, I can retrieve the payloads with
> > Spans.getPayload(). So what am I doing wrong in my
> > tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I used
> > tokenstream.getAttribute() before, as for the other attributes, but this
> > threw an IllegalArgumentException, so I implemented the recommendation
> > given in the documentation and replaced it with addAttribute().
> >
> > Thanks!
> > Carsten
> >
> >
> >
> >
> > --
> > Institut für Deutsche Sprache | http://www.ids-mannheim.de
> > Projekt KorAP                 | http://korap.ids-mannheim.de
> > Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> > Korpusanalyseplattform der nächsten Generation Next Generation Corpus
> > Analysis Platform
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
