lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Reading Payloads
Date Tue, 23 Apr 2013 11:10:52 GMT
TermVectors are per-document and do not contain payloads. You are reading the per-document
TermVectors, which are a "small index" *stored* for each document as a binary blob. This blob
contains only the terms of that document with their positions/offsets, but no payloads (offsets
are used e.g. for highlighting).

To retrieve payloads, you have to use the main TermsEnum and main posting lists, but this
does *not* work per document. In general you would execute a query and then retrieve the payload
for each hit while iterating the scorer (e.g. function queries can do this).
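
To make the distinction concrete, here is a toy model in plain Java. It is only a
sketch: none of these classes or methods are Lucene APIs, and all names (PayloadLookupSketch,
Occurrence, index, payloads, termVectorPositions) are made up for illustration. It shows a
per-document "term vector" that keeps positions only, next to main postings that keep the
payload with every position. To get payloads for one term in one document, you start from the
term and advance to the document. In Lucene 4.x the real counterpart should be a
DocsAndPositionsEnum obtained for the term with payloads requested, but please check the
Javadoc of your version for the exact calls.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these are NOT Lucene classes, all names are hypothetical.
class PayloadLookupSketch {

    /** One occurrence of a term in a document: its position and its payload bytes. */
    static final class Occurrence {
        final int position;
        final byte[] payload;

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    // "Main postings": term -> (docId -> occurrences), documents sorted by id.
    private final Map<String, TreeMap<Integer, List<Occurrence>>> postings = new TreeMap<>();

    // "Term vectors": docId -> (term -> positions). Payloads are deliberately not kept here.
    private final Map<Integer, Map<String, List<Integer>>> termVectors = new TreeMap<>();

    /** Adds one token occurrence to both structures. */
    void index(int docId, String term, int position, String payload) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(new Occurrence(position, payload.getBytes(StandardCharsets.UTF_8)));
        termVectors.computeIfAbsent(docId, d -> new TreeMap<>())
                .computeIfAbsent(term, t -> new ArrayList<>())
                .add(position);
    }

    /** Per-document view: positions only, there is no payload to return. */
    List<Integer> termVectorPositions(int docId, String term) {
        Map<String, List<Integer>> vector = termVectors.get(docId);
        if (vector == null || !vector.containsKey(term)) {
            return Collections.emptyList();
        }
        return vector.get(term);
    }

    /** Term-first view: look up the term, advance to the document, read each payload. */
    List<String> payloads(String term, int docId) {
        List<String> result = new ArrayList<>();
        TreeMap<Integer, List<Occurrence>> docs = postings.get(term);
        if (docs == null) {
            return result; // term does not occur in the index
        }
        // "advance": first document with id >= docId that contains the term
        Map.Entry<Integer, List<Occurrence>> hit = docs.ceilingEntry(docId);
        if (hit == null || hit.getKey() != docId) {
            return result; // term does not occur in this document
        }
        for (Occurrence occ : hit.getValue()) {
            result.add(new String(occ.payload, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        PayloadLookupSketch idx = new PayloadLookupSketch();
        idx.index(16504, "term", 0, "NN");
        idx.index(16504, "term", 7, "VB");
        System.out.println(idx.termVectorPositions(16504, "term"));
        System.out.println(idx.payloads("term", 16504));
    }
}
```

The point of the model is the direction of the lookup: the payload lives with the term's
postings, so you always enter through the term and only then narrow down to a document.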

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Carsten Schnober [mailto:schnober@ids-mannheim.de]
> Sent: Tuesday, April 23, 2013 1:04 PM
> To: java-user
> Subject: Reading Payloads
> 
> Hi,
> I'm trying to extract payloads from an index for specific tokens the following
> way (inserting sample document number and term):
> 
> Terms terms = reader.getTermVector(16504, "term");
> TokenStream tokenstream = TokenSources.getTokenStream(terms);
> while (tokenstream.incrementToken()) {
>   OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
>   int start = offset.startOffset();
>   int end = offset.endOffset();
>   String token = tokenstream.getAttribute(CharTermAttribute.class).toString();
> 
>   PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
>   BytesRef payloadBytes = payloadAttr.getPayload();
> 
>   ...
> }
> 
> This works fine for the OffsetAttribute and the CharTermAttribute, but
> payloadAttr.getPayload() always returns null for all documents and all
> tokens, unfortunately. However, I know that the payloads are stored in the
> index as I can retrieve them through a SpanQuery with Spans.getPayload(). I
> actually expect every token to carry a payload, as my custom tokenizer
> implementation has the following lines:
> 
> public class KoraTokenizer extends Tokenizer {
>   ...
>   private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
>   ...
>   public boolean incrementToken() {
>     ...
>     payloadAttr.setPayload(new BytesRef(payloadString));
>     ...
>   }
>   ...
> }
> 
> I've asserted that the payloadString variable is never an empty String and, as I
> said above, I can retrieve the payloads with Spans.getPayload(). So what am I
> doing wrong in my tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I
> used tokenstream.getAttribute() before, as for the other attributes, but this
> threw an IllegalArgumentException, so I followed the recommendation given in
> the documentation and replaced it with addAttribute().
> 
> Thanks!
> Carsten
> 
> 
> 
> 
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



