lucene-java-user mailing list archives

From Igor Shalyminov <ishalymi...@yandex-team.ru>
Subject Re: Indexing documents with multiple field values
Date Wed, 02 Oct 2013 17:26:06 GMT
Hi again!

Here is my problem in more detail: in addition to being indexed, the multi-valued field needs
to be stored as-is. But if I pass it to the analyzer as multiple atomic tokens, only the first
of them gets stored.
What do I need to change in my custom analyzer so that it eventually stores all the atomic
tokens concatenated?
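One thing worth noting: in Lucene, a field's stored value is fixed when the Field object is created and is not derived from the analyzer's token stream, so the concatenated form can be built before the field is added to the document. A minimal, Lucene-free sketch of that concatenation step (the joinTokens helper is illustrative; '\t' matches the separator already used when feeding the custom analyzer):

```java
import java.util.Arrays;
import java.util.List;

public class TokenJoiner {
    // Joins the atomic compound tokens back into a single string that can
    // be used as the field's stored value, independent of analysis.
    static String joinTokens(List<String> tokens) {
        return String.join("\t", tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("word1|7|1", "word2|3|0", "word3|1|1");
        System.out.println(joinTokens(tokens));
    }
}
```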

-- 
Igor

27.09.2013, 18:12, "Igor Shalyminov" <ishalyminov@yandex-team.ru>:
> Hello!
>
> I have really long document field values. The tokens of these fields are of the form: word|payload|position_increment.
> (I need to control position increments and payloads manually.)
> I collect these compound tokens for the entire document, join them with '\t', and then pass the resulting string to my custom analyzer.
> (For really long field strings, something breaks in UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
>
> The analyzer is just the following:
>
> class AmbiguousTokenAnalyzer extends Analyzer {
>     private PayloadEncoder encoder = new IntegerEncoder();
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>         Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
>         TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
>         sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
>         sink.addAttribute(OffsetAttribute.class);
>         sink.addAttribute(CharTermAttribute.class);
>         sink.addAttribute(PayloadAttribute.class);
>         sink.addAttribute(PositionIncrementAttribute.class);
>         return new TokenStreamComponents(source, sink);
>     }
> }
>
> CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each have an 'incrementToken'
> method in which the rightmost "|aaa" part of a token is processed.
>
> The field is configured as:
>         attributeFieldType.setIndexed(true);
>         attributeFieldType.setStored(true);
>         attributeFieldType.setOmitNorms(true);
>         attributeFieldType.setTokenized(true);
>         attributeFieldType.setStoreTermVectorOffsets(true);
>         attributeFieldType.setStoreTermVectorPositions(true);
>         attributeFieldType.setStoreTermVectors(true);
>         attributeFieldType.setStoreTermVectorPayloads(true);
>
> The problem is: if I pass the analyzer the field itself (one huge string, via document.add(...)),
> it works OK, but if I pass it token after token, something breaks at the search stage.
> As I read somewhere, these two ways should be equivalent from the resulting index's point
> of view. Maybe my analyzer misses something?
>
> --
> Best Regards,
> Igor Shalyminov
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
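The word|payload|position_increment token format described above can be split by peeling off the two rightmost '|'-delimited parts, which mirrors what the custom filters do inside incrementToken(). A self-contained sketch of that parsing step (class and field names are illustrative, not the actual filter code):

```java
public class CompoundToken {
    final String term;
    final int payload;
    final int posIncrement;

    CompoundToken(String term, int payload, int posIncrement) {
        this.term = term;
        this.payload = payload;
        this.posIncrement = posIncrement;
    }

    // Parses "word|payload|position_increment" from the right, so that
    // '|' characters never need to be escaped inside the term itself.
    static CompoundToken parse(String token) {
        int p2 = token.lastIndexOf('|');
        int p1 = token.lastIndexOf('|', p2 - 1);
        String term = token.substring(0, p1);
        int payload = Integer.parseInt(token.substring(p1 + 1, p2));
        int posInc = Integer.parseInt(token.substring(p2 + 1));
        return new CompoundToken(term, payload, posInc);
    }

    public static void main(String[] args) {
        CompoundToken t = CompoundToken.parse("word|42|1");
        System.out.println(t.term + " " + t.payload + " " + t.posIncrement);
    }
}
```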


