lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
Date Wed, 06 Apr 2011 22:18:06 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016580#comment-13016580 ]

Uwe Schindler commented on SOLR-2400:
-------------------------------------

Hi Stefan,

Sorry for missing your last response.

About the raw term: Solr currently shows the raw term only if the term is binary (like
numerics) or when the FieldType does some transformation (as with the deprecated Sortable*
fields). I only mentioned it as an example of attributes that I was missing in your example
output. It is of no use for solving your problem.

I already mentioned:
{quote}One possibility to handle this might be the char offset in the original text: the
request handler could use the character offsets of the begin and end of the token in the
original stream instead of the token position, but this is likely to break for lots of
TokenFilters (WordDelimiterFilter would work as long as you don't do stemming before...).
The problem is incorrect handling of offset calculation (also leading to bugs in highlighting)
when the inserted terms are longer than their originals.{quote}

This (using the OffsetAttribute) might be your only chance, but it is likely to break. What
you want is not possible with the analysis API of Lucene, because some information is missing:
the absolute positions are not important for the indexer, so they are not needed during
analysis and TokenStreams don't preserve them.
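
To illustrate the offset idea, here is a rough sketch in plain Java. It deliberately uses
stand-in types instead of the Lucene classes: Tok and relate are hypothetical names, and the
start/end fields only play the role that OffsetAttribute would play. It relates each token of
a later analysis stage to the earlier token whose character span contains it, and returns -1
when the offsets no longer fit, which is the failure case described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; this is not the Lucene analysis API.
final class OffsetRelationSketch {

    /** A term plus the character span it claims in the original text. */
    static final class Tok {
        final String term;
        final int start; // inclusive char offset
        final int end;   // exclusive char offset

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * For each token of the later stage, returns the index of the first token of the
     * earlier stage whose span contains it, or -1 if no earlier span contains it.
     */
    static List<Integer> relate(List<Tok> before, List<Tok> after) {
        List<Integer> parents = new ArrayList<>();
        for (Tok out : after) {
            int parent = -1;
            for (int i = 0; i < before.size(); i++) {
                Tok in = before.get(i);
                if (in.start <= out.start && out.end <= in.end) {
                    parent = i;
                    break;
                }
            }
            parents.add(parent);
        }
        return parents;
    }
}
```

For the text "the wi-fi router", a token "wi" with span 4-6 and a token "fi" with span 7-9 both
relate to "wi-fi" (span 4-9). An inserted term that reports a span longer than its original
finds no parent, so the relation is lost.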

A possibility to preserve the original positions would be a trick in the analysis RequestHandler:
it could insert a fake TokenFilter directly after the Tokenizer that adds an additional Attribute
with the absolute position (incremented on each call to input.incrementToken()). This could
be a hack to achieve what you want.
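
A minimal sketch of that trick, again in plain Java with stand-in types rather than real
Lucene TokenFilters and Attributes. All names here (Tok, absPos, tagAbsolutePositions and the
two toy filters) are hypothetical, and starting the counter at 0 is an assumption. The point
is only that a tag attached right after the tokenizer survives later filters, so dropped and
split tokens can still be traced back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for a Tokenizer and TokenFilters; not the Lucene API.
final class AbsolutePositionSketch {

    /** A term plus the absolute position it had directly after the tokenizer (-1 = untagged). */
    static final class Tok {
        final String term;
        final int absPos;

        Tok(String term, int absPos) {
            this.term = term;
            this.absPos = absPos;
        }

        @Override
        public String toString() {
            return term + "@" + absPos;
        }
    }

    /** Stand-in tokenizer: splits on whitespace and emits untagged tokens. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        for (String s : text.trim().split("\\s+")) {
            if (!s.isEmpty()) {
                out.add(new Tok(s, -1));
            }
        }
        return out;
    }

    /** The "fake filter": tags every token with a counter incremented once per token. */
    static List<Tok> tagAbsolutePositions(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int counter = 0;
        for (Tok t : in) {
            out.add(new Tok(t.term, counter++));
        }
        return out;
    }

    /** Toy stop filter: drops tokens, leaving the tags of the survivors untouched. */
    static List<Tok> dropStopWords(List<Tok> in, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!stopWords.contains(t.term)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Toy word-delimiter filter: each part of a split token inherits the original tag. */
    static List<Tok> splitOnHyphen(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (String part : t.term.split("-")) {
                if (!part.isEmpty()) {
                    out.add(new Tok(part, t.absPos));
                }
            }
        }
        return out;
    }

    /** Runs the whole toy chain and renders each token as "term@absolutePosition". */
    static List<String> analyze(String text) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an"));
        List<Tok> toks = splitOnHyphen(
                dropStopWords(tagAbsolutePositions(tokenize(text)), stopWords));
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            out.add(t.toString());
        }
        return out;
    }
}
```

Calling analyze("the quick wi-fi") yields quick@1, wi@2, fi@2: the gap at position 0 shows
that "the" was dropped, and the shared tag 2 shows that "wi" and "fi" come from one token.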

Maybe I can help you. This needs some refactoring in the AnalysisRequestHandlers, but it might
be a good idea.

> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
>                 Key: SOLR-2400
>                 URL: https://issues.apache.org/jira/browse/SOLR-2400
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Stefan Matheis (steffkes)
>            Priority: Minor
>         Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML output (simplified example attached) is missing one small piece of information
> that could be very useful for building a nice analysis output: the "token relation" (if
> there is a special/correct word for this, please correct me).
> That is, it is currently not possible to "follow" the analysis process completely when
> the Tokenizers/Filters drop tokens (e.g. StopWord) or split one into multiple tokens
> (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible to create
> an improved Analysis page for the new Solr Admin (SOLR-2399); short scribble attached.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

