lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-794) Beginnings of a span based highlighter
Date Sun, 04 Feb 2007 23:37:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470098 ]

Mark Miller commented on LUCENE-794:
------------------------------------

Sorry about all that, Mark H. This was literally just some test code that I quickly shoved
into an API similar to your existing highlighter. If you decide that it should be considered
on its own, I would certainly have quite a bit further to go. Mostly I just put
it up for your evaluation on extending the current highlighter with this highlight method.

> 1) Fieldname "contents" shouldn't be hardcoded into the Highlighter - different analyzers
> can behave differently for different fields (see PerFieldAnalyzerWrapper). Either pass
> a fieldname parameter or do as the existing highlighter does and take a TokenStream. The latter
> approach has the advantage of being able to avoid re-analysis and make use of any stored
> TermVectors (see TokenSources.java)

I don't have a great solution for this right now. I need to read the TokenStream at least
twice because the MemoryIndex is used to extract the spans. Unfortunately, it seems I can copy the
tokens to a list or pass them to the MemoryIndex -- I cannot do both. The MemoryIndex is also
looking for a field name...so while I changed the API to take a TokenStream, I have not resolved
the need for the field name as well. I am hoping you have some good comments. To get around reading
the TokenStream twice I used the horribly hacky but quick-for-me approach of adding a method
to MemoryIndex that accepts a List of Tokens. Any ideas?
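
A minimal sketch of the token-caching idea described above, written against stand-in types rather than the real Lucene classes. SimpleToken, TokenReplay and the whitespace tokenize helper are hypothetical names invented for illustration, and nothing here reflects the actual TokenStream or MemoryIndex signatures. It only shows the general pattern: drain a read-once stream into a List, then walk that List twice, once for a stand-in indexing pass and once for highlighting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a token: term text plus character offsets. */
final class SimpleToken {
    final String text;
    final int startOffset;
    final int endOffset;

    SimpleToken(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class TokenReplay {
    /** Whitespace tokenizer returning a read-once iterator (stand-in for a token stream). */
    static Iterator<SimpleToken> tokenize(String text) {
        List<SimpleToken> out = new ArrayList<SimpleToken>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace, then consume one run of non-whitespace characters.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new SimpleToken(text.substring(start, i), start, i));
        }
        return out.iterator();
    }

    /** Drain the read-once stream into a list so it can be traversed more than once. */
    static List<SimpleToken> cache(Iterator<SimpleToken> stream) {
        List<SimpleToken> cached = new ArrayList<SimpleToken>();
        while (stream.hasNext()) cached.add(stream.next());
        return cached;
    }

    /** First pass: collect the terms, standing in for handing tokens to an index. */
    static List<String> terms(List<SimpleToken> tokens) {
        List<String> terms = new ArrayList<String>();
        for (SimpleToken t : tokens) terms.add(t.text);
        return terms;
    }

    /** Second pass: rebuild the text, wrapping tokens equal to the given term in <B> tags. */
    static String highlight(String text, List<SimpleToken> tokens, String term) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (SimpleToken t : tokens) {
            sb.append(text, last, t.startOffset); // text between tokens
            if (t.text.equals(term)) {
                sb.append("<B>").append(t.text).append("</B>");
            } else {
                sb.append(t.text);
            }
            last = t.endOffset;
        }
        sb.append(text.substring(last)); // trailing text after the last token
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "span based highlighter";
        List<SimpleToken> tokens = cache(tokenize(text)); // stream is consumed exactly once
        System.out.println(terms(tokens));                // pass 1 over the cached list
        System.out.println(highlight(text, tokens, "based")); // pass 2 over the same list
    }
}
```

The cost of this pattern is holding every token in memory at once, which may matter for large documents.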

> 2) Analyzers which produce overlapping tokens (see Synonym analyzer in existing highlighter
> Junit test) are problematic in the existing code. I remember the "TokenGroup" class in the
> existing highlighter was an approach to help cater for these "overlap" scenarios.

I always attack this last <G>. It seems a simple fix: if the position increment equals 0, skip
printing out the token. It passes your test, which I have added to my test code, but I am not
totally confident it is perfect yet.
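
A rough sketch of the position-increment check described above, again using stand-in types and not the real Lucene API. PosToken and OverlapSkippingHighlighter are hypothetical names, and the <B> markup is an assumption made for the example. A token with a position increment of 0 occupies the same position as the token before it (a stacked synonym, for instance), so printing it would emit the same stretch of source text a second time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a token that carries a position increment. */
final class PosToken {
    final String text;
    final int startOffset;
    final int endOffset;
    final int positionIncrement;

    PosToken(String text, int startOffset, int endOffset, int positionIncrement) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }
}

final class OverlapSkippingHighlighter {
    /** Rebuild the text, highlighting tokens equal to term and skipping stacked tokens. */
    static String highlight(String text, List<PosToken> tokens, String term) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (PosToken t : tokens) {
            // Increment 0 means "same position as the previous token": do not print it again.
            if (t.positionIncrement == 0) continue;
            sb.append(text, last, t.startOffset);
            String original = text.substring(t.startOffset, t.endOffset);
            if (t.text.equals(term)) {
                sb.append("<B>").append(original).append("</B>");
            } else {
                sb.append(original);
            }
            last = t.endOffset;
        }
        sb.append(text.substring(last));
        return sb.toString();
    }

    public static void main(String[] args) {
        List<PosToken> tokens = new ArrayList<PosToken>();
        tokens.add(new PosToken("football", 0, 8, 1));
        tokens.add(new PosToken("soccer", 0, 8, 0)); // synonym stacked on "football"
        tokens.add(new PosToken("match", 9, 14, 1));
        System.out.println(highlight("football match", tokens, "football"));
    }
}
```

One limitation of this sketch: a query that matches only the skipped synonym ("soccer" here) would go unhighlighted, so the check alone does not cover every overlap case.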

> 3) Without wishing to resurrect the whole 1.4 vs 1.5 debate I believe Lucene still targets
> Java 1.4.

Just me being lazy. I swear I have seen Contrib stuff that says 1.5. I have gone through and
stripped out all of the 1.5 features except for StringBuilder for the moment.

> To rectify these points it's not clear to me if it would be quicker to use your code or
> adapt the existing highlighter code to use spans.
> Thoughts?

Depends entirely on what you think. I am sure I can fix all of the issues you mention (with
a little advice <G>), but I am pretty new to this type of thing and perhaps you just
want to start from scratch in order to achieve span highlighting with the existing highlighter.
It may just be that the way I am doing this is not very compatible with the way you currently
fragment and score.

I have added an updated Highlighter.java and HighlighterTest.java. The MemoryIndex problem
remains...so it either has to be fixed or the modified MemoryIndex must be used.

- Mark m

> Beginnings of a span based highlighter
> --------------------------------------
>
>                 Key: LUCENE-794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: DefaultEncoder.java, Encoder.java, Formatter.java, Highlighter.java,
>                      Highlighter.java, HighlighterTest.java, HighlighterTest.java, MemoryIndex.java,
>                      QuerySpansExtractor.java, SimpleFormatter.java
>
>
> This is some test code to start the work of adding a span based highlighting approach
> to the existing highlighter in contrib. See http://issues.apache.org/jira/browse/LUCENE-403
> for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

