lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-2508) Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter
Date Thu, 09 May 2013 23:06:07 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2508:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
    
> Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2508
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2508
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/highlighter
>         Environment: irrelevant
>            Reporter: Edward Drapkin
>            Priority: Minor
>              Labels: highlight, search
>             Fix For: 4.4
>
>         Attachments: LUCENE-2508.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Originally, I had planned to create a contrib module to allow people to highlight multiple
documents in parallel, but after talking to Uwe in IRC about it, I realized that it was pretty
useless.  However, I was already sitting on an iterative highlighting algorithm that was much
faster (my tests show 20% - 40%) and more accurate and, based on that same IRC conversation,
I decided not to let all the work that I had done go to waste and to try to contribute it
back.  Uwe had mentioned that "More like this" detects term vectors when called and uses
the term vector implementation when possible, if I recall correctly, so I decided to do that.
> The patch that I've attached is my first stab at this.  It's not nearly complete, and
full disclosure dictates that I say that it's not fully documented and there are no unit
tests written yet.  I wanted to go ahead and open an issue to get some feedback on the approach
that I've taken, and because having an open issue will be a proverbial kick in the pants to
continue working on it.
> In short, what I've changed:
> * Completely rewrote the non-tv highlighter to be faster and cleaner.  There is some
small loss in functionality for now, namely the loss of the GradientHighlighter (I just haven't
done this yet) and the lack of exposure of TermFragments and their scores (I can expose this
if it is deemed necessary; this is one of the things I'd like feedback on).
> * Moved org.apache.lucene.search.vectorhighlight and org.apache.lucene.search.highlight
to a single package with a unified interface, search.highlight (with two sub-packages: search.highlight.termvector
and search.highlight.iterative, respectively).
> * Unified the highlighted term formatting into a single interface: highlighter/Formatter
and both highlighters use this now.  
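
To make the shared-formatter idea concrete, here is a minimal sketch of one formatting contract used by any highlighter. Every name in it (FormatterSketch, Formatter, TagFormatter, highlight) is hypothetical and chosen for illustration; none of it is taken from the attached patch or from the existing Lucene API, and the word matching is deliberately naive.

```java
/** Hypothetical sketch; these names are assumptions, not the patch's API. */
final class FormatterSketch {

    /** Single formatting contract that every highlighter would call. */
    interface Formatter {
        String format(String term);
    }

    /** Wraps a matched term in a configurable pair of tags. */
    static final class TagFormatter implements Formatter {
        private final String open;
        private final String close;

        TagFormatter(String open, String close) {
            this.open = open;
            this.close = close;
        }

        @Override
        public String format(String term) {
            return open + term + close;
        }
    }

    /**
     * Stand-in for a highlighter: splits on single spaces and formats
     * every token equal to the term, ignoring case.
     */
    static String highlight(String text, String term, Formatter formatter) {
        StringBuilder out = new StringBuilder();
        String[] tokens = text.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean match = tokens[i].equalsIgnoreCase(term);
            out.append(match ? formatter.format(tokens[i]) : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Formatter bold = new TagFormatter("<b>", "</b>");
        System.out.println(highlight("fast search engine", "search", bold));
    }
}
```

Because the highlighter only depends on the Formatter interface, swapping the tag style does not touch the matching code, which is the point of consolidating the tag generation.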
> What I need to do before I personally would consider this finished:
> * Finish documentation, most specifically on TermVectorHighlighter.  I haven't done this
now as I expect things to change up quite a bit before they're finalized and I really hate
writing documentation that goes to waste, but I do intend to complete this bullet :)
> * "Flesh out" the API of search.highlight.Highlighter as it's very barebones right now
> * Continue removing and consolidating duplicate functionality, like I've done with the
highlighted word tag generation.
> What I think I need feedback on, before I can proceed:
> * FastTermVectorHighlighter and the iterative highlighters need completely different
sets of information in order to work.  The approach I've taken is to expose a vectorHighlight
method and an iterativeHighlight method in the unified interface, as well as a single highlight
method that takes all the information needed for either of them.  I'm unsure if this is
the best way to do this.
> * The naming of things; I'm not sure if this is a big issue, or even an issue at all,
but I'd like to not break any conventions that may exist that I'm unaware of.
> * How big of a deal is exposing the particular score of a segment from the highlighting
interface and does this need to be extended into the term vector highlighting as well?
> * There are a lot of methods in the tv implementation that are marked deprecated; since
this release will almost definitely break backwards compatibility anyway, are these safe to
remove?
> * Any other input anyone else may have :)
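
The three-method shape raised in the first feedback question above (a vectorHighlight method, an iterativeHighlight method, and one highlight method that accepts the inputs for both) could be sketched roughly as follows. This is an assumption-laden illustration, not the patch: the parameter types, the use of a plain list of [start, end) offsets as a stand-in for term vector data, the bracket markers, and the class names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a unified interface; not the patch's actual API. */
final class UnifiedHighlighterSketch {

    interface Highlighter {
        /** Uses precomputed [start, end) offsets (stand-in for term vectors). */
        String vectorHighlight(String text, List<int[]> offsets);

        /** Scans the text itself for the term. */
        String iterativeHighlight(String text, String term);

        /** Takes the inputs for both paths and picks one. */
        String highlight(String text, String term, List<int[]> offsets);
    }

    static final class SimpleHighlighter implements Highlighter {

        @Override
        public String vectorHighlight(String text, List<int[]> offsets) {
            return render(text, offsets);
        }

        @Override
        public String iterativeHighlight(String text, String term) {
            List<int[]> found = new ArrayList<>();
            if (term.isEmpty()) {
                return text;
            }
            int at = text.indexOf(term);
            while (at >= 0) {
                found.add(new int[] {at, at + term.length()});
                at = text.indexOf(term, at + term.length());
            }
            return render(text, found);
        }

        @Override
        public String highlight(String text, String term, List<int[]> offsets) {
            // Prefer the offset-based path when offsets were supplied.
            if (offsets != null && !offsets.isEmpty()) {
                return vectorHighlight(text, offsets);
            }
            return iterativeHighlight(text, term);
        }

        /** Assumes spans are sorted and do not overlap. */
        private static String render(String text, List<int[]> spans) {
            StringBuilder out = new StringBuilder();
            int pos = 0;
            for (int[] span : spans) {
                out.append(text, pos, span[0]);
                out.append('[').append(text, span[0], span[1]).append(']');
                pos = span[1];
            }
            out.append(text, pos, text.length());
            return out.toString();
        }
    }
}
```

One trade-off this sketch makes visible: the combined highlight method has to carry arguments that one of the two paths ignores, which is the design concern the question is asking about.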
> I'm going to continue working on the things that I can, at least unless someone tells
me I'm wasting my time, and I look forward to hearing your feedback! :)
> As a side note, because it does seem rather random that I would arbitrarily rewrite a
working algorithm in the non-tv highlighter: I did it originally because I wanted to parallelize
the highlighting (which was a failed experiment) and simply to see if I could make the algorithm
faster, as I find that sort of thing particularly fun :)
> As a second sidenote, if anyone would like an explanation of the algorithm for the highlighting
I devised, and why I feel that it's more accurate, I'd be happy to provide them with one (and
benchmarks as well).
> Thanks,
> Eddie

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

