lucene-dev mailing list archives

From "Edward Drapkin (JIRA)" <>
Subject [jira] Created: (LUCENE-2508) Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter
Date Tue, 22 Jun 2010 07:14:54 GMT
Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter

                 Key: LUCENE-2508
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/highlighter
         Environment: irrelevant
            Reporter: Edward Drapkin
            Priority: Minor
             Fix For: 4.0
         Attachments: LUCENE-2508.patch

Originally, I had planned to create a contrib module to allow people to highlight multiple
documents in parallel, but after talking to Uwe in IRC about it, I realized that it was pretty
useless.  However, I was already sitting on an iterative highlighting algorithm that was much
faster (my tests show 20% - 40%) and more accurate and, based on that same IRC conversation,
I decided not to let all the work that I had done go to waste and to try to contribute it back
instead.  Uwe had mentioned that "More like this" detects term vectors when called and uses
the term vector implementation when possible, if I recall correctly, so I decided to do the same.

The patch that I've attached is my first stab at this.  It's not nearly complete, and full
disclosure dictates that I say it's not fully documented and there are no unit tests written
yet.  I wanted to go ahead and open an issue to get some feedback on the approach that I've
taken, and the fact that the issue exists will be a proverbial kick in my pants to continue
working on it.

In short, what I've changed:

* Completely rewrote the non-tv highlighter to be faster and cleaner.  There is some small
loss in functionality for now, namely the loss of the GradientHighlighter (I just haven't
done this yet) and the lack of exposure of TermFragments and their scores (I can expose this
if it is deemed necessary; this is one of the things I'd like feedback on).
* Moved both highlighters (term vector and non-term-vector) to a single package with a unified
interface, search.highlight (with two sub-packages: search.highlight.termvector and
search.highlight.iterative, respectively).
* Unified the highlighted term formatting into a single interface: highlighter/Formatter and
both highlighters use this now.  
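
To illustrate the idea of one formatter shared by both highlighters, here is a minimal sketch.
It is not taken from the attached patch: TermFormatter, TagFormatter, FormatterSketch and the
formatTerm signature are all hypothetical names chosen for illustration, and the toy highlight
method only stands in for a real highlighter.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a unified formatter interface; the real signature may differ. */
interface TermFormatter {
    String formatTerm(String term);
}

/** Wraps each matched term in a fixed pair of tags. */
final class TagFormatter implements TermFormatter {
    private final String open;
    private final String close;

    TagFormatter(String open, String close) {
        this.open = open;
        this.close = close;
    }

    @Override
    public String formatTerm(String term) {
        return open + term + close;
    }
}

/** Toy highlighter: any implementation could accept the same formatter. */
final class FormatterSketch {
    static String highlight(String text, Set<String> queryTerms, TermFormatter formatter) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Case-insensitive match; queryTerms is assumed to hold lowercase terms.
            boolean hit = queryTerms.contains(token.toLowerCase(Locale.ROOT));
            out.append(hit ? formatter.formatTerm(token) : token);
        }
        return out.toString();
    }
}
```

With a TagFormatter built from "<b>" and "</b>", only the tokens that match a query term get
wrapped; the tag generation lives in one place regardless of which highlighter calls it.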

What I need to do before I personally would consider this finished:

* Finish documentation, most specifically on TermVectorHighlighter.  I haven't done this yet
as I expect things to change quite a bit before they're finalized and I really hate writing
documentation that goes to waste, but I do intend to complete this bullet :)
* "Flesh out" the API of search.highlight.Highlighter as it's very barebones right now
* Continue removing and consolidating duplicate functionality, like I've done with the highlighted
word tag generation.

What I think I need feedback on, before I can proceed:
* FastTermVectorHighlighter and the iterative highlighters need completely different sets
of information in order to work.  The approach I've taken is to expose a vectorHighlight method
and an iterativeHighlight method in the unified interface, as well as a single highlight method
that takes all the information needed for either of them.  I'm unsure if this is the best
way to do this.
* The naming of things; I'm not sure if this is a big issue, or even an issue at all, but
I'd like to not break any conventions that may exist that I'm unaware of.
* How big of a deal is exposing the particular score of a segment from the highlighting interface
and does this need to be extended into the term vector highlighting as well?
* There are a lot of methods in the tv implementation that are marked deprecated; since this
release will almost definitely break backwards compatibility anyway, are these safe to remove?
* Any other input anyone else may have :)
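
For the first question in the list, the three-method shape could look roughly like the sketch
below.  This is only an illustration under stated assumptions, not the code in the patch: the
method names vectorHighlight, iterativeHighlight and highlight come from the description, but
HighlighterSketch, VectorRequest, IterativeRequest, StubHighlighter and hasTermVectors are
hypothetical, and the stub returns labels instead of doing any real highlighting.

```java
import java.util.Set;

/** Inputs the term-vector path might need (hypothetical). */
final class VectorRequest {
    final int docId;
    final String field;

    VectorRequest(int docId, String field) {
        this.docId = docId;
        this.field = field;
    }
}

/** Inputs the iterative path might need (hypothetical). */
final class IterativeRequest {
    final String field;
    final String text;

    IterativeRequest(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

/** Unified interface: two specific entry points plus one that dispatches. */
interface HighlighterSketch {
    boolean hasTermVectors(VectorRequest request);

    String vectorHighlight(VectorRequest request);

    String iterativeHighlight(IterativeRequest request);

    /** Prefers the term-vector path when vectors are available, else falls back. */
    default String highlight(VectorRequest vector, IterativeRequest iterative) {
        return hasTermVectors(vector)
                ? vectorHighlight(vector)
                : iterativeHighlight(iterative);
    }
}

/** Stub that only reports which path was taken. */
final class StubHighlighter implements HighlighterSketch {
    private final Set<String> fieldsWithVectors;

    StubHighlighter(Set<String> fieldsWithVectors) {
        this.fieldsWithVectors = fieldsWithVectors;
    }

    @Override
    public boolean hasTermVectors(VectorRequest request) {
        return fieldsWithVectors.contains(request.field);
    }

    @Override
    public String vectorHighlight(VectorRequest request) {
        return "vector:" + request.field + "#" + request.docId;
    }

    @Override
    public String iterativeHighlight(IterativeRequest request) {
        return "iterative:" + request.field;
    }
}
```

The trade-off is visible in the highlight signature: the caller has to supply both sets of
inputs even though only one is used, which is exactly the part that may deserve a better design.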

I'm going to continue to work on the things that I can, at least unless someone tells
me I'm wasting my time, and I look forward to hearing your feedback! :)

As a sidenote, because it does seem rather random that I would arbitrarily rewrite a working
algorithm in the non-tv highlighter: I did it originally because I wanted to parallelize the
highlighting (which was a failed experiment) and simply to see if I could make the algorithm
faster, as I find that sort of thing particularly fun :)

As a second sidenote, if anyone would like an explanation of the algorithm for the highlighting
I devised, and why I feel that it's more accurate, I'd be happy to provide them with one (and
benchmarks as well).


This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
