Date: Thu, 9 May 2013 23:06:07 +0000 (UTC)
From: "Uwe Schindler (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] [Updated] (LUCENE-2508) Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter

     [ https://issues.apache.org/jira/browse/LUCENE-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2508:
----------------------------------
    Fix Version/s:     (was: 4.3)
                   4.4

> Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2508
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2508
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/highlighter
>         Environment: irrelevant
>            Reporter: Edward Drapkin
>            Priority: Minor
>              Labels: highlight, search
>             Fix For: 4.4
>
>         Attachments: LUCENE-2508.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Originally, I had planned to create a contrib module to allow people to highlight multiple documents in parallel, but after talking to Uwe in IRC about it, I realized that it was pretty useless. However, I was already sitting on an iterative highlighting algorithm that was much faster (my tests show 20% - 40%) and more accurate and, based on that same IRC conversation, I decided to not let all the work that I had done go to waste and try to contribute it back again. Uwe had mentioned that "More like this" detects term vectors when called and uses the term vector implementation when possible, if I recall correctly, so I decided to do the same.
> The patch that I've attached is my first stab at this. It's not nearly complete, and full disclosure dictates that I say that it's not fully documented and no unit tests have been written. I wanted to go ahead and open an issue to get some feedback on the approach that I've taken; the fact that the issue exists will also be a proverbial kick in my pants to continue working on it.
> In short, what I've changed:
> * Completely rewritten the non-tv highlighter to be faster and cleaner. There is some small loss in functionality for now, namely the loss of the GradientHighlighter (I just haven't done this yet) and the lack of exposure of TermFragments and their scores (I can expose this if it is deemed necessary; this is one of the things I'd like feedback on).
> * Moved org.apache.lucene.search.vectorhighlight and org.apache.lucene.search.highlight to a single package with a unified interface, search.highlight (with two sub-packages: search.highlight.termvector and search.highlight.iterative, respectively).
> * Unified the highlighted term formatting into a single interface, highlighter/Formatter, which both highlighters now use.
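To make the shape of the changes listed above more concrete, here is a minimal, self-contained Java sketch of one way a single formatter and a "use precomputed data when it exists" dispatch could fit together. It is not taken from LUCENE-2508.patch and does not use Lucene's API. Every name in it (HighlightSketch, Offsets, highlightWithOffsets, highlightByScanning) is hypothetical. The Formatter name comes from the description, but its one-method signature is an assumption, and the Offsets list merely stands in for whatever term vector data a real implementation would read.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch only: not the patch, not Lucene's API.
final class HighlightSketch {

    /** Assumed shape of the single formatter shared by both code paths. */
    interface Formatter {
        String format(String matchedText);
    }

    /** Stand-in for stored term vector data: character offsets of one match. */
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Path used when offsets are available; assumes they are sorted and non-overlapping. */
    static String highlightWithOffsets(String text, List<Offsets> offsets, Formatter formatter) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Offsets o : offsets) {
            out.append(text, pos, o.start);
            out.append(formatter.format(text.substring(o.start, o.end)));
            pos = o.end;
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    /** Fallback path: one pass over the text; terms are expected in lower case. */
    static String highlightByScanning(String text, Set<String> terms, Formatter formatter) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int j = i;
                while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
                    j++;
                }
                String word = text.substring(i, j);
                boolean hit = terms.contains(word.toLowerCase(Locale.ROOT));
                out.append(hit ? formatter.format(word) : word);
                i = j;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** Single entry point: prefer stored offsets, otherwise re-scan the text. */
    static String highlight(String text, Set<String> terms, List<Offsets> offsetsOrNull,
                            Formatter formatter) {
        return offsetsOrNull != null
                ? highlightWithOffsets(text, offsetsOrNull, formatter)
                : highlightByScanning(text, terms, formatter);
    }

    public static void main(String[] args) {
        Formatter bold = s -> "<b>" + s + "</b>";
        String text = "Lucene highlights matching terms";
        Set<String> terms = new HashSet<>(Arrays.asList("lucene", "terms"));
        System.out.println(highlight(text, terms, null, bold));
        System.out.println(highlight(text, Collections.<String>emptySet(),
                Collections.singletonList(new Offsets(0, 6)), bold));
    }
}
```

Both paths wrap matches through the same Formatter, which is the point of the third bullet. Whether a single entry point taking the inputs of both paths is the right design is left open in the issue itself.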
> What I need to do before I personally would consider this finished:
> * Finish documentation, most specifically on TermVectorHighlighter. I haven't done this yet as I expect things to change quite a bit before they're finalized and I really hate writing documentation that goes to waste, but I do intend to complete this bullet :)
> * "Flesh out" the API of search.highlight.Highlighter, as it's very barebones right now.
> * Continue removing and consolidating duplicate functionality, like I've done with the highlighted word tag generation.
> What I think I need feedback on, before I can proceed:
> * FastTermVectorHighlighter and the iterative highlighters need completely different sets of information in order to work. The approach I've taken is exposing a vectorHighlight method and an iterativeHighlight method in the unified interface, as well as a single highlight method that takes all the information needed for either of them. I'm unsure if this is the best way to do this.
> * The naming of things; I'm not sure if this is a big issue, or even an issue at all, but I'd like to not break any conventions that may exist that I'm unaware of.
> * How big of a deal is exposing the particular score of a segment from the highlighting interface, and does this need to be extended into the term vector highlighting as well?
> * There are a lot of methods in the tv implementation that are marked deprecated; since this release will almost definitely break backwards compatibility anyway, are these safe to remove?
> * Any other input anyone else may have :)
> I'm going to continue to work on things that I can work on, at least unless someone tells me I'm wasting my time, and I look forward to hearing you guys' feedback!
:)
> As a sidenote because it does seem rather random that I would arbitrarily re-write a working algorithm in the non-tv highlighter, I did it originally because I wanted to parallelize the highlighting (which was a failed experiment) and simply to see if I could make the algorithm faster, as I find that sort of thing particularly fun :)
> As a second sidenote, if anyone would like an explanation of the algorithm for the highlighting I devised, and why I feel that it's more accurate, I'd be happy to provide them with one (and benchmarks as well).
> Thanks,
> Eddie

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org