Subject: Re: Using term offsets for hit highlighting
From: Alan Woodward
Date: Tue, 22 May 2012 19:38:55 +0100
To: dev@lucene.apache.org, simon.willnauer@gmail.com
Message-Id: <0A4EDBA8-1EBF-4D4D-9BAD-587C850ADF7E@romseysoftware.co.uk>

Hey, I reckon I can have a decent go at getting the branch updated. Is it best to work this out as a patch applying to trunk? Any patch that merges in all the trunk changes to the branch is going to be absolutely massive...

On 17 May 2012, at 13:15, Simon Willnauer wrote:

> ok man. I will try to merge up the branch. I tell you this is going to
> be messy and it might not compile but I will make it reasonable so you
> can start.
>
> simon
>
> On Thu, May 17, 2012 at 8:03 AM, Alan Woodward wrote:
>> Sorry for vanishing for so long, life unexpectedly caught up with me... I'm going to have some time to look at this again next week though, if you're interested in picking it up again.
>>
>> On 21 Mar 2012, at 09:02, Alan Woodward wrote:
>>
>>> That would be great, thanks! I had a go at merging it last night, but there are a *lot* of changes that I haven't got my head round yet, so it was getting pretty messy.
>>>
>>> On 21 Mar 2012, at 08:49, Simon Willnauer wrote:
>>>
>>>> Alan, if you want I can just merge the branch up next week and we
>>>> iterate from there?
>>>>
>>>> simon
>>>>
>>>> On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson wrote:
>>>>> Yep, the first challenge is always getting the old patch(es) to apply.....
>>>>>
>>>>> On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward wrote:
>>>>>> Thanks for all the offers of help! It looks as though most of the hard work has already been done, which is exactly where I like to pick up projects.
>>>>>> :-)
>>>>>>
>>>>>> Maybe the best place to start would be for me to rebase the branch against trunk, and see what still fits? I think there have been some fairly major changes in the internals since July last year.
>>>>>>
>>>>>> On 19 Mar 2012, at 17:07, Mike Sokolov wrote:
>>>>>>
>>>>>>> I posted a patch with a Collector somewhat similar to what you described, Alan - it's attached to one of the sub-issues https://issues.apache.org/jira/browse/LUCENE-3318. It is in a fairly complete "alpha" state, but has seen no production use of course, since it relies on the remainder of the unfinished work in that branch. It works by creating a TokenStream based on match positions returned from the query and passing that to the existing Highlighter. Please feel free to get in touch if you decide to look into that and have questions.
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On 03/19/2012 11:51 AM, Simon Willnauer wrote:
>>>>>>>> On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler wrote:
>>>>>>>>
>>>>>>>>> Have you marked that for GSOC? Would be a good idea!
>>>>>>>>>
>>>>>>>> yes I did
>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> Uwe Schindler
>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>> http://www.thetaphi.de
>>>>>>>>> eMail: uwe@thetaphi.de
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
>>>>>>>>>> Sent: Monday, March 19, 2012 4:43 PM
>>>>>>>>>> To: dev@lucene.apache.org
>>>>>>>>>> Subject: Re: Using term offsets for hit highlighting
>>>>>>>>>>
>>>>>>>>>> Alan, you made my day!
>>>>>>>>>>
>>>>>>>>>> The branch is kind of outdated but I looked at it lately and I can certainly help
>>>>>>>>>> to get it up to speed. The feature in that branch is quite a big one and its in a
>>>>>>>>>> very early stage. Still I want to encourage you to take a look and work on it. I
>>>>>>>>>> promise all my help with the issues!
>>>>>>>>>>
>>>>>>>>>> let me know if you have questions!
>>>>>>>>>>
>>>>>>>>>> simon
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward wrote:
>>>>>>>>>>
>>>>>>>>>>> Cool, thanks Robert. I'll take a look at the JIRA ticket.
>>>>>>>>>>>
>>>>>>>>>>> On 19 Mar 2012, at 14:44, Robert Muir wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The project I'm currently working on requires the reporting of exact
>>>>>>>>>>>>> hit positions from some pretty hairy queries, not all of which are
>>>>>>>>>>>>> covered by the existing highlighter modules. I'm working round this
>>>>>>>>>>>>> by translating everything into SpanQueries, and using the getSpans()
>>>>>>>>>>>>> method to locate hits (I've extended the Spans interface to make
>>>>>>>>>>>>> term offsets available - see
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3826). This works for
>>>>>>>>>>>>> our use-case, but isn't terribly efficient, and obviously isn't applicable to
>>>>>>>>>>>>> non-Span queries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've seen a bit of chatter on the list about using term offsets to
>>>>>>>>>>>>> provide accurate highlighting in Lucene. I'm going to have a couple
>>>>>>>>>>>>> of weeks free in April, and I thought I might have a go at
>>>>>>>>>>>>> implementing this. Mainly I'm wondering if there's already been
>>>>>>>>>>>>> thoughts about how to do it. My current thoughts are to somehow
>>>>>>>>>>>>> extend the Weight and Scorer interface to make term offsets
>>>>>>>>>>>>> available; to get highlights for a given set of documents, you'd
>>>>>>>>>>>>> essentially run the query again, with a filter on just the documents
>>>>>>>>>>>>> you want highlighted, and have a custom collector that gets the term
>>>>>>>>>>>>> offsets in place of the scores.
>>>>>>>>>>>>>
>>>>>>>>>>>> Hi Alan, Simon started some initial work on
>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2878
>>>>>>>>>>>>
>>>>>>>>>>>> Some work and prototypes were done in a branch, but it might be
>>>>>>>>>>>> lagging behind trunk a bit.
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally at the time it was first done, I think we didn't yet
>>>>>>>>>>>> support offsets in the postings lists.
>>>>>>>>>>>> We've since added this and several codecs support it.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
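
As a rough illustration of the idea in the original message at the bottom of the thread (run the query again over just the documents to be highlighted, and collect term offsets in place of scores), here is a minimal sketch using a toy in-memory index. Everything in it is an assumption made for illustration: build_index, collect_offsets and highlight are hypothetical names, the "query" is just a list of terms, and none of this is Lucene's API or the code on the LUCENE-2878 branch.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Toy postings with offsets: term -> doc_id -> [(start, end), ...]."""
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for match in re.finditer(r"\w+", text):
            postings[match.group().lower()][doc_id].append(
                (match.start(), match.end())
            )
    return postings


def collect_offsets(postings, query_terms, doc_ids):
    """Stand-in for the 'custom collector': visits only the requested
    documents and gathers character offsets rather than scores."""
    terms = {term.lower() for term in query_terms}
    hits = {}
    for doc_id in doc_ids:
        spans = []
        for term in terms:
            spans.extend(postings.get(term, {}).get(doc_id, []))
        hits[doc_id] = sorted(spans)
    return hits


def highlight(text, spans, pre="<b>", post="</b>"):
    """Wrap each (start, end) span in markers; spans must be sorted
    and non-overlapping, which holds for single-token matches."""
    out, pos = [], 0
    for start, end in spans:
        out.append(text[pos:start])
        out.append(pre + text[start:end] + post)
        pos = end
    out.append(text[pos:])
    return "".join(out)


docs = {1: "Lucene stores term offsets", 2: "No match here"}
postings = build_index(docs)
hits = collect_offsets(postings, ["offsets", "lucene"], [1])
print(highlight(docs[1], hits[1]))
```

Because the offsets come straight from the postings, the highlighter never has to re-analyse the document text; that is the efficiency gain the thread is after, though a real implementation would also need to handle phrase and span matches that cover several tokens.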