lucene-dev mailing list archives

From Alan Woodward <alan.woodw...@romseysoftware.co.uk>
Subject Re: Using term offsets for hit highlighting
Date Wed, 23 May 2012 11:23:44 GMT
Sweet, thanks Simon.  I'll have a go at getting some failing tests passing to begin with.

On 23 May 2012, at 11:59, Simon Willnauer wrote:

> alan,
> 
> I merged the branch manually and created a new branch from it. it's
> here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878
> the branch compiles but has lots of nocommits / todos
> 
> if you have questions please ask, I will help as much as I can
> 
> simon
> 
> On Tue, May 22, 2012 at 8:38 PM, Alan Woodward
> <alan.woodward@romseysoftware.co.uk> wrote:
>> Hey, I reckon I can have a decent go at getting the branch updated.  Is it best to
>> work this out as a patch applying to trunk?  Any patch that merges in all the trunk changes
>> to the branch is going to be absolutely massive...
>> 
>> On 17 May 2012, at 13:15, Simon Willnauer wrote:
>> 
>>> ok man. I will try to merge up the branch. I tell you this is going to
>>> be messy and it might not compile but I will make it reasonable so you
>>> can start.
>>> 
>>> simon
>>> 
>>> On Thu, May 17, 2012 at 8:03 AM, Alan Woodward
>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>> Sorry for vanishing for so long, life unexpectedly caught up with me...
>>>> I'm going to have some time to look at this again next week though, if you're interested in
>>>> picking it up again.
>>>> 
>>>> On 21 Mar 2012, at 09:02, Alan Woodward wrote:
>>>> 
>>>>> That would be great, thanks!  I had a go at merging it last night, but
>>>>> there are a *lot* of changes that I haven't got my head round yet, so it was getting pretty
>>>>> messy.
>>>>> 
>>>>> On 21 Mar 2012, at 08:49, Simon Willnauer wrote:
>>>>> 
>>>>>> Alan, if you want I can just merge the branch up next week and we
>>>>>> iterate from there?
>>>>>> 
>>>>>> simon
>>>>>> 
>>>>>> On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson
>>>>>> <erickerickson@gmail.com> wrote:
>>>>>>> Yep, the first challenge is always getting the old patch(es)
>>>>>>> to apply.....
>>>>>>> 
>>>>>>> On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward
>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>> Thanks for all the offers of help!  It looks as though most
>>>>>>>> of the hard work has already been done, which is exactly where I like to pick up projects.
>>>>>>>> :-)
>>>>>>>> 
>>>>>>>> Maybe the best place to start would be for me to rebase the
>>>>>>>> branch against trunk, and see what still fits?  I think there have been some fairly major
>>>>>>>> changes in the internals since July last year.
>>>>>>>> 
>>>>>>>> On 19 Mar 2012, at 17:07, Mike Sokolov wrote:
>>>>>>>> 
>>>>>>>>> I posted a patch with a Collector somewhat similar to
>>>>>>>>> what you described, Alan - it's attached to one of the sub-issues https://issues.apache.org/jira/browse/LUCENE-3318.
>>>>>>>>> It is in a fairly complete "alpha" state, but has seen no production use of course, since
>>>>>>>>> it relies on the remainder of the unfinished work in that branch.  It works by creating a
>>>>>>>>> TokenStream based on match positions returned from the query and passing that to the existing
>>>>>>>>> Highlighter.  Please feel free to get in touch if you decide to look into that and have questions.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -Mike
>>>>>>>>> 
>>>>>>>>> On 03/19/2012 11:51 AM, Simon Willnauer wrote:
>>>>>>>>>> On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Have you marked that for GSOC? Would be a good idea!
>>>>>>>>>>> 
>>>>>>>>>> yes I did
>>>>>>>>>> 
>>>>>>>>>>> -----
>>>>>>>>>>> Uwe Schindler
>>>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>>>> http://www.thetaphi.de
>>>>>>>>>>> eMail: uwe@thetaphi.de
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
>>>>>>>>>>>> Sent: Monday, March 19, 2012 4:43 PM
>>>>>>>>>>>> To: dev@lucene.apache.org
>>>>>>>>>>>> Subject: Re: Using term offsets for hit highlighting
>>>>>>>>>>>> 
>>>>>>>>>>>> Alan, you made my day!
>>>>>>>>>>>> 
>>>>>>>>>>>> The branch is kind of outdated but I looked at it lately and I can certainly help
>>>>>>>>>>>> to get it up to speed. The feature in that branch is quite a big one and it's in a
>>>>>>>>>>>> very early stage. Still I want to encourage you to take a look and work on it. I
>>>>>>>>>>>> promise all my help with the issues!
>>>>>>>>>>>> 
>>>>>>>>>>>> let me know if you have questions!
>>>>>>>>>>>> 
>>>>>>>>>>>> simon
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward
>>>>>>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Cool, thanks Robert.  I'll take a look at the JIRA ticket.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 19 Mar 2012, at 14:44, Robert Muir wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
>>>>>>>>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The project I'm currently working on requires the reporting of exact
>>>>>>>>>>>>>>> hit positions from some pretty hairy queries, not all of which are
>>>>>>>>>>>>>>> covered by the existing highlighter modules.  I'm working round this
>>>>>>>>>>>>>>> by translating everything into SpanQueries, and using the getSpans()
>>>>>>>>>>>>>>> method to locate hits (I've extended the Spans interface to make
>>>>>>>>>>>>>>> term offsets available - see
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3826).  This works for
>>>>>>>>>>>>>>> our use-case, but isn't terribly efficient, and obviously isn't applicable to
>>>>>>>>>>>>>>> non-Span queries.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I've seen a bit of chatter on the list about using term offsets to
>>>>>>>>>>>>>>> provide accurate highlighting in Lucene.  I'm going to have a couple
>>>>>>>>>>>>>>> of weeks free in April, and I thought I might have a go at
>>>>>>>>>>>>>>> implementing this.  Mainly I'm wondering if there have already been
>>>>>>>>>>>>>>> thoughts about how to do it.  My current thoughts are to somehow
>>>>>>>>>>>>>>> extend the Weight and Scorer interface to make term offsets
>>>>>>>>>>>>>>> available; to get highlights for a given set of documents, you'd
>>>>>>>>>>>>>>> essentially run the query again, with a filter on just the documents
>>>>>>>>>>>>>>> you want highlighted, and have a custom collector that gets the term
>>>>>>>>>>>>>>> offsets in place of the scores.
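[Editor's illustration] The idea in the paragraph above (re-run the query over only the documents to be highlighted, and collect term offsets rather than scores) can be sketched roughly as follows. This is a toy model and not Lucene code: `OffsetCollector`, `search` and `highlight` are hypothetical names, and a plain case-sensitive substring scan stands in for a real query, so it also matches inside longer words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class OffsetHighlightSketch {
    /** A matched character range [start, end) within a document's text. */
    static final class Offset {
        final int start, end;
        Offset(int start, int end) { this.start = start; this.end = end; }
    }

    /** Hypothetical collector: records match offsets per document, not scores. */
    static final class OffsetCollector {
        final Map<Integer, List<Offset>> hits = new TreeMap<>();
        void collect(int docId, Offset o) {
            hits.computeIfAbsent(docId, k -> new ArrayList<>()).add(o);
        }
    }

    /** Toy "query": substring scan for one term, restricted to the filter set. */
    static void search(Map<Integer, String> docs, String term,
                       Set<Integer> filter, OffsetCollector collector) {
        if (term.isEmpty()) return; // avoid looping forever on an empty term
        for (int docId : filter) {
            String text = docs.get(docId);
            if (text == null) continue;
            int from = 0;
            while ((from = text.indexOf(term, from)) >= 0) {
                collector.collect(docId, new Offset(from, from + term.length()));
                from += term.length();
            }
        }
    }

    /** Wraps each collected range in <b> tags; offsets must be sorted and non-overlapping. */
    static String highlight(String text, List<Offset> offsets) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Offset o : offsets) {
            sb.append(text, last, o.start);
            sb.append("<b>").append(text, o.start, o.end).append("</b>");
            last = o.end;
        }
        return sb.append(text, last, text.length()).toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(1, "the quick brown fox");
        docs.put(2, "fox and fox");
        OffsetCollector collector = new OffsetCollector();
        // Only document 2 is to be highlighted, so only it is searched again.
        search(docs, "fox", Collections.singleton(2), collector);
        System.out.println(highlight(docs.get(2), collector.hits.get(2)));
        // prints: <b>fox</b> and <b>fox</b>
    }
}
```

In a real implementation the offsets would presumably come from the index rather than from rescanning the stored text, which is what the proposed Weight/Scorer extension is about.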
>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Alan, Simon started some initial work on
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2878
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Some work and prototypes were done in a branch, but it might be
>>>>>>>>>>>>>> lagging behind trunk a bit.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Additionally at the time it was first done, I think we didn't yet
>>>>>>>>>>>>>> support offsets in the postings lists.
>>>>>>>>>>>>>> We've since added this and several codecs support it.
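[Editor's illustration] To make "offsets in the postings lists" concrete, here is a minimal sketch of an inverted index whose postings carry character offsets alongside token positions. It is a toy model under stated assumptions, not Lucene's codec format or API: `OffsetPostingsSketch`, `Posting`, `addDocument` and `lookup` are hypothetical names, and tokenization is plain whitespace splitting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OffsetPostingsSketch {
    /** One term occurrence: token position plus character offsets [startOffset, endOffset). */
    static final class Posting {
        final int docId, position, startOffset, endOffset;
        Posting(int docId, int position, int startOffset, int endOffset) {
            this.docId = docId;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Whitespace-tokenizes the text and records each token's position and offsets. */
    void addDocument(int docId, String text) {
        int position = 0;
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                String term = text.substring(start, i);
                postings.computeIfAbsent(term, k -> new ArrayList<>())
                        .add(new Posting(docId, position++, start, i));
            }
        }
    }

    /** Returns every recorded occurrence of the term, or an empty list. */
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        OffsetPostingsSketch index = new OffsetPostingsSketch();
        String text = "exact hit positions";
        index.addDocument(7, text);
        for (Posting p : index.lookup("hit")) {
            // The stored offsets locate the match in the text without re-analysis.
            System.out.println(p.docId + ": " + text.substring(p.startOffset, p.endOffset));
        }
    }
}
```

With offsets stored at index time, a highlighter can locate a match in the original text directly instead of re-analyzing the document.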
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org

