lucene-dev mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: Using term offsets for hit highlighting
Date Wed, 23 May 2012 18:14:06 GMT
Hey Alan,


On Wed, May 23, 2012 at 6:46 PM, Alan Woodward
<alan.woodward@romseysoftware.co.uk> wrote:
> OK, so the most straightforward way to do that would be to change the
> signature to positions(boolean needsPayloads, boolean needsOffsets), I
> guess.  This is a new API so it's not breaking anything.

Yeah, I'd think so. This is also consistent with how we pull scorers, and
it's safe in terms of changes, i.e. you won't miss an API change, as you
might with a struct-like object. I am not sure how we will expose the
offsets yet, but for now let's make the tests pass. That should give you a
good and straightforward start. Don't worry about the API for now; we
are in a dev phase that doesn't need to produce a fixed API, and we will
straighten that out iteratively as we go.
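
A minimal sketch of the signature discussed above, to make the two flags
concrete. PositionIterator, PositionScorer and ArrayTermScorer are
hypothetical names invented for this illustration, not the classes in the
LUCENE-2878 branch, and returning -1 / null for data that was not requested
is likewise an assumption:

```java
import java.util.ArrayList;
import java.util.List;

final class PositionsSketch {
    // Collects "position:start-end" strings for every occurrence the scorer reports.
    static List<String> describe(PositionScorer scorer, boolean needsOffsets) {
        List<String> out = new ArrayList<>();
        // Positions are pulled only here, when the caller actually wants them.
        PositionIterator it = scorer.positions(false, needsOffsets);
        while (it.next()) {
            out.add(it.position() + ":" + it.startOffset() + "-" + it.endOffset());
        }
        return out;
    }

    public static void main(String[] args) {
        PositionScorer scorer = new ArrayTermScorer(
            new int[] {0, 3}, new int[] {0, 17}, new int[] {6, 23},
            new byte[][] {null, null});
        System.out.println(describe(scorer, true));   // offsets requested
        System.out.println(describe(scorer, false));  // offsets not requested
    }
}

// Hypothetical stand-in for the branch's position iterator (not a real Lucene type).
interface PositionIterator {
    boolean next();     // advance to the next occurrence; false when exhausted
    int position();     // token position of the current occurrence
    int startOffset();  // -1 unless offsets were requested (assumed convention)
    int endOffset();    // -1 unless offsets were requested (assumed convention)
    byte[] payload();   // null unless payloads were requested (assumed convention)
}

// Hypothetical scorer exposing the signature proposed in this thread.
interface PositionScorer {
    PositionIterator positions(boolean needsPayloads, boolean needsOffsets);
}

// Toy scorer backed by in-memory arrays; a real one would read the postings.
final class ArrayTermScorer implements PositionScorer {
    private final int[] positions;
    private final int[] starts;
    private final int[] ends;
    private final byte[][] payloads;

    ArrayTermScorer(int[] positions, int[] starts, int[] ends, byte[][] payloads) {
        this.positions = positions;
        this.starts = starts;
        this.ends = ends;
        this.payloads = payloads;
    }

    @Override
    public PositionIterator positions(final boolean needsPayloads, final boolean needsOffsets) {
        // The flags tell the iterator which per-position data to expose.
        return new PositionIterator() {
            private int i = -1;
            public boolean next() { return ++i < positions.length; }
            public int position() { return positions[i]; }
            public int startOffset() { return needsOffsets ? starts[i] : -1; }
            public int endOffset() { return needsOffsets ? ends[i] : -1; }
            public byte[] payload() { return needsPayloads ? payloads[i] : null; }
        };
    }
}
```

With two plain boolean parameters, any later change to the signature breaks
callers at compile time, which is the safety property mentioned above.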

>
> It'll be tomorrow morning before I have a proper go at this now (Cambridge
> Beer Festival tonight…).  Is the mailing list the best place to discuss
> this, or is JIRA/IRC better?

Patches should go on the issue, and code discussions related to the
patches too. It might make sense to have discussions of a broader scope
on the dev list; decisions made on the list should be referenced on
the issue. IRC might make sense too if you have questions that
are better answered interactively. Still, any decisions should also be
discussed here or on the issue. If something we discussed on IRC leads
to design decisions, it's wise to repeat them on the issue so folks
can reproduce the decision-making process. In any case, if it's IRC, make
sure it's #lucene-dev.

looking forward to the patches...

simon
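
To make the goal of this thread concrete: once a collector has gathered
character offsets for the matching terms (as proposed in the original
message quoted below), highlighting reduces to wrapping those ranges in
markup. The following is only a sketch of that last step. The class name,
the <b> tags, and the assumption that the ranges arrive sorted and
non-overlapping are all illustrative; no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetHighlightSketch {
    // Wraps each [start, end) character range in <b>...</b>.
    // Assumes the ranges are sorted by start and do not overlap.
    static String highlight(String text, List<int[]> offsets) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] o : offsets) {
            sb.append(text, last, o[0]);                              // text before the hit
            sb.append("<b>").append(text, o[0], o[1]).append("</b>"); // the hit itself
            last = o[1];
        }
        sb.append(text, last, text.length());                         // trailing text
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> hits = new ArrayList<>();
        hits.add(new int[] {0, 6});    // first "lucene"
        hits.add(new int[] {17, 23});  // second "lucene"
        System.out.println(highlight("lucene indexes a lucene document", hits));
        // prints: <b>lucene</b> indexes a <b>lucene</b> document
    }
}
```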
>
> On 23 May 2012, at 13:43, Simon Willnauer wrote:
>
>> hey alan,
>>
>> I added position iterator support to ConjunctionTermScorer and
>> committed it to the branch. All tests that don't rely on payloads are
>> passing in core. Previously we had to decide up front whether we need
>> positions; the current code can pull them lazily, which causes fewer
>> changes on the Scorer API. I think we should keep it that way. The only
>> problem is that we currently have no way to tell the iterators whether
>> we need payloads or not. The same is true for offsets, since
>> they are now in the index. I think it would be good if you could
>> tackle the payloads first and pass some info to the Scorer#positions()
>> method so we can pull the right thing.
>>
>> happy coding.
>>
>> simon
>>
>> On Wed, May 23, 2012 at 1:23 PM, Alan Woodward
>> <alan.woodward@romseysoftware.co.uk> wrote:
>>> Sweet, thanks Simon.  I'll have a go at getting some failing tests passing
>>> to begin with.
>>>
>>> On 23 May 2012, at 11:59, Simon Willnauer wrote:
>>>
>>>> alan,
>>>>
>>>> I merged the branch manually and created a new branch from it. It's
>>>> here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878
>>>> The branch compiles, but there are lots of nocommits / todos.
>>>>
>>>> If you have questions, please ask; I will help as much as I can.
>>>>
>>>> simon
>>>>
>>>> On Tue, May 22, 2012 at 8:38 PM, Alan Woodward
>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>> Hey, I reckon I can have a decent go at getting the branch updated.  Is
>>>>> it best to work this out as a patch applying to trunk?  Any patch that
>>>>> merges in all the trunk changes to the branch is going to be absolutely
>>>>> massive…
>>>>>
>>>>> On 17 May 2012, at 13:15, Simon Willnauer wrote:
>>>>>
>>>>>> ok man. I will try to merge up the branch. I tell you this is going to
>>>>>> be messy and it might not compile but I will make it reasonable so you
>>>>>> can start.
>>>>>>
>>>>>> simon
>>>>>>
>>>>>> On Thu, May 17, 2012 at 8:03 AM, Alan Woodward
>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>> Sorry for vanishing for so long, life unexpectedly caught up with
>>>>>>> me...  I'm going to have some time to look at this again next week
>>>>>>> though, if you're interested in picking it up again.
>>>>>>>
>>>>>>> On 21 Mar 2012, at 09:02, Alan Woodward wrote:
>>>>>>>
>>>>>>>> That would be great, thanks!  I had a go at merging it last night,
>>>>>>>> but there are a *lot* of changes that I haven't got my head round
>>>>>>>> yet, so it was getting pretty messy.
>>>>>>>>
>>>>>>>> On 21 Mar 2012, at 08:49, Simon Willnauer wrote:
>>>>>>>>
>>>>>>>>> Alan, if you want I can just merge the branch up next week and we
>>>>>>>>> iterate from there?
>>>>>>>>>
>>>>>>>>> simon
>>>>>>>>>
>>>>>>>>> On Tue, Mar 20, 2012 at 12:34 PM, Erick Erickson
>>>>>>>>> <erickerickson@gmail.com> wrote:
>>>>>>>>>> Yep, the first challenge is always getting the old patch(es) to
>>>>>>>>>> apply.....
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 20, 2012 at 4:09 AM, Alan Woodward
>>>>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>>>>> Thanks for all the offers of help!  It looks as though most of the
>>>>>>>>>>> hard work has already been done, which is exactly where I like to
>>>>>>>>>>> pick up projects.  :-)
>>>>>>>>>>>
>>>>>>>>>>> Maybe the best place to start would be for me to rebase the branch
>>>>>>>>>>> against trunk, and see what still fits?  I think there have been
>>>>>>>>>>> some fairly major changes in the internals since July last year.
>>>>>>>>>>>
>>>>>>>>>>> On 19 Mar 2012, at 17:07, Mike Sokolov wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I posted a patch with a Collector somewhat similar to what you
>>>>>>>>>>>> described, Alan - it's attached to one of the sub-issues
>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3318.
>>>>>>>>>>>> It is in a fairly complete "alpha" state, but has seen no
>>>>>>>>>>>> production use of course, since it relies on the remainder of the
>>>>>>>>>>>> unfinished work in that branch.  It works by creating a
>>>>>>>>>>>> TokenStream based on match positions returned from the query and
>>>>>>>>>>>> passing that to the existing Highlighter.  Please feel free to get
>>>>>>>>>>>> in touch if you decide to look into that and have questions.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Mike
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/19/2012 11:51 AM, Simon Willnauer wrote:
>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 4:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you marked that for GSOC? Would be a good idea!
>>>>>>>>>>>>>>
>>>>>>>>>>>>> yes I did
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> Uwe Schindler
>>>>>>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>>>>>>> http://www.thetaphi.de
>>>>>>>>>>>>>> eMail: uwe@thetaphi.de
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Simon Willnauer [mailto:simon.willnauer@googlemail.com]
>>>>>>>>>>>>>>> Sent: Monday, March 19, 2012 4:43 PM
>>>>>>>>>>>>>>> To: dev@lucene.apache.org
>>>>>>>>>>>>>>> Subject: Re: Using term offsets for hit highlighting
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alan, you made my day!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The branch is kind of outdated but I looked at it lately and I can
>>>>>>>>>>>>>>> certainly help to get it up to speed. The feature in that branch is
>>>>>>>>>>>>>>> quite a big one and it's in a very early stage. Still I want to
>>>>>>>>>>>>>>> encourage you to take a look and work on it. I promise all my help
>>>>>>>>>>>>>>> with the issues!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> let me know if you have questions!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> simon
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 3:52 PM, Alan Woodward
>>>>>>>>>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cool, thanks Robert.  I'll take a look at the JIRA ticket.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 19 Mar 2012, at 14:44, Robert Muir wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:38 AM, Alan Woodward
>>>>>>>>>>>>>>>>> <alan.woodward@romseysoftware.co.uk> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The project I'm currently working on requires the reporting of exact
>>>>>>>>>>>>>>>>>> hit positions from some pretty hairy queries, not all of which are
>>>>>>>>>>>>>>>>>> covered by the existing highlighter modules.  I'm working round this
>>>>>>>>>>>>>>>>>> by translating everything into SpanQueries, and using the getSpans()
>>>>>>>>>>>>>>>>>> method to locate hits (I've extended the Spans interface to make
>>>>>>>>>>>>>>>>>> term offsets available - see
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-3826).  This works for
>>>>>>>>>>>>>>>>>> our use-case, but isn't terribly efficient, and obviously isn't
>>>>>>>>>>>>>>>>>> applicable to non-Span queries.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've seen a bit of chatter on the list about using term offsets to
>>>>>>>>>>>>>>>>>> provide accurate highlighting in Lucene.  I'm going to have a couple
>>>>>>>>>>>>>>>>>> of weeks free in April, and I thought I might have a go at
>>>>>>>>>>>>>>>>>> implementing this.  Mainly I'm wondering if there's already been
>>>>>>>>>>>>>>>>>> thoughts about how to do it.  My current thoughts are to somehow
>>>>>>>>>>>>>>>>>> extend the Weight and Scorer interface to make term offsets
>>>>>>>>>>>>>>>>>> available; to get highlights for a given set of documents, you'd
>>>>>>>>>>>>>>>>>> essentially run the query again, with a filter on just the documents
>>>>>>>>>>>>>>>>>> you want highlighted, and have a custom collector that gets the term
>>>>>>>>>>>>>>>>>> offsets in place of the scores.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Alan, Simon started some initial work on
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-2878
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Some work and prototypes were done in a branch, but it might be
>>>>>>>>>>>>>>>>> lagging behind trunk a bit.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Additionally at the time it was first done, I think we didn't yet
>>>>>>>>>>>>>>>>> support offsets in the postings lists.
>>>>>>>>>>>>>>>>> We've since added this and several codecs support it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>>>>>>>>>>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>


