lucene-dev mailing list archives

From Ophir Cohen <oph...@gmail.com>
Subject Re: Fwd: Payloads API and support
Date Tue, 01 Feb 2011 13:30:19 GMT
Hi All,

I started this thread on the users mailing list, but it looks to me like 
the dev mailing list is the better place for it.


As I wrote below, I've been using Lucene for more than 5 years and I think 
you're doing great work!


Lately I encountered a problem concerning payloads.

Please read the mail below first, and then this one.

Thanks,


Some more thoughts from today:

As far as I can see, there is no way to get payloads for searches that 
contain more than one term.


Why can't the collector receive the payload data (assuming there is any) 
together with the doc id, or something like that?

e.g.

Instead of:

Collector.collect(int doc)


It would be:

Collector.collect(int doc, Payload payload)

or:

Collector.collect(int doc, Payload[] payload)

to pass the payloads of all the matched terms.
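
To make the idea concrete, here is a rough sketch of what I have in mind. 
None of this is existing Lucene API: PayloadAwareCollector, PageMetrics and 
MetricsCollector are hypothetical names, the payload is shown as a plain 
byte[] instead of Lucene's Payload class so the sketch stands alone, and 
the 8-byte layout (view count, then average time in seconds) is only an 
assumption about how I would encode the page metrics.

```java
import java.nio.ByteBuffer;

// Hypothetical callback, NOT part of Lucene: a variant of
// Collector.collect(int doc) that also receives one payload per matched term.
interface PayloadAwareCollector {
    void collect(int doc, byte[][] payloads);
}

// Assumed payload layout: 4-byte view count followed by a 4-byte float
// holding the average time on page in seconds.
final class PageMetrics {
    static byte[] encode(int views, float avgSeconds) {
        return ByteBuffer.allocate(8).putInt(views).putFloat(avgSeconds).array();
    }

    static int views(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(0);
    }

    static float avgSeconds(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat(4);
    }
}

// Aggregates the metrics of every matched document from the payloads alone,
// without loading any stored document data.
final class MetricsCollector implements PayloadAwareCollector {
    private long totalViews;
    private double totalSeconds;
    private int docs;

    @Override
    public void collect(int doc, byte[][] payloads) {
        if (payloads == null || payloads.length == 0) {
            return; // this hit carries no payload
        }
        // Assumption: every term of a page carries the same page-level
        // metrics, so reading the first payload is enough.
        byte[] p = payloads[0];
        totalViews += PageMetrics.views(p);
        totalSeconds += PageMetrics.avgSeconds(p);
        docs++;
    }

    long totalViews() {
        return totalViews;
    }

    // Mean of the per-page average times over all collected documents.
    double meanTimeOnPage() {
        return docs == 0 ? 0.0 : totalSeconds / docs;
    }
}
```

With something like this, a query on 'house' and 'white' would call 
collect once per matching page, and the totals would be ready when the 
search finishes.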


Am I missing something here?
Is that feasible?
Is that possible with the bulk reading of the index?

Thanks,

Ophir


On 2/1/2011 12:18 PM, Ophir Cohen wrote:

> Hi Guys,
>
> I've been using Lucene for more than 5 years and it is a great tool - 
> great job! Thanks for everything...
>
>
> Lately I came across the new payloads support, and it looks like a great 
> solution for my project.
>
>
> *The problem:*
>
> The use case is as follows:
>
> I need to support a way to calculate statistics on web pages.
>
> Each page has a few metrics that come with it (how many users saw it, 
> what the average time on the page was, etc.).
>
>
> The requirement is to support queries such as:
>
> How many users saw pages containing the tokens 'house' and 'white'?
>
> Or
>
> What was the average time on pages containing the tokens 'horse' and 'pony'?
>
>
> *First solution:*
>
> Add pages to Lucene, index the words and store the metrics.
>
> *The problem: performance.*
>
> Unlike a regular search, I need to provide results for all matched 
> documents, so I have to iterate over all the results and load each 
> document's data.
> This method takes too much time.
>
>
> *Better solution:*
>
> Store the metrics as payloads and calculate the needed data without 
> accessing the stored documents - a huge performance boost.
>
>
> The problem is (unless I'm missing something) that I can't get the payloads 
> from anything except TermPositions, and that isn't good enough because I 
> want to use complex queries.
>
> Is there any other way to access them?
>
> One option can be to get the payload with the document id in the 
> collector.
>
>
> Any ideas/comments/suggestions?
> -- 
> Thanks in advance,
> Ophir Cohen
