lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: Customized search with Lucene?
Date Thu, 25 Oct 2007 06:48:57 GMT
Lukas,

Thanks for the link to the book, seems very interesting.
Your original question made me think you are intending
to maintain personalized score modifications - i.e. for
each user you are going to custom the scores differently
and I was curious to know how you are going to maintain
custom scoring info for each individual user.

So you are thinking about collecting clicks data
(of all searching users) and aggregate that data,
learn from it that some documents are more relevant
for some queries, and then use this data at search
time. This is different say from something like in-links
or page-rank where the scoring of documents is modified
to have a static part (static score), which would be the
same for a certain doc for all queries. In your scenario doc1
can be boosted up for query5 while for query6 it is doc2
that would be boosted up.

Just to give a more complete picture note that Lucene allows
to boost each doc at indexing time. But (a) boost has to be
known at indexing time; (b) boost currently is combined with
other info and compressed into one byte at indexing time so
there is some precision loss; and (c) it would be a fixed boost
for a certain doc for all queries. So this is definitely not
what you need.

Now to custom the scoring at search time there are a few options.

The most powerful way is to write your own query,scorer,weight.
It is also the hardest task.

Another alternative is to use the search.function package:
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/function/package-summary.html
It allows you to combine into the doc score a value that you
provide, which may be different for each document. Since you
supply this value at search time and you know which query is
being served you can provide a value that matches this query.
The code samples there as well as ready to use "DocValues"
are field based - i.e. they come from indexed fields, but
you can provide your implementation of DocValues so that
for a certain query it returns say 1 for all docs but a
different value (say 1.7) only for very few docs. So if the
output of your learning phase if very sparse (few docs boosted
for each query/word) this might work.

Might.

A tricky issue here is that these values are ordered by
doc ids, which are internal Lucene doc ids. Problem is that
these doc ids are not stable once docs are removed/updated.
This is a hard problem and there are numerous discussions in the
list about it.

Any solution which is docid based (even your own new
query/weight/scorer) will have to deal with this issue.

Another approach you may consider is to augment documents with
words of queries that users searched just before clicking those
docs. This mean updating the documents so again it should be done
carefully.

Seems I'm adding more questions than answers so I'll better
stop here...

Doron

"Lukas Vlcek" <lukas.vlcek@gmail.com> wrote on 24/10/2007 23:45:21:

> Doron,
>
> Sorry for the late reply.
> I got the inspiration for this question from book Programming Collective
> Intelligence by Toby Segaran (you can check
> here<http://radar.oreilly.com/archives/2007/08/programming_col.html>for
> some O'Reilly's comments on the book). You can find a chapter
> Searching
> and Ranking in this book , namely on the page 74 there is one
> idea that one
> can use neural network to learn and capture contribution of
> individual user
> query words to selected  (clicked) document. Once the net is trained well
> then it can be used to reorder hits.
>
> The book does not elaborate the scenario where the index is
> changed often;
> however, my impression was that this technique is common for customized
> search and that Lucene/Nutch community is aware of it.
>
> After reading the chapter again this technique can be used not just for
> customized search but rather as an overall improvement to
> search capability
> of search engine (that is the case where interactions of all users with
> search engine are stored in shared neural network which is a
> good example of
> capturing collective intelligence). This technique tends to capture
> *association* between query words and individual document.
>
> The nice thing about this approach is that the network can give you
> reasonable results even for words which haven't been used
> before (as least
> that is what the book seems to claim).
>
> Regards,
> Lukas
>
> On 10/16/07, Doron Cohen <DORONC@il.ibm.com> wrote:
> >
> > Where and how do you store this type of info:
> >   If user U1 search for query Q7 boost doc D5 by B17
> >   If user U2 search for query Q3 boost doc D15 by B2
> > Seems lots of info, and it must be persistent.
> >
> > Perhaps o.a.l.search.function can help - assuming you
> > have this info available at search time, and can use
> > it to create a ValueSource.
> >
> > Doron
> >
> > "Lukas Vlcek" <lukas.vlcek@gmail.com> wrote on 13/10/2007 08:53:47:
> >
> > > Hi,
> > >
> > > I am looking for an easy (~preferred) way of implementing
> > > customized search
> > > with Lucene. What I mean by this is changing order of returned hits
> > > according to user profile. In simple words I would like to be
> > > able to tweak
> > > order of documents in Hits collection before it is presented to
> > > the client.
> > >
> > > Example: if user A issues a query "lucene" then he/she
> tends toclick the
> > > third returned link (document #123). I would like to "boost"
> > > document #123
> > > for this user so the next time he/she issues the query
> "lucene"again the
> > > hit for the document #123 will be in the first or second
> > > position (providing
> > > that the index hasn't been modified).
> > >
> > > Looking at the Searchable<http://lucene.zones.apache.org:
> > > 8080/hudson/job/Lucene-
> > > Nightly/javadoc/org/apache/lucene/search/Searchable.html>interface
> > > it does not seem it supports this kind of flexibility. I don't
> > > want to use filter<http://lucene.zones.apache.org:
> > > 8080/hudson/job/Lucene-
> > > Nightly/javadoc/org/apache/lucene/search/Filter.html>of
> > > sort<http://lucene.zones.apache.org:8080/hudson/job/Lucene-
> > > Nightly/javadoc/org/apache/lucene/search/Sort.html>(they
> > > are not appropriate for my goal).
> > >
> > > Custom scoring<http://lucene.zones.apache.org:
> > > 8080/hudson/job/Lucene-
> > > Nightly/javadoc/org/apache/lucene/search/package-summary.html#scoring
> > > >seems
> > > to be the only option left. At the first glance it seems quite
> > > complex; moreover, I am not sure it really allows me to achieve
> > > my goal, but
> > > if there is no other way then I have to try it...
> > > Any suggestions?
> > >
> > > Regards,
> > > Lukas
> > >
> > > --
> > > http://blog.lukas-vlcek.com/
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
> http://blog.lukas-vlcek.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message