lucene-solr-user mailing list archives

From Carlos Gonzalez-Cadenas <...@experienceon.com>
Subject Re: custom scoring
Date Mon, 20 Feb 2012 14:37:46 GMT
Hi Em:

The HTTP request is not going to help you much, because we use a custom
QParser (which builds the query that I pasted before). In any case, here
it is:

http://localhost:8080/solr/core0/select?shards=…(shards here)…&indent=on&wt=exon&timeAllowed=50&fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlighting&start=0&rows=16&limit=20&q=%7B!exonautocomplete%7Dhoteles
<http://localhost:8080/solr/core0/select?shards=exp302%3A8983%2Fsolr%2Fcore0%2Cexp302%3A8983%2Fsolr%2Fcore1%2Cexp302%3A8983%2Fsolr%2Fcore2%2Cexp302%3A8983%2Fsolr%2Fcore3%2Cexp302%3A8983%2Fsolr%2Fcore4%2Cexp302%3A8983%2Fsolr%2Fcore5%2Cexp302%3A8983%2Fsolr%2Fcore6%2Cexp302%3A8983%2Fsolr%2Fcore7%2Cexp302%3A8983%2Fsolr%2Fcore8%2Cexp302%3A8983%2Fsolr%2Fcore9%2Cexp302%3A8983%2Fsolr%2Fcore10%2Cexp302%3A8983%2Fsolr%2Fcore11&sort=score%20desc%2C%20query_score%20desc&indent=on&wt=exon&timeAllowed=50&fl=resulting_phrase%2Cquery_id%2Ctype%2Chighlighting&start=0&vrows=4&rows=16&limit=20&q=%7B!exonautocomplete%7DBARCELONA&gyvl7cn3>

We're implementing a query autocomplete system, therefore our Lucene
documents are queries. "query_score" is a field that is indexed and stored
with every document. It expresses how popular a given query is (i.e. common
queries like "hotels in barcelona" have a bigger query_score than less
common queries like "hotels in barcelona near the beach").
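
For illustration, here is a minimal sketch of how the two numbers are combined, mirroring the product(pow(query(...),const(0.5)),float(query_score)) expression quoted further down in this thread. The Suggestion and ScoreSketch classes and the sample values are hypothetical; they are not part of Lucene/SOLR or of the custom QParser, and the sketch only shows the arithmetic, not how the scorer is wired in.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one autocomplete suggestion; not a Lucene class.
final class Suggestion {
    final String phrase;
    final double irScore;    // relevance score of the text match (assumed value)
    final double queryScore; // popularity, as stored in the "query_score" field

    Suggestion(String phrase, double irScore, double queryScore) {
        this.phrase = phrase;
        this.irScore = irScore;
        this.queryScore = queryScore;
    }

    // sqrt(IR score) * query_score, i.e. product(pow(ir, 0.5), query_score).
    double combinedScore() {
        return Math.sqrt(irScore) * queryScore;
    }
}

final class ScoreSketch {
    // Returns a copy of the input ordered by combined score, highest first.
    static List<Suggestion> rank(List<Suggestion> suggestions) {
        List<Suggestion> ranked = new ArrayList<>(suggestions);
        ranked.sort(Comparator.comparingDouble(Suggestion::combinedScore).reversed());
        return ranked;
    }
}
```

With these assumed numbers, a popular query with a weaker text match can still outrank a rare query with a stronger one, which is the effect the square root is meant to allow.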

Let me know if you need something else.

Thanks,
Carlos





Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Mon, Feb 20, 2012 at 3:12 PM, Em <mailformailinglists@yahoo.de> wrote:

> Could you please provide me with the original request (the HTTP request)?
> I am a little bit confused as to what "query_score" refers to.
> As far as I can see, it isn't a magic value.
>
> Kind regards,
> Em
>
> On 20.02.2012 14:05, Carlos Gonzalez-Cadenas wrote:
> > Yeah Em, it helped a lot :)
> >
> > Here it is (for the user query "hoteles"):
> >
> > *+(stopword_shortened_phrase:hoteles | stopword_phrase:hoteles |
> > wildcard_stopword_shortened_phrase:hoteles |
> > wildcard_stopword_phrase:hoteles) *
> >
> > *product(pow(query((stopword_shortened_phrase:hoteles |
> > stopword_phrase:hoteles | wildcard_stopword_shortened_phrase:hoteles |
> > wildcard_stopword_phrase:hoteles),def=0.0),const(0.5)),float(query_score))*
> >
> > Thanks a lot for your help.
> >
> > Carlos
> >
> >
> > On Mon, Feb 20, 2012 at 1:50 PM, Em <mailformailinglists@yahoo.de> wrote:
> >
> >> Carlos,
> >>
> >> nice to hear that the approach helped you!
> >>
> >> Could you show us what your query request looks like after reworking?
> >>
> >> Regards,
> >> Em
> >>
> >> On 20.02.2012 13:30, Carlos Gonzalez-Cadenas wrote:
> >>> Hello all:
> >>>
> >>> We've done some tests with Em's approach of putting a BooleanQuery in
> >>> front of our user query, that is:
> >>>
> >>> BooleanQuery
> >>>     must (DismaxQuery)
> >>>     should (FunctionQuery)
> >>>
> >>> The FunctionQuery obtains the SOLR IR score by means of a
> >>> QueryValueSource, takes the square root of this value, and then
> >>> multiplies it by our custom "query_score" float, which it reads by
> >>> means of a FieldCacheSource.
> >>>
> >>> In particular, we've proceeded in the following way:
> >>>
> >>>    - we've loaded the whole index into the OS page cache to make sure
> >>>    we don't have disk IO problems that might affect the benchmarks (our
> >>>    machine has enough memory to hold the whole index in RAM)
> >>>    - we've executed an out-of-benchmark query 10-20 times to make sure
> >>>    that everything is JIT-compiled and that Lucene's FieldCache is
> >>>    properly populated
> >>>    - we've disabled all the caches (filter query cache, document cache,
> >>>    query cache)
> >>>    - we've executed 8 different user queries with and without
> >>>    FunctionQueries, with early termination in both cases (our collector
> >>>    stops after collecting 50 documents per shard)
> >>>
> >>> Em was correct, the query is much faster with the BooleanQuery in
> >>> front, but it's still 30-40% slower than the query without
> >>> FunctionQueries.
> >>>
> >>> Although one may think it's reasonable that the query response time
> >>> increases because of the extra computations, we believe that the
> >>> increase is too big, given that we're collecting just 500-600 documents
> >>> due to the early query termination techniques we currently use.
> >>>
> >>> Any ideas on how to make it faster?
> >>>
> >>> Thanks a lot,
> >>> Carlos
> >>>
> >>>
> >>>
> >>> On Fri, Feb 17, 2012 at 11:07 AM, Carlos Gonzalez-Cadenas <
> >>> cgc@experienceon.com> wrote:
> >>>
> >>>> Thanks Em, Robert, Chris for your time and valuable advice. We'll make
> >>>> some tests and will let you know soon.
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Feb 16, 2012 at 11:43 PM, Em <mailformailinglists@yahoo.de> wrote:
> >>>>
> >>>>> Hello Carlos,
> >>>>>
> >>>>> I think we misunderstood each other.
> >>>>>
> >>>>> As an example:
> >>>>> BooleanQuery (
> >>>>>  clauses: (
> >>>>>     MustMatch(
> >>>>>               DisjunctionMaxQuery(
> >>>>>                   TermQuery("stopword_field", "barcelona"),
> >>>>>                   TermQuery("stopword_field", "hoteles")
> >>>>>               )
> >>>>>     ),
> >>>>>     ShouldMatch(
> >>>>>                  FunctionQuery(
> >>>>>                    *please insert your function here*
> >>>>>                 )
> >>>>>     )
> >>>>>  )
> >>>>> )
> >>>>>
> >>>>> Explanation:
> >>>>> You construct an artificial BooleanQuery which wraps your user's
> >>>>> query as well as your function query.
> >>>>> Your user's query - in that case - is just a DisjunctionMaxQuery
> >>>>> consisting of two TermQueries.
> >>>>> In the real world you might construct another BooleanQuery around
> >>>>> your DisjunctionMaxQuery in order to have more flexibility.
> >>>>> However, the interesting part of the given example is that we specify
> >>>>> the user's query as a MustMatch condition of the BooleanQuery and the
> >>>>> FunctionQuery just as a ShouldMatch.
> >>>>> Constructed that way, I expect the FunctionQuery to score only those
> >>>>> documents which fit the MustMatch condition.
> >>>>>
> >>>>> I conclude that from the fact that the FunctionQuery class also has a
> >>>>> skipTo method, and I would expect the scorer to use it to score
> >>>>> only matching documents (however, I did not search where and how it
> >>>>> might get called).
> >>>>>
> >>>>> If my conclusion is wrong, then hopefully Robert Muir (as far as I can
> >>>>> see the author of that class) can tell us what the intention was
> >>>>> in constructing an every-time-match-all function query.
> >>>>>
> >>>>> Can you validate whether your QueryParser constructs a query in the
> >>>>> form I drew above?
> >>>>>
> >>>>> Regards,
> >>>>> Em
> >>>>>
> >>>>> On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
> >>>>>> Hello Em:
> >>>>>>
> >>>>>> 1) Here's a printout of an example DisMax query (as you can see, mostly
> >>>>>> MUST terms, except for some SHOULD terms used for boosting scores for
> >>>>>> stopwords):
> >>>>>>
> >>>>>> *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
> >>>>>> | (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>>>> | (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en)
> >>>>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en)
> >>>>>> | (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> >>>>>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))*
> >>>>>>
> >>>>>> 2) The collector is inserted in the SolrIndexSearcher (replacing the
> >>>>>> TimeLimitingCollector). We trigger it through the SOLR interface by
> >>>>>> passing the timeAllowed parameter. We know this is a hack but AFAIK
> >>>>>> there's no out-of-the-box way to specify custom collectors by now
> >>>>>> (https://issues.apache.org/jira/browse/SOLR-1680). In any case the
> >>>>>> collector part works perfectly as of now, so clearly this is not the
> >>>>>> problem.
> >>>>>>
> >>>>>> 3) Re: your sentence:
> >>>>>>
> >>>>>> *I* would expect that with a shrinking set of matching documents to
> >>>>>> the overall-query, the function query only checks those documents
> >>>>>> that are guaranteed to be within the result set.
> >>>>>>
> >>>>>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
> >>>>>> seems to say otherwise:
> >>>>>>
> >>>>>>     // instead of matching all docs, we could also embed a query.
> >>>>>>     // the score could either ignore the subscore, or boost it.
> >>>>>>     // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> >>>>>>     // Boost:        foo:myTerm^floatline("myFloatField",1.0,0.0f)
> >>>>>>     @Override
> >>>>>>     public int nextDoc() throws IOException {
> >>>>>>       for(;;) {
> >>>>>>         ++doc;
> >>>>>>         if (doc>=maxDoc) {
> >>>>>>           return doc=NO_MORE_DOCS;
> >>>>>>         }
> >>>>>>         if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> >>>>>>         return doc;
> >>>>>>       }
> >>>>>>     }
> >>>>>>
> >>>>>> It seems that the author also thought of maybe embedding a query in
> >>>>>> order to restrict matches, but this doesn't seem to be in place as of
> >>>>>> now (or maybe I'm not understanding how the whole thing works :) ).
> >>>>>>
> >>>>>> Thanks
> >>>>>> Carlos
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Feb 16, 2012 at 8:09 PM, Em <mailformailinglists@yahoo.de> wrote:
> >>>>>>
> >>>>>>> Hello Carlos,
> >>>>>>>
> >>>>>>>> We have some more tests on that matter: now we're moving from
> >>>>>>>> issuing this large query through the SOLR interface to creating our
> >>>>>>>> own QueryParser. The initial tests we've done in our QParser (that
> >>>>>>>> internally creates multiple queries and inserts them inside a
> >>>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
> >>>>>>>> times and high quality answers. But when we've tried to wrap the
> >>>>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
> >>>>>>>> from 10-20 msec to 200-300 msec.
> >>>>>>> I reviewed the source code and yes, the FunctionQuery iterates over
> >>>>>>> the whole index, however... let's see!
> >>>>>>>
> >>>>>>> In relation to the DisMaxQuery you create within your parser: What
> >>>>>>> kind of clause is the FunctionQuery and what kind of clause are your
> >>>>>>> other queries (MUST, SHOULD, MUST_NOT...)?
> >>>>>>>
> >>>>>>> *I* would expect that with a shrinking set of matching documents to
> >>>>>>> the overall-query, the function query only checks those documents
> >>>>>>> that are guaranteed to be within the result set.
> >>>>>>>
> >>>>>>>> Note that we're using early termination of queries (via a custom
> >>>>>>>> collector), and therefore (as shown by the numbers I included above)
> >>>>>>>> even if the query is very complex, we're getting very fast answers.
> >>>>>>>> The only situation where the response time explodes is when we
> >>>>>>>> include a FunctionQuery.
> >>>>>>> Could you give us some details about how and where you plugged in the
> >>>>>>> Collector, please?
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Em
> >>>>>>>
> >>>>>>> On 16.02.2012 19:41, Carlos Gonzalez-Cadenas wrote:
> >>>>>>>> Hello Em:
> >>>>>>>>
> >>>>>>>> Thanks for your answer.
> >>>>>>>>
> >>>>>>>> Yes, we initially also thought that the excessive increase in
> >>>>>>>> response time was caused by the several queries being executed, and
> >>>>>>>> we did another test. We executed one of the subqueries that I've
> >>>>>>>> shown to you directly in the "q" parameter and then we tested this
> >>>>>>>> same subquery (only this one, without the others) with the function
> >>>>>>>> query "query($q1)" in the "q" parameter.
> >>>>>>>>
> >>>>>>>> Theoretically the times for these two queries should be more or less
> >>>>>>>> the same, but the second one is several times slower than the first
> >>>>>>>> one. After this observation we learned more about function queries,
> >>>>>>>> and we learned from the code and from some comments in the forums [1]
> >>>>>>>> that FunctionQueries are expected to match all documents.
> >>>>>>>>
> >>>>>>>> We have some more tests on that matter: now we're moving from
> >>>>>>>> issuing this large query through the SOLR interface to creating our
> >>>>>>>> own QueryParser. The initial tests we've done in our QParser (that
> >>>>>>>> internally creates multiple queries and inserts them inside a
> >>>>>>>> DisjunctionMaxQuery) are very good, we're getting very good response
> >>>>>>>> times and high quality answers. But when we've tried to wrap the
> >>>>>>>> DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> >>>>>>>> QueryValueSource that wraps the DisMaxQuery), then the times move
> >>>>>>>> from 10-20 msec to 200-300 msec.
> >>>>>>>>
> >>>>>>>> Note that we're using early termination of queries (via a custom
> >>>>>>>> collector), and therefore (as shown by the numbers I included above)
> >>>>>>>> even if the query is very complex, we're getting very fast answers.
> >>>>>>>> The only situation where the response time explodes is when we
> >>>>>>>> include a FunctionQuery.
> >>>>>>>>
> >>>>>>>> Re: your question of what we're trying to achieve ... We're
> >>>>>>>> implementing a powerful query autocomplete system, and we use several
> >>>>>>>> fields to a) improve performance on wildcard queries and b) have very
> >>>>>>>> precise control over the score.
> >>>>>>>>
> >>>>>>>> Thanks a lot for your help,
> >>>>>>>> Carlos
> >>>>>>>>
> >>>>>>>> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Feb 16, 2012 at 7:09 PM, Em <mailformailinglists@yahoo.de> wrote:
> >>>>>>>>
> >>>>>>>>> Hello Carlos,
> >>>>>>>>>
> >>>>>>>>> well, you must take into account that you are executing up to 8
> >>>>>>>>> queries per request instead of one query per request.
> >>>>>>>>>
> >>>>>>>>> I am not totally sure about the details of the implementation of the
> >>>>>>>>> max-function-query, but I guess it first iterates over the results of
> >>>>>>>>> the first max-query, afterwards over the results of the second
> >>>>>>>>> max-query and so on. This is a much higher complexity than in the
> >>>>>>>>> case of a normal query.
> >>>>>>>>>
> >>>>>>>>> I would suggest that you optimize your request. I don't think that
> >>>>>>>>> this particular function query is matching *all* docs. Instead I
> >>>>>>>>> think it just matches those docs specified by your inner-query
> >>>>>>>>> (although I might be wrong about that).
> >>>>>>>>>
> >>>>>>>>> What are you trying to achieve by your request?
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Em
> >>>>>>>>>
> >>>>>>>>> On 16.02.2012 16:24, Carlos Gonzalez-Cadenas wrote:
> >>>>>>>>>> Hello Em:
> >>>>>>>>>>
> >>>>>>>>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste
> >>>>>>>>>> the relevant parts.
> >>>>>>>>>>
> >>>>>>>>>> Our "q" parameter is:
> >>>>>>>>>>
> >>>>>>>>>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)))))\"",
> >>>>>>>>>>
> >>>>>>>>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >>>>>>>>>>
> >>>>>>>>>> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> >>>>>>>>>> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> >>>>>>>>>> (stopword_phrase:las AND stopword_phrase:de)"
> >>>>>>>>>>
> >>>>>>>>>> We've executed the subqueries q3-q8 independently and they're very
> >>>>>>>>>> fast, but when we introduce the function queries as described below,
> >>>>>>>>>> it all goes 10X slower.
> >>>>>>>>>>
> >>>>>>>>>> Let me know if you need anything else.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Carlos
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 16, 2012 at 4:02 PM, Em <mailformailinglists@yahoo.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello Carlos,
> >>>>>>>>>>>
> >>>>>>>>>>> could you show us what your Solr call looks like?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Em
> >>>>>>>>>>>
> >>>>>>>>>>> On 16.02.2012 14:34, Carlos Gonzalez-Cadenas wrote:
> >>>>>>>>>>>> Hello all:
> >>>>>>>>>>>>
> >>>>>>>>>>>> We'd like to score the matching documents using a combination of
> >>>>>>>>>>>> SOLR's IR score with another application-specific score that we
> >>>>>>>>>>>> store within the documents themselves (i.e. a float field
> >>>>>>>>>>>> containing the app-specific score). In particular, we'd like to
> >>>>>>>>>>>> calculate the final score by doing some operations with both
> >>>>>>>>>>>> numbers (i.e. product, sqrt, ...).
> >>>>>>>>>>>>
> >>>>>>>>>>>> According to what we know, there are two ways to do this in SOLR:
> >>>>>>>>>>>>
> >>>>>>>>>>>> A) Sort by function [1]: We've tested an expression like
> >>>>>>>>>>>> "sort=product(score, query_score)" in the SOLR query, where score
> >>>>>>>>>>>> is the common SOLR IR score and query_score is our own
> >>>>>>>>>>>> precalculated score, but it seems that SOLR can only do this with
> >>>>>>>>>>>> stored/indexed fields (and obviously "score" is not
> >>>>>>>>>>>> stored/indexed).
> >>>>>>>>>>>>
> >>>>>>>>>>>> B) Function queries: We've used _val_ and function queries like
> >>>>>>>>>>>> max, sqrt and query, and we've obtained the desired results from a
> >>>>>>>>>>>> functional point of view. However, our index is quite large (400M
> >>>>>>>>>>>> documents) and the performance degrades heavily, given that
> >>>>>>>>>>>> function queries are AFAIK matching all the documents.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have two questions:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) Apart from the two options I mentioned, is there any other
> >>>>>>>>>>>> (simple) way to achieve this that we're not aware of?
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2) If we have to choose the function queries path, would it be very
> >>>>>>>>>>>> difficult to modify the current implementation so that it doesn't
> >>>>>>>>>>>> match all the documents, that is, to pass a query so that it only
> >>>>>>>>>>>> operates over the documents matching the query? Looking at the
> >>>>>>>>>>>> FunctionQuery.java source code, there's a comment that says "//
> >>>>>>>>>>>> instead of matching all docs, we could also embed a query. the
> >>>>>>>>>>>> score could either ignore the subscore, or boost it", which is
> >>>>>>>>>>>> giving us some hope that maybe it's possible and even desirable to
> >>>>>>>>>>>> go in this direction. If you can give us some directions about how
> >>>>>>>>>>>> to go about this, we may be able to do the actual implementation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> BTW, we're using Lucene/SOLR trunk.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for your help.
> >>>>>>>>>>>> Carlos
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>
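
As a rough illustration of the effect discussed in the quoted thread (a match-all function scorer enumerating the whole index when used on its own, versus being advanced only to the documents matched by a required clause), here is a toy model. It is a simplified sketch, not Lucene code: MatchAllScorer and IterationSketch are hypothetical names, and real Lucene scorers are more involved. The assumption is that the required clause drives iteration and the optional function clause is only positioned on documents the required clause has already matched.

```java
// Toy model of a match-all scorer; a simplified stand-in, not Lucene's scorer.
final class MatchAllScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;
    private int doc = -1;
    int visited = 0; // how many documents this scorer was positioned on

    MatchAllScorer(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // Steps to the next document, as in the nextDoc() snippet quoted above.
    int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc;
        }
        return advance(doc + 1);
    }

    // Jumps directly to the given target document (skipTo/advance style).
    int advance(int target) {
        if (target >= maxDoc) {
            return doc = NO_MORE_DOCS;
        }
        visited++;
        return doc = target;
    }
}

final class IterationSketch {
    // Function scorer used on its own: it enumerates every document.
    static int visitsStandalone(int maxDoc) {
        MatchAllScorer fn = new MatchAllScorer(maxDoc);
        while (fn.nextDoc() != MatchAllScorer.NO_MORE_DOCS) {
            // scoring of each document would happen here
        }
        return fn.visited;
    }

    // Required clause drives iteration (doc ids sorted ascending); the
    // function scorer is only advanced to documents the required clause hit.
    static int visitsAsOptionalClause(int maxDoc, int[] requiredDocs) {
        MatchAllScorer fn = new MatchAllScorer(maxDoc);
        for (int d : requiredDocs) {
            if (fn.advance(d) == MatchAllScorer.NO_MORE_DOCS) {
                break;
            }
        }
        return fn.visited;
    }
}
```

In this model the standalone scorer touches every document up to maxDoc, while the optional-clause variant touches only as many documents as the required clause matches. The model does not attempt to explain the remaining 30-40% overhead reported in the thread.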
