lucene-solr-user mailing list archives

From Peter Keegan <peterlkee...@gmail.com>
Subject Re: Cross index join query performance
Date Mon, 30 Sep 2013 17:00:23 GMT
Ah, got it now - thanks for the explanation.


On Sat, Sep 28, 2013 at 3:33 AM, Upayavira <uv@odoko.co.uk> wrote:

> The thing here is to understand how a join works.
>
> Effectively, it does the inner query first, which results in a list of
> terms. It then effectively does a multi-term query with those values.
>
> q=size:large {!join fromIndex=other from=someid
> to=someotherid}type:shirt
>
> Imagine the inner query returned values A,B,C. Your inner query is, on
> core 'other', q=type:shirt&fl=someid.
>
> Then your outer query becomes size:large someotherid:(A B C)
>
> Your inner query returns 25k values. You're having to do a multi-term
> query for 25k terms. That is *bound* to be slow.
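
To make the two steps above concrete, here is a minimal conceptual sketch in Python. It is not Solr's implementation: plain dicts and sets stand in for the two cores, and `build_postings` and `pseudo_join` are hypothetical helper names. The field names `someid`, `someotherid`, `type` and `size` come from the example query above. The join result is treated as a filter on the outer query. The point to notice is the loop that does one lookup per term returned by the inner query.

```python
from collections import defaultdict


def build_postings(docs, field):
    """Map each value of `field` to the set of doc ids containing it (a toy inverted index)."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        postings[doc[field]].add(doc_id)
    return postings


def pseudo_join(from_docs, inner_pred, from_field, to_postings):
    """Step 1: collect join-key terms from the inner query.
    Step 2: look up every term on the 'to' side and union the matches."""
    terms = {doc[from_field] for doc in from_docs if inner_pred(doc)}
    matched = set()
    for term in terms:  # one lookup per term: 25k terms means 25k lookups
        matched |= to_postings.get(term, set())
    return terms, matched


# Toy data standing in for core 'other' and the main core (made up for illustration).
other = [
    {"someid": "A", "type": "shirt"},
    {"someid": "B", "type": "shirt"},
    {"someid": "D", "type": "hat"},
]
main = [
    {"someotherid": "A", "size": "large"},
    {"someotherid": "B", "size": "small"},
    {"someotherid": "D", "size": "large"},
]

to_postings = build_postings(main, "someotherid")
terms, matched = pseudo_join(other, lambda d: d["type"] == "shirt", "someid", to_postings)

# Apply the outer clause (size:large) to the joined doc ids.
result = [i for i in sorted(matched) if main[i]["size"] == "large"]
```

The work in step 2 grows with the number of terms the inner query returns, which is why a 25k-term inner result is expensive even though each individual query is fast.
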
>
> The pseudo-joins in Solr 4.x are intended for a small to medium number
> of values returned by the inner query, otherwise performance degrades as
> you are seeing.
>
> Is there a way you can reduce the number of values returned by the inner
> query?
>
> As Joel mentions, those other joins are attempts to find other ways to
> work with this limitation.
>
> Upayavira
>
> On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote:
> > Hi Joel,
> >
> > I tried this patch and it is quite a bit faster. Using the same query on a
> > larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
> > QTime was 100 msec! This was true for large and small result sets.
> >
> > A few notes: the patch didn't compile with 4.3 because of the
> > SolrCore.getLatestSchema call (which I worked around), and the package name
> > should be:
> > <queryParser name="hjoin"
> > class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
> >
> > Unfortunately, I just learned that our uniqueKey may have to be an
> > alphanumeric string instead of an int, so I'm not out of the woods yet.
> >
> > Good stuff - thanks.
> >
> > Peter
> >
> >
> > On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein <joelsolr@gmail.com>
> > wrote:
> >
> > > It looks like you are using int join keys so you may want to check out
> > > SOLR-4787, specifically the hjoin and bjoin.
> > >
> > > These perform well when you have a large number of results from the
> > > fromIndex. If you have a small number of results in the fromIndex the
> > > standard join will be faster.
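
A rough sketch of why a hash-based join could behave this way. This is an assumption based only on the class name HashSetJoinQParserPlugin mentioned elsewhere in this thread, not on the SOLR-4787 patch itself; `hash_set_join` and its arguments are hypothetical. The idea is to put the int join keys from the fromIndex results into a set, then make a single pass over the join-key values on the "to" side and test membership.

```python
def hash_set_join(from_keys, to_keys_by_doc):
    """Return ids of 'to' docs whose join key appears among the 'from' keys.

    from_keys: iterable of int join keys produced by the fromIndex query.
    to_keys_by_doc: list where position i holds the join key of 'to' doc i.
    """
    key_set = set(from_keys)  # built once, O(len(from_keys))
    # One pass over the 'to' side with a constant-time membership test per doc.
    return [doc_id for doc_id, key in enumerate(to_keys_by_doc) if key in key_set]


# Made-up keys for illustration.
hits = hash_set_join([3, 5, 9], [1, 3, 4, 5, 7])
```

Under this assumption the cost is dominated by the scan over the "to" side and does not depend much on how many keys the fromIndex returns. That would fit the observation that it wins for large from-sets, while a per-term lookup is cheaper when there are only a few terms.
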
> > >
> > >
> > > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan <peterlkeegan@gmail.com> wrote:
> > >
> > > > I forgot to mention - this is Solr 4.3
> > > >
> > > > Peter
> > > >
> > > >
> > > >
> > > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan <peterlkeegan@gmail.com> wrote:
> > > >
> > > > > > I'm doing a cross-core join query and the join query is 30X slower than
> > > > > > each of the 2 individual queries. Here are the queries:
> > > > >
> > > > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> > > > > QTime: 5 msec
> > > > > hit count: 1000
> > > > >
> > > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
> > > > > QTime: 4 msec
> > > > > hit count: 25K
> > > > >
> > > > > Join query:
> > > > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
> > > > > QTime: 160 msec
> > > > > hit count: 205
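
The join URL is easy to mangle by hand because of the spaces, braces and brackets in the local params. A minimal sketch of building it with proper encoding, using only the Python standard library, is below. `build_join_url` is a hypothetical helper, and the `fq` value is the form that Solr echoes back in the debugQuery output quoted in this message.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_join_url(base, q, fq, debug=False):
    """Build a Solr select URL with the q and fq parameters percent-encoded."""
    params = {"q": q, "fq": fq}
    if debug:
        params["debugQuery"] = "true"
    return base + "?" + urlencode(params)


url = build_join_url(
    "http://localhost:8983/solr/mainindex/select",
    q="title:java",
    fq="{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]",
    debug=True,
)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(url).query)
```

Passing `debug=True` adds debugQuery=true, which is what produces the "join" timing block shown further down.
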
> > > > >
> > > > > > Here are the index specs:
> > > > >
> > > > > mainindex size: 117K docs, 1 segment
> > > > > mainindex schema:
> > > > >    <field name="docid" type="int" indexed="true" stored="true"
> > > > > required="true" multiValued="false" />
> > > > >    <field name="title" type="text_en_splitting" indexed="true"
> > > > > stored="true" multiValued="false" />
> > > > >    <uniqueKey>docid</uniqueKey>
> > > > >
> > > > > subindex size: 117K docs, 1 segment
> > > > > subindex schema:
> > > > >    <field name="docid" type="int" indexed="true" stored="true"
> > > > > required="true" multiValued="false" />
> > > > >    <field name="fld1" type="float" indexed="true" stored="true"
> > > > > required="false" multiValued="false" />
> > > > >    <uniqueKey>docid</uniqueKey>
> > > > >
> > > > > With debugQuery=true I see:
> > > > >   "debug":{
> > > > >     "join":{
> > > > > >       "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
> > > > >         "time":155,
> > > > >         "fromSetSize":24742,
> > > > >         "toSetSize":24742,
> > > > >         "fromTermCount":117810,
> > > > >         "fromTermTotalDf":117810,
> > > > >         "fromTermDirectCount":117810,
> > > > >         "fromTermHits":24742,
> > > > >         "fromTermHitsTotalDf":24742,
> > > > >         "toTermHits":24742,
> > > > >         "toTermHitsTotalDf":24742,
> > > > >         "toTermDirectCount":24627,
> > > > >         "smallSetsDeferred":115,
> > > > >         "toSetDocsAdded":24742}},
> > > > >
> > > > > Via profiler and debugger, I see 150 msec spent in the outer
> > > > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
> > > > > > lot of time to join the bitsets. Does this seem right?
> > > > >
> > > > > Peter
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Joel Bernstein
> > > Professional Services LucidWorks
> > >
>
