lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: Solr 6 - Relational Index querying
Date Mon, 28 Dec 2015 14:11:59 GMT
I'll add one important caveat:

At this time the /export handler does not support returning scores. In
order to join result sets you would typically need to be working with the
entire result sets from both sides of the join, which may be too slow
without the /export handler. But if you're working with smaller result sets
it will be possible to use the default /select handler which will return
scores.

Adding scores to the /export handler does need to get on the roadmap. The
initial release of the Streaming API was really designed for OLAP type
queries which typically don't involve scoring.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Dec 28, 2015 at 8:49 AM, Dennis Gove <dpgove@gmail.com> wrote:

> There have been a lot of new features added to the Streaming API and the
> documentation hasn't kept pace, but it is something I'd like to have filled
> in by the release of Solr 6.
>
> With the Streaming API you can take two (or more) totally disconnected
> collections and get a result set with documents from one, both, or all of
> them. To be clear, when I say they can be totally disconnected I mean
> exactly that - the collections do not need to share any infrastructure or
> even know about each other in any way. They can exist across any number of
> data centers, use completely different Zookeeper clusters, etc. No shared
> infrastructure is necessary. Updates/Inserts/Deletes to one of the
> collections have zero impact on the other collections.
>
> In your example, with Items and FacilityItems, I'd most likely construct a
> join like this (note, I'm using Streaming Expressions but the same would
> be possible in SQL).
>
> innerJoin(
>   search(items, fl="itemId,itemDescription", q="*:*", sort="itemId asc"),
>   search(facilityItems, fl="itemId,facilityName,cost", q="*:*", sort="itemId asc"),
>   on="itemId"
> )
>
> This will return documents with the fields itemId, itemDescription,
> facilityName, and cost. Because it's an innerJoin, only documents whose
> itemId is found in both collections will be returned. If you want, you can
> do a leftOuterJoin instead to also get items which may not have
> facilityItems documents.
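
Both searches in the expression above sort by itemId, and a merge-based join relies on that shared ordering. The following is a minimal Python sketch of the general merge-join idea, not Solr's implementation. The function name and the sample rows (including the facility names "North" and "South") are hypothetical and chosen only for illustration.

```python
def merge_inner_join(left, right, key):
    """Inner-join two lists of dicts that are BOTH sorted ascending by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # left row has no partner yet, advance left
        elif lk > rk:
            j += 1          # right row has no partner, advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying the key with that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical sample data, sorted by itemId like the searches above.
items = [
    {"itemId": 1, "itemDescription": "teddy bear"},
    {"itemId": 2, "itemDescription": "toy train"},
    {"itemId": 3, "itemDescription": "bear costume"},
]
facility_items = [
    {"itemId": 1, "facilityName": "North", "cost": 9.5},
    {"itemId": 1, "facilityName": "South", "cost": 10.0},
    {"itemId": 3, "facilityName": "North", "cost": 25.0},
]

joined = merge_inner_join(items, facility_items, "itemId")
```

Item 2 has no facility rows, so it is absent from `joined`. Because each input is walked once in order, neither side has to be held in memory in full, which is why the approach depends on both inputs being sorted on the join key.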
>
> Regarding the use of boosting - I'll assume that's because you're returning
> results in score order. I can't remember the syntax to use in the
> search(...) clause to tell it to sort by score but for the sake of
> discussion let's assume that sort="score desc" would do that (i.e., highest
> score first). This poses a problem for the innerJoin: because it is a
> merge-based join, it expects the two incoming streams to be sorted by
> the same fields, and with a score sort that isn't possible. However, we can
> instead use a hash-based join to get around this limitation.
>
> hashJoin(
>   search(items, fl="itemId,itemDescription", q="itemDescription:bear", sort="score desc"),
>   hashed = search(facilityItems, fl="itemId,facilityName,cost", q="*:*", sort="itemId asc"),
>   on="itemId"
> )
>
> Note that in this I've changed the first search clause by adding a q clause
> to find all where the description includes "bear" and to sort by the score.
> I've also marked the second search clause as the one that should be hashed.
> The stream that is marked to be hashed will be read in full and all
> documents stored in memory - for this reason you'll almost always want to
> hash the one with the fewest documents in it but do be aware that the order
> of the results will depend on the order of the non-hashed stream. For this
> reason I've hashed the one whose order we don't necessarily care about and
> am preserving the ordering by score.
>
> This will return the exact same documents but the order will now be by the
> score of the match found in the search over the items collection.
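
The hash-based approach described above can be sketched in Python as follows. This illustrates the general technique under stated assumptions and is not Solr's code: the function name, the score values, and the sample rows are hypothetical. The hashed side is read in full into an in-memory table, and the output follows the order of the non-hashed side.

```python
from collections import defaultdict


def hash_inner_join(streamed, hashed, key):
    """Inner-join `streamed` against `hashed`, keeping the order of `streamed`."""
    # Read the hashed side completely and index it by the join key.
    table = defaultdict(list)
    for row in hashed:
        table[row[key]].append(row)
    # Walk the other side once; its order (e.g. by score) is preserved.
    for row in streamed:
        for match in table.get(row[key], ()):
            yield {**row, **match}


# Hypothetical rows matching "bear", already ordered by score descending.
items_by_score = [
    {"itemId": 3, "itemDescription": "bear costume", "score": 2.0},
    {"itemId": 1, "itemDescription": "teddy bear", "score": 1.2},
]
# Hypothetical facility rows; their order does not affect the result order.
facility_items = [
    {"itemId": 1, "facilityName": "North", "cost": 9.5},
    {"itemId": 1, "facilityName": "South", "cost": 10.0},
    {"itemId": 3, "facilityName": "North", "cost": 25.0},
]

result = list(hash_inner_join(items_by_score, facility_items, "itemId"))
```

Here `result` lists the rows for item 3 before those for item 1, mirroring the score order of the non-hashed input. The memory cost is proportional to the hashed side, which is the reason for hashing the smaller stream when ordering allows it.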
>
> - Dennis
>
> On Wed, Dec 23, 2015 at 10:43 PM, Troy Edwards <tedwards415107@gmail.com>
> wrote:
>
> > In Solr 5.1.0 we had to flatten out two collections into one
> >
> > Item - about 1.5 million items with primary key - ItemId (this mainly
> > contains item description)
> >
> > FacilityItem - about 10,000 facilities - primary key - FacilityItemId
> > (pricing information for each facility) - ItemId points to Item
> >
> > We are currently using this index for only about 200 facilities. We are
> > using edismax parser to query and boost results
> >
> > I am hoping that in Solr 6 with Parallel SQL or stream innerJoin we can use
> > two collections so that it will be helpful in doing updates.
> >
> > But so far I have not seen something that will exactly fit what we need.
> >
> > Any thoughts/suggestions on what documentation to read or any samples on
> > how to approach what we are trying to achieve?
> >
> > Thanks
> >
>
