lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: Solr 6.2 Distributed joins
Date Wed, 05 Oct 2016 16:35:47 GMT
The relational algebra expressions (innerJoin, outerJoin etc...) will not
interact with Solr standard grouping or paging.

So if you want to group following a join you will need to wrap the join with
the reduce function. There is a lot to understand about how this works,
though.

There are both hashJoins and mergeJoins in streaming expressions. If it
doesn't say hash in the function name then it's a merge join.

The merge joins require the underlying streams to be sorted by the join keys.
The hashJoins do not require specific sorts.
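
To make the difference concrete, here is a minimal Python sketch of the two
join strategies. It is a simulation only, not Solr code: the function names
and the sample tuples are made up for illustration, and it uses inner-join
semantics for brevity.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner hash join (illustrative). The right side is read fully into
    memory; the left side may arrive in any order."""
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    for l in left:
        for r in table.get(l[key], []):
            yield {**r, **l}


def merge_join(left, right, key):
    """Inner merge join (illustrative). Both inputs must already be sorted
    ascending by the join key; lists are used only to keep the sketch short."""
    left, right = list(left), list(right)
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left tuple with every right tuple on this key.
            k = j
            while k < len(right) and right[k][key] == lk:
                yield {**right[k], **left[i]}
                k += 1
            i += 1


# Hypothetical sample data, sorted by the join key "id".
people = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}, {"id": 3, "name": "cy"}]
pets = [{"id": 1, "pet": "cat"}, {"id": 1, "pet": "dog"}, {"id": 3, "pet": "emu"}]

merged = list(merge_join(people, pets, "id"))
# The hash join gets its left side in reverse order and still finds every match.
hashed = list(hash_join(reversed(people), pets, "id"))
```

Both calls find the same three matches; only the order of the output differs,
because the hash join follows whatever order its left stream arrives in.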

This is an important distinction if you need to group after a join with the
reduce function, because the reduce function requires the stream to be
sorted by the group by key.
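
A small Python sketch shows why the sort matters. This is an analogy, not
the Solr implementation: reduce_stream is a made-up name, and it groups only
adjacent tuples, which is how a streaming group-by can get away with holding
one group at a time.

```python
from itertools import groupby


def reduce_stream(tuples, by):
    """Group adjacent tuples that share the same `by` value (illustrative).
    Only one group is held in memory at a time."""
    for value, group in groupby(tuples, key=lambda t: t[by]):
        yield value, list(group)


# Hypothetical tuples, not sorted by the group by key "owner".
rows = [
    {"owner": "ann", "pet": "cat"},
    {"owner": "bob", "pet": "dog"},
    {"owner": "ann", "pet": "emu"},
]

# Unsorted input: "ann" is split into two separate groups.
unsorted_groups = list(reduce_stream(rows, by="owner"))

# Sorted input: each owner appears exactly once.
sorted_groups = list(reduce_stream(sorted(rows, key=lambda t: t["owner"]), by="owner"))
```

With the unsorted input the same key comes out as two groups, which is the
failure you would see if the stream reaching reduce were not sorted by the
group by key.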

So you'll need to choose your strategy for working out the specific
relational algebra. Here are a couple of options:

1) hashJoin, followed by a reduce. In this case you would sort the left
stream by the group by keys, because the hashJoin doesn't require a sort
on the join keys.
Pseudo code: reduce(outerHashJoin(...))

2) merge join, re-sort the tuples by the group by keys and then reduce.
Pseudo code: reduce(sort(outerJoin(...))). The sort function does an
in-memory sort.
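
Here is a rough Python simulation of the two options, to show that they end
up with the same groups. None of this is Solr API: the helper names and the
sample data are invented, the hash join is assumed to emit tuples in the
order of its left stream, and the merge join assumes at most one right tuple
per key to keep the sketch short.

```python
from itertools import groupby
from operator import itemgetter


def outer_hash_join(left, right, on):
    """Left outer hash join (illustrative); output follows the left order."""
    table = {}
    for r in right:  # the right side is held in memory
        table.setdefault(r[on], []).append(r)
    for l in left:
        matches = table.get(l[on])
        if matches:
            for r in matches:
                yield {**r, **l}
        else:
            yield dict(l)


def outer_merge_join(left, right, on):
    """Left outer merge join (illustrative); inputs sorted ascending by `on`.
    Assumes at most one right tuple per key."""
    right = iter(right)
    r = next(right, None)
    for l in left:
        while r is not None and r[on] < l[on]:
            r = next(right, None)
        if r is not None and r[on] == l[on]:
            yield {**r, **l}
        else:
            yield dict(l)


def reduce_stream(tuples, by):
    """Group adjacent tuples by the `by` field (input must be sorted by it)."""
    return [(k, list(g)) for k, g in groupby(tuples, key=itemgetter(by))]


# Hypothetical data: join on custId, group by region.
orders = [
    {"custId": 2, "region": "east", "total": 5},
    {"custId": 1, "region": "west", "total": 7},
    {"custId": 3, "region": "east", "total": 9},
]
customers = [{"custId": 1, "name": "ann"}, {"custId": 2, "name": "bob"}]

# Option 1: sort the left stream by the group by key, hash join, reduce.
option1 = reduce_stream(
    outer_hash_join(sorted(orders, key=itemgetter("region")), customers, "custId"),
    by="region",
)

# Option 2: sort both sides by the join key, merge join, re-sort, reduce.
joined = outer_merge_join(
    sorted(orders, key=itemgetter("custId")),
    sorted(customers, key=itemgetter("custId")),
    "custId",
)
option2 = reduce_stream(sorted(joined, key=itemgetter("region")), by="region")
```

In option 1 the memory goes to the hashed right side; in option 2 it goes to
the re-sort of the joined tuples.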

Notice that both options require memory. The hash join reads the right side
of the query into memory. The sort function re-sorts the data in memory.
The current strategy for dealing with this is to do the joins in parallel
and spread the hashJoin or sort over a cluster of servers to take advantage
of the memory of multiple servers. Here is the pseudo code:

reduce(parallel(outerHashJoin()))

reduce(parallel(sort(outerJoin())))

Notice that the reduce is done after the parallel call. This is because the
parallel partitionKeys will need to be the join keys. In order for the
reduce to be done in parallel the partitionKeys would need to be on the
group by keys. So the reduce is done on a single node following the join.

But the heavy memory requirements come during the hashJoin or sort, while
reduce takes very little memory because it relies on the sort order.
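
The parallel pattern can be sketched in Python as well. This is a simulation
under stated assumptions, not how Solr implements parallel: workers are plain
lists, tuples are routed by a hash of the join key, each worker does a simple
in-memory inner join followed by a sort on the group by key, and the worker
outputs are merged in sorted order before a single reduce. All names and data
are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def partition(tuples, key, workers):
    """Route each tuple to a worker by hashing its join key (illustrative)."""
    shards = [[] for _ in range(workers)]
    for t in tuples:
        shards[hash(t[key]) % workers].append(t)
    return shards


def worker_join_and_sort(left, right, join_key, group_key):
    """One worker: join its shard in memory, then sort by the group by key.
    This is where the heavy memory use is spread across workers."""
    table = {}
    for r in right:
        table.setdefault(r[join_key], []).append(r)
    joined = [{**r, **l} for l in left for r in table.get(l[join_key], [])]
    return sorted(joined, key=itemgetter(group_key))


def parallel_join_then_reduce(left, right, join_key, group_key, workers=2):
    # Partition both sides on the join key so matching tuples share a worker.
    left_shards = partition(left, join_key, workers)
    right_shards = partition(right, join_key, workers)
    outputs = [
        worker_join_and_sort(l, r, join_key, group_key)
        for l, r in zip(left_shards, right_shards)
    ]
    # Merge the sorted worker outputs, then reduce once on a single node.
    merged = heapq.merge(*outputs, key=itemgetter(group_key))
    return {k: list(g) for k, g in groupby(merged, key=itemgetter(group_key))}


# Hypothetical data: join on custId, group by region.
orders = [
    {"custId": 1, "region": "west", "total": 7},
    {"custId": 2, "region": "east", "total": 5},
    {"custId": 3, "region": "east", "total": 9},
    {"custId": 4, "region": "west", "total": 1},
]
customers = [
    {"custId": 1, "name": "ann"},
    {"custId": 2, "name": "bob"},
    {"custId": 3, "name": "cy"},
    {"custId": 4, "name": "dee"},
]

groups = parallel_join_then_reduce(orders, customers, "custId", "region", workers=2)
```

A single region ends up spread across workers because the tuples were
partitioned on custId, which is why the final reduce has to run after the
worker outputs have been merged.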

Obviously this is a lot to take in. That's why the SQL interface is so
important. Currently it doesn't support joins, but when it does, it will
build the proper streaming expression to do the relational algebra.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Oct 5, 2016 at 11:44 AM, Gurdeep Singh <gurdeepgandhi@gmail.com>
wrote:

> Hello all,
>
>
> I am exploring one of the new features in Solr 6.2 (stream decorators) for
> one of my applications, and the "leftOuterJoin" is working as expected
> (joining two streams and getting the data from both the collections).
>
> I need to know if group query works with these distributed joins or not?
> (like in older versions of Solr we simply get this by adding
>  group=true&group.fieldName=<fieldname> into the query)
>
> Also we use pagination in our app so I was trying parameter start=<some
> number> in join query, but it doesn't seem to work. So how can we get
> records from a specific record number and total number of records/groups
> found in the JSON response?
>
> Any help appreciated.
>
> Thanks in advance.
>
>
>
> Best Regards,
> Gurdeep
>
> gurdeepgandhi@gmail.com
>
