lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: Solr best practices for many to many relations...
Date Fri, 15 Apr 2016 14:10:25 GMT
You may also want to keep an eye on SOLR-8925 which supports distributed,
cross collection graph traversals. This may be useful in traversing the
relationships.

Joel Bernstein
http://joelsolr.blogspot.com/
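
For readers of the archive, the distributed join described in the quoted replies below could be sketched roughly as follows. This only builds the Streaming Expression and the request URL; nothing is sent to Solr. The collection names (`articles`, `journals`), the field names, the host, and the helper functions are assumptions made for illustration, loosely based on the tables mentioned in the thread. Both inputs are sorted on the join key, and the `/export` handler generally needs docValues on the exported fields.

```python
from urllib.parse import urlencode


def inner_join_expr(left, right, key, left_fl, right_fl):
    """Build an innerJoin streaming expression over two collections.

    All names passed in are hypothetical; adapt them to your schema.
    """
    def search(collection, fl):
        # Each side is sorted on the join key, and qt="/export"
        # asks Solr to stream the full result set.
        return (f'search({collection}, q="*:*", fl="{fl}", '
                f'sort="{key} asc", qt="/export")')

    return (f'innerJoin({search(left, left_fl)}, '
            f'{search(right, right_fl)}, on="{key}")')


def stream_url(base_url, collection, expr):
    """Return the URL that would submit expr to the /stream handler."""
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


expr = inner_join_expr("articles", "journals", "journal_id",
                       "id,title,journal_id", "journal_id,j_name")
url = stream_url("http://localhost:8983/solr", "articles", expr)
print(url)
```

Whether a join like this is fast enough depends on shard count and target QPS, as noted in the thread, so it is worth testing against real data before relying on it.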

On Fri, Apr 15, 2016 at 9:56 AM, Joel Bernstein <joelsolr@gmail.com> wrote:

> Solr now has full distributed join capabilities as part of the Streaming
> Expression library. Keep in mind that these are distributed joins so they
> shuffle records to worker nodes to perform the joins. These are comparable
> to joins done by SQL over MapReduce systems, but they are very responsive
> and can respond with sub-second response time for fairly large joins in
> parallel mode. But these joins do lend themselves to large distributed
> architectures (lots of shards and replicas). Target QPS also needs to be
> taken into account and tested in deciding whether these joins will meet the
> specific use case.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 15, 2016 at 9:17 AM, Dennis Gove <dpgove@gmail.com> wrote:
>
>> The Streaming API with Streaming Expressions (or Parallel SQL if you want
>> to use SQL) can give you the functionality you're looking for. See
>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>> and
>> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface.
>> SQL queries coming in through the Parallel SQL Interface are translated
>> down into Streaming Expressions - if you need to do something that SQL
>> doesn't yet support you should check out the Streaming Expressions to see
>> if it can support it.
>>
>> With these you could store your data in separate collections (or the same
>> collection with different docType field values) and then during search
>> perform a join (inner, outer, hash) across the collections. You could, if
>> you wanted, even join with data NOT in solr using the jdbc streaming
>> function.
>>
>> - Dennis Gove
>>
>>
>> On Fri, Apr 15, 2016 at 3:21 AM, Bastien Latard - MDPI AG <
>> latard@mdpi.com.invalid> wrote:
>>
>>> '*would I then be able to query a specific field of articles or other
>>> "table" (with the same OR BETTER performances)?*'
>>> -> And especially, would I be able to get only 1 article in the result...
>>>
>>> On 15/04/2016 09:06, Bastien Latard - MDPI AG wrote:
>>>
>>> Thanks Jack.
>>>
>>> I know that Solr is a search engine, but this replaces a search in my
>>> MySQL DB with this model:
>>>
>>>
>>> *My goal is to improve my environment (and my performances at the same
>>> time).*
>>>
>>> *Yes, I have a Solr data model... but atm I created 4 different indexes
>>> for "similar service usage".*
>>> *So atm, for 70 million documents, I am duplicating journal data and
>>> publisher data all the time in 1 index (for all articles from the same
>>> journal/pub) in order to be able to retrieve all data in 1 query...*
>>>
>>> *I found yesterday that there is the possibility to create like an array
>>> of <entity> in the data-conf.xml.*
>>> e.g. (pseudo code - incomplete):
>>> <entity name="solr_publisher" query="select id, name from publishers">
>>>   <entity name="solr_journal" query="select id, name as j_name from journals
>>>       WHERE publisher_id='${solr_publisher.id}'">
>>>     <entity name="solr_articles" query="select id, title, abstract from articles
>>>         WHERE journal_id='${solr_journal.id}'">
>>>       <entity name="solr_authors" query="select given_name, last_name from
>>>           authors WHERE article_id='${solr_articles.id}'"/>
>>>     </entity>
>>>   </entity>
>>> </entity>
>>>
>>>
>>> * Would this be a good option? Is this the denormalization you were
>>> proposing? *
>>>
>>> *If yes, would I then be able to query a specific field of articles or
>>> other "table" (with the same OR BETTER performances)? If yes, I might
>>> probably merge all the different indexes together. *
>>> *I'm currently joining everything in mysql, so duplicating the fields in
>>> the solr (pseudo code):*
>>> <entity  name="all" query="select * from articles INNER JOIN journal on
>>> [...]">
>>> *So I have an index for authors query, a general one for articles (only
>>> needed info of other tables) ...*
>>>
>>> Thanks in advance for the tips. :)
>>>
>>> Kind regards,
>>> Bastien
>>>
>>> On 14/04/2016 16:23, Jack Krupansky wrote:
>>>
>>> Solr is a search engine, not a database.
>>>
>>> JOINs? Although Solr does have some limited JOIN capabilities, they are
>>> more for special situations, not the front-line go-to technique for data
>>> modeling for search.
>>>
>>> Rather, denormalization is the front-line go-to technique for data
>>> modeling in Solr.
>>>
>>> In any case, the first step in data modeling is always to focus on your
>>> queries - what information will be coming into your apps and what
>>> information will the apps want to access based on those inputs.
>>>
>>> But wait... you say you are upgrading, which suggests that you have an
>>> existing Solr data model, and probably queries as well. So...
>>>
>>> 1. Share at least a summary of your existing Solr data model as well as
>>> at least a summary of the kinds of queries you perform today.
>>> 2. Tell us what exactly is driving your inquiry - are queries too slow,
>>> too cumbersome, not sufficiently powerful, or... what exactly is the
>>> problem you need to solve.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Thu, Apr 14, 2016 at 10:12 AM, Bastien Latard - MDPI AG <
>>> latard@mdpi.com.invalid> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> *I am upgrading from solr 4.2 to 6.0.*
>>>> *I successfully (after some time) migrated the config files and other
>>>> parameters...*
>>>>
>>>> Now I'm just wondering if my indexes are following the best
>>>> practices...(and they are probably not :-) )
>>>>
>>>> What would be the best if we have this kind of sql data to write in
>>>> Solr:
>>>>
>>>>
>>>> I have several different services which need (more or less), different
>>>> data based on these JOINs...
>>>>
>>>> e.g.:
>>>> Service A needs lots of data (but not all),
>>>> Service B needs a little data (some fields already included in A),
>>>> Service C needs a bit more data than B (some fields already included in
>>>> A/B)...
>>>>
>>>> *1. Would it be better to create one single index?*
>>>> *-> i.e.: this will duplicate journal info for every single article*
>>>>
>>>> *2. Would it be better to create several specific indexes, one for each
>>>> group of similar services?*
>>>> *-> i.e.: this will use more space on the disks (and there are
>>>> ~70 million documents to join)*
>>>>
>>>> *3. Would it be better to create an index per table and make a join?*
>>>> *-> if yes, how??*
>>>>
>>>> Kind regards,
>>>> Bastien
>>>>
>>>>
>>>
>>> Kind regards,
>>> Bastien Latard
>>> Web engineer
>>> --
>>> MDPI AG
>>> Postfach, CH-4005 Basel, Switzerland
>>> Office: Klybeckstrasse 64, CH-4057
>>> Tel. +41 61 683 77 35
>>> Fax: +41 61 302 89 18
>>> E-mail: latard@mdpi.com
>>> http://www.mdpi.com/
>>>
>>>
>>
>
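
As a rough illustration of the denormalization suggested earlier in the thread, the sketch below flattens publisher, journal, article, and author rows into one document per article, which is approximately what the nested `<entity>` configuration is meant to produce. The row layout, the output field names (`journal_name`, `publisher_name`, `author_names`), and the sample data are all assumptions made for the example; they are not taken from the actual MDPI schema.

```python
def denormalize(articles, journals, publishers, authors):
    """Return one flat document per article, with parent data copied in."""
    journal_by_id = {j["id"]: j for j in journals}
    publisher_by_id = {p["id"]: p for p in publishers}

    # Group author display names by article so they can be stored
    # in a multi-valued field on the article document.
    authors_by_article = {}
    for a in authors:
        name = f'{a["given_name"]} {a["last_name"]}'
        authors_by_article.setdefault(a["article_id"], []).append(name)

    docs = []
    for art in articles:
        journal = journal_by_id[art["journal_id"]]
        publisher = publisher_by_id[journal["publisher_id"]]
        docs.append({
            "id": art["id"],
            "title": art["title"],
            # Journal and publisher data are repeated on every article.
            "journal_name": journal["name"],
            "publisher_name": publisher["name"],
            "author_names": authors_by_article.get(art["id"], []),
        })
    return docs


# Hypothetical sample rows, for illustration only.
publishers = [{"id": 1, "name": "Example Press"}]
journals = [{"id": 10, "name": "Journal A", "publisher_id": 1}]
articles = [{"id": 100, "title": "First", "journal_id": 10},
            {"id": 101, "title": "Second", "journal_id": 10}]
authors = [{"article_id": 100, "given_name": "Ann", "last_name": "Lee"},
           {"article_id": 100, "given_name": "Bo", "last_name": "Kim"}]

docs = denormalize(articles, journals, publishers, authors)
```

The trade-off is the one raised in the original question: journal and publisher values are duplicated on every article, in exchange for answering each query from a single index without a join.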
