lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: Solr 6 Distributed Join
Date Thu, 24 Dec 2015 15:50:50 GMT
I haven't had a chance to review. If you have a reproducible failure on a
one-to-many join go ahead and create a jira ticket.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Dec 24, 2015 at 3:25 AM, Akiel Ahmed <AHMEDAKI@uk.ibm.com> wrote:

> Hi
>
> Did you get a chance to check whether one-to-many joins were covered in
> your tests? If yes, can you make any suggestions for what I could be doing
> wrong?
>
> Cheers
>
> Akiel
>
>
>
> From:   Joel Bernstein <joelsolr@gmail.com>
> To:     solr-user@lucene.apache.org
> Date:   22/12/2015 13:03
> Subject:        Re: Solr 6 Distributed Join
>
>
>
> Just did a quick review of the InnerJoinStream and it appears that it
> should handle one-to-one, one-to-many, many-to-one and many-to-many joins.
> It will take a closer review of the tests to see if all these cases are
> covered. So the innerJoin is designed to handle the case you describe. If
> it doesn't work properly it makes sense to file a bug report.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
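As a rough illustration of the merge-join behaviour described above, here is a minimal Python sketch, not Solr's InnerJoinStream code. It assumes both inputs are lists of dicts already sorted ascending on their join keys, and that fields from the right-hand tuple overwrite same-named fields from the left (chosen so the output matches the result shape expected later in this thread). The function name `merge_join` and the sample rows are illustrative only.

```python
def merge_join(left, right, left_key, right_key):
    """Inner merge join of two lists of dicts, each sorted ascending on its join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the full run of equal keys on each side, so that
            # one-to-many and many-to-many matches are all emitted.
            i_end = i
            while i_end < len(left) and left[i_end][left_key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    out.append({**l_row, **r_row})
            i, j = i_end, j_end
    return out


# Rows modelled on the sample data in this thread: one person, four links.
people = [{"id": "1"}]
links = [
    {"id": "3", "e1": "1"},
    {"id": "4", "e1": "1"},
    {"id": "5", "e1": "1"},
    {"id": "6", "e1": "1"},
]
joined = merge_join(people, links, "id", "e1")
```

With this input the sketch yields four joined rows, one per link, which is the one-to-many result the original poster was expecting.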
>
> On Tue, Dec 22, 2015 at 5:55 AM, Akiel Ahmed <AHMEDAKI@uk.ibm.com> wrote:
>
> > Hi,
> >
> > I tried a straightforward join against something that is connected to
> > many things but didn't get the results I expected - I wanted to check
> > whether my expectations are off, and whether I can do anything in Solr
> > to do what I want. So given the data:
> >
> > id,type,e1,e2,text
> > 1,ABC,,,John Smith
> > 2,ABC,,,Jane Doe
> > 3,DEF,1,2,1
> > 4,DEF,1,2,2
> > 5,DEF,1,2,4
> > 6,DEF,1,2,8
> >
> > and the query
> >
> > http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> > fl="id", q=text:John, sort="id
> > asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
> > fl="id,e1", q=type:DEF, sort="id
> > asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> >
> > I expected
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"3"},
> > {"e1":"1","id":"4"},
> > {"e1":"1","id":"5"},
> > {"e1":"1","id":"6"},
> > {"EOF":true,"RESPONSE_TIME":56}]}}
> >
> > but instead I got
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"3"},
> > {"EOF":true,"RESPONSE_TIME":58}]}}
> >
> > Deleting the document with id 3, and rerunning the query (see above)
> > returned
> >
> > {"result-set":{"docs":[
> > {"e1":"1","id":"4"},
> > {"EOF":true,"RESPONSE_TIME":56}]}}
> >
> > So it looks like the join only finds the first thing to join on. Is this
> > expected behaviour? If so, is there anything I can do to convince Solr to
> > return all the things it is connected to?
> >
> > Cheers
> >
> > Akiel
> > ----- Forwarded by Akiel Ahmed/UK/IBM on 22/12/2015 10:47 -----
> >
> > From:   Akiel Ahmed/UK/IBM
> > To:     solr-user@lucene.apache.org
> > Date:   21/12/2015 11:16
> > Subject:        Re: Solr 6 Distributed Join
> >
> >
> > Thank you for the help.
> >
> > I am working through what I want to do with the join - will let you know
> > if I hit any issues.
> >
> >
> >
> > From:   Joel Bernstein <joelsolr@gmail.com>
> > To:     solr-user@lucene.apache.org
> > Date:   17/12/2015 15:40
> > Subject:        Re: Solr 6 Distributed Join
> >
> >
> >
> > One thing to note about the hashJoin is that it requires the search
> > results from the hashed query to fit entirely in memory.
> >
> > The innerJoin does not have this requirement as it performs a streaming
> > merge join.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
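The memory trade-off above can be sketched in a few lines of Python. This is an illustration of the general hash-join technique, not Solr's HashJoinStream: the hashed side is read completely into an in-memory table, while the other side is consumed one row at a time. The function name `hash_join`, the field names and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


def hash_join(stream, hashed, key):
    """Inner hash join: 'hashed' is fully materialised, 'stream' is read lazily."""
    # Build phase: every row of the hashed side is held in memory at once,
    # which is why that side has to fit in memory.
    table = defaultdict(list)
    for row in hashed:
        table[row[key]].append(row)
    # Probe phase: the streamed side is never held in full and needs no sort order.
    for row in stream:
        for match in table.get(row[key], []):
            yield {**row, **match}


# Hypothetical rows; note the streamed side is not sorted on the join key.
reviews = iter([
    {"userId": "u2", "restaurantId": "r9", "score": 4},
    {"userId": "u1", "restaurantId": "r1", "score": 5},
    {"userId": "u3", "restaurantId": "r1", "score": 2},
])
restaurants = [{"restaurantId": "r1", "restaurantName": "Example Diner"}]
joined = list(hash_join(reviews, restaurants, "restaurantId"))
```

A merge join, by contrast, only ever needs the current run of equal keys from each side, at the cost of requiring both inputs to be sorted on the join key.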
> >
> > On Thu, Dec 17, 2015 at 10:33 AM, Joel Bernstein <joelsolr@gmail.com>
> > wrote:
> >
> > > Below is an example of nested joins where the innerJoin is done in
> > > parallel using the parallel function. The partitionKeys parameter
> > > needs to be added to the searches when the parallel function is used
> > > to partition the results across worker nodes.
> > >
> > > hashJoin(
> > >     parallel(workerCollection,
> > >         innerJoin(
> > >             search(users, q="*:*", fl="userId, full_name, hometown",
> > >                    sort="userId asc", zkHost="zk2:2345",
> > >                    qt="/export", partitionKeys="userId"),
> > >             search(reviews, q="*:*", fl="userId, review, score",
> > >                    sort="userId asc", zkHost="zk1:2345",
> > >                    qt="/export", partitionKeys="userId"),
> > >             on="userId"
> > >         ),
> > >         workers="20",
> > >         zkHost="zk1:2345",
> > >         sort="userId asc"
> > >     ),
> > >     hashed=search(restaurants, q="city:nyc",
> > >                   fl="restaurantId, restaurantName",
> > >                   sort="restaurantId asc", zkHost="zk1:2345",
> > >                   qt="/export"),
> > >     on="restaurantId"
> > > )
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
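To show why both searches are partitioned on the join key, here is a small Python sketch of key-based partitioning. It is only a model of the idea: the hash function Solr actually uses is not stated in this thread, so `zlib.crc32` is a stand-in, and the function name `partition`, the worker count and the rows are made up for the example.

```python
import zlib


def partition(rows, key, workers):
    """Assign each row to a worker bucket based on a hash of its partition key."""
    buckets = [[] for _ in range(workers)]
    for row in rows:
        # crc32 is a deterministic stand-in for whatever hash the server uses.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        buckets[h % workers].append(row)
    return buckets


users = [{"userId": f"u{n}"} for n in range(1, 7)]
reviews = [{"userId": f"u{n}", "score": n} for n in (1, 1, 2, 4, 6, 6)]

user_parts = partition(users, "userId", 3)
review_parts = partition(reviews, "userId", 3)

# Which worker received each user.
owner = {row["userId"]: w for w, part in enumerate(user_parts) for row in part}
```

Because both sides are partitioned on the same key with the same function, every review lands on the worker that also holds its user, so each worker can run its share of the join independently.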
> > >
> > > On Thu, Dec 17, 2015 at 10:29 AM, Joel Bernstein <joelsolr@gmail.com>
> > > wrote:
> > >
> > >> The innerJoin joins two streams sorted by the same join keys (merge
> > >> join). If a third stream has the same join keys you can nest
> > >> innerJoins. But all three tables need to be sorted by the same join
> > >> keys to nest innerJoins (merge joins).
> > >>
> > >> innerJoin(innerJoin(...),
> > >>                 search(...),
> > >>                 on...)
> > >>
> > >> If the third stream is joined on a different key you can nest inside
> > >> a hashJoin, which doesn't require streams to be sorted on the join
> > >> key. For example:
> > >>
> > >> hashJoin(innerJoin(...),
> > >>                 hashed=search(...),
> > >>                 on..)
> > >>
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
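The nesting pattern above can be made concrete with a small Python helper that assembles expression strings. This is only a string-building sketch: the helper names (`search`, `inner_join`, `hash_join`), the collections `a`, `b`, `c`, `d` and the field names are hypothetical, and nothing here talks to Solr.

```python
def search(collection, **params):
    """Render a search(...) expression with quoted parameter values."""
    args = ", ".join(f'{name}="{value}"' for name, value in params.items())
    return f"search({collection}, {args})"


def inner_join(left, right, on):
    """Merge join: both inputs must be sorted on the join key."""
    return f'innerJoin({left}, {right}, on="{on}")'


def hash_join(stream, hashed, on):
    """Hash join: the hashed input does not need to be sorted on the join key."""
    return f'hashJoin({stream}, hashed={hashed}, on="{on}")'


def by_key(collection):
    # All three merge-joined searches share the same sort key.
    return search(collection, q="*:*", fl="k,other", sort="k asc", qt="/export")


# Three streams on the same key nest as innerJoins; a fourth stream on a
# different key goes through hashJoin.
expr = hash_join(
    inner_join(inner_join(by_key("a"), by_key("b"), on="k"), by_key("c"), on="k"),
    search("d", q="*:*", fl="other,name", sort="other asc", qt="/export"),
    on="other",
)
```

Printing `expr` gives a single-line expression that can be passed as the stream parameter.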
> > >>
> > >> On Thu, Dec 17, 2015 at 9:28 AM, Akiel Ahmed <AHMEDAKI@uk.ibm.com>
> > wrote:
> > >>
> > >>> Hi again,
> > >>>
> > >>> I got the join to work. A team mate pointed out that one of the
> > >>> search functions in the innerJoin query was missing a field in the
> > >>> join - adding the e1 field to the fl parameter of the second search
> > >>> function gave the result I expected:
> > >>>
> > >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> > >>> fl="id", q=text:John, sort="id
> > >>> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
> > >>> fl="id,e1", q=text:Friends, sort="id
> > >>> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> > >>>
> > >>> I am still interested in whether we can specify a join using an
> > >>> arbitrary number of searches.
> > >>>
> > >>> Cheers
> > >>>
> > >>> Akiel
> > >>>
> > >>>
> > >>>
> > >>> From:   Akiel Ahmed/UK/IBM@IBMGB
> > >>> To:     solr-user@lucene.apache.org
> > >>> Date:   16/12/2015 17:05
> > >>> Subject:        Re: Solr 6 Distributed Join
> > >>>
> > >>>
> > >>>
> > >>> Hi Dennis,
> > >>>
> > >>> Thank you for your help. I used your explanation to construct an
> > >>> innerJoin query; I think I am getting further but didn't get the
> > >>> results I expected.
> > >>>
> > >>> The following describes what I did - is there any chance you can
> > >>> tell where I am going wrong:
> > >>>
> > >>> Solr 6 Developer Builds: #2738 and #2743
> > >>>
> > >>> 1. Modified server/solr/configsets/basic_configs/conf/managed-schema
> > >>> so it reads:
> > >>>
> > >>> <?xml version="1.0" encoding="UTF-8" ?>
> > >>> <schema name="search" version="1.5">
> > >>>   <uniqueKey>id</uniqueKey>
> > >>>   <field name="id" type="id" indexed="true" stored="true"
> > >>>          required="true" multiValued="false" docValues="true"/>
> > >>>   <field name="_version_" type="solr_version" indexed="true"
> > >>>          stored="true" required="false" multiValued="false"
> > >>>          docValues="true"/>
> > >>>   <field name="type" type="id" indexed="true" stored="true"
> > >>>          required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="e1" type="id" indexed="true" stored="true"
> > >>>          required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="e2" type="id" indexed="true" stored="true"
> > >>>          required="false" multiValued="false" docValues="true"/>
> > >>>   <field name="text" type="free_text" indexed="true" stored="true"
> > >>>          required="false" multiValued="false"/>
> > >>>   <fieldType name="id" class="solr.StrField" sortMissingLast="true"/>
> > >>>   <fieldType name="solr_version" class="solr.TrieLongField"
> > >>>              precisionStep="0" positionIncrementGap="0"/>
> > >>>   <fieldType name="free_text" class="solr.TextField"
> > >>>              positionIncrementGap="100">
> > >>>     <analyzer>
> > >>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>>       <filter class="solr.LowerCaseFilterFactory"/>
> > >>>       <filter class="solr.WordDelimiterFilterFactory"
> > >>>               generateWordParts="1" generateNumberParts="1"
> > >>>               catenateWords="1" catenateNumbers="1" catenateAll="1"
> > >>>               splitOnCaseChange="1"/>
> > >>>       <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >>>               words="lang/stopwords_en.txt"/>
> > >>>     </analyzer>
> > >>>   </fieldType>
> > >>> </schema>
> > >>>
> > >>> 2. Modified server/solr/configsets/basic_configs/conf/solrconfig.xml,
> > >>> adding the following near the bottom of the file so it is the last
> > >>> request handler:
> > >>>
> > >>>   <requestHandler name="/stream" class="solr.StreamHandler">
> > >>>         <lst name="invariants">
> > >>>                 <str name="wt">json</str>
> > >>>                 <str name="distrib">false</str>
> > >>>         </lst>
> > >>>   </requestHandler>
> > >>>
> > >>> 3. Used solr -e cloud to set up a Solr cloud instance, picking all
> > >>> the defaults except I chose basic_configs
> > >>>
> > >>> 4. After Solr is running I ingested the following data via the Solr
> > >>> Web UI (/update handler, Document Type = CSV)
> > >>> id,type,e1,e2,text
> > >>> 1,ABC,,,John Smith
> > >>> 2,ABC,,,Jane Smith
> > >>> 3,ABC,,,MiKe Smith
> > >>> 4,ABC,,,John Doe
> > >>> 5,ABC,,,Jane Doe
> > >>> 6,ABC,,,MiKe Doe
> > >>> 7,ABC,,,John Smith
> > >>> 8,DEF,,,Chicken Burger
> > >>> 9,DEF,,,Veggie Burger
> > >>> 10,DEF,,,Beef Burger
> > >>> 11,DEF,,,Chicken Donar
> > >>> 12,DEF,,,Chips
> > >>> 13,DEF,,,Drink
> > >>> 20,GHI,1,2,Friends
> > >>> 21,GHI,3,4,Friends
> > >>> 22,GHI,5,6,Friends
> > >>> 23,GHI,7,6,Friends
> > >>> 24,GHI,6,4,Friends
> > >>> 25,JKL,1,8,Order
> > >>> 26,JKL,2,9,Order
> > >>> 27,JKL,3,10,Order
> > >>> 28,JKL,4,11,Order
> > >>> 29,JKL,5,12,Order
> > >>> 30,JKL,6,13,Order
> > >>>
> > >>> 5. Navigating to the following URL in a browser returned an expected
> > >>> result:
> > >>> http://localhost:8983/solr/gettingstarted/select?q={!join from=id
> > >>> to=e1}text:John&fl="id"
> > >>>
> > >>> <response>
> > >>> ...
> > >>>   <result>
> > >>>     <doc>
> > >>>       <str name="id">20</str>
> > >>>       <str name="e1">1</str>
> > >>>       <str name="e2">2</str>
> > >>>       ...
> > >>>     </doc>
> > >>>     <doc>
> > >>>       <str name="id">28</str>
> > >>>       <str name="e1">4</str>
> > >>>       <str name="e2">11</str>
> > >>>       ...
> > >>>     </doc>
> > >>>     <doc>
> > >>>       <str name="id">23</str>
> > >>>       <str name="e1">7</str>
> > >>>       <str name="e2">6</str>
> > >>>       ...
> > >>>     </doc>
> > >>>   </result>
> > >>> </response>
> > >>>
> > >>> 6. Navigating to the following URL in a browser does NOT return what
> > >>> I expected:
> > >>>
> > >>> http://localhost:8983/solr/gettingstarted/stream?stream=innerJoin(search(gettingstarted,
> > >>> fl="id", q=text:John, sort="id
> > >>> asc",zkHost="localhost:9983",qt="/export"), search(gettingstarted,
> > >>> fl="id", q=text:Friends, sort="id
> > >>> asc",zkHost="localhost:9983",qt="/export"), on="id=e1")
> > >>>
> > >>> {"result-set":{"docs":[
> > >>> {"EOF":true,"RESPONSE_TIME":124}]}}
> > >>>
> > >>>
> > >>> I also have a join related question. Is there any chance I can
> > >>> specify a query and join for more than 2 things? For example:
> > >>>
> > >>> innerJoin(search(gettingstarted, fl="id", q=text:John, ...) as s1,
> > >>>           search(gettingstarted, fl="id", q=text:Chicken, ...) as s2
> > >>>           search(gettingstarted, fl="id", q=text:Friends, ...) as s3)
> > >>>           on="s1.id=s3.e1",
> > >>>           on="s2.id=s3.e2")
> > >>>
> > >>> Sorry if the query does not make sense, but given the data above my
> > >>> intention is to find a single result made up of 3 documents:
> > >>> s1.id=1,s2.id=8,s3.id=25
> > >>> Is that possible? If yes, will Solr 6 support an arbitrary number of
> > >>> queries and associated joins?
> > >>>
> > >>> Cheers
> > >>>
> > >>> Akiel
> > >>>
> > >>>
> > >>>
> > >>> From:   Dennis Gove <dpgove@gmail.com>
> > >>> To:     Akiel Ahmed/UK/IBM@IBMGB, solr-user@lucene.apache.org
> > >>> Date:   11/12/2015 15:34
> > >>> Subject:        Re: Solr 6 Distributed Join
> > >>>
> > >>>
> > >>>
> > >>> Akiel,
> > >>>
> > >>> Without seeing your full url I assume that you're missing the
> > >>> stream=innerJoin(.....) part of it. A full sample url would look
> > >>> like this:
> > >>>
> > >>> http://localhost:8983/solr/careers/stream?stream=innerJoin(search(careers,
> > >>> fl="personId,companyId,title", q=companyId:*, sort="companyId
> > >>> asc",zkHost="localhost:2181",qt="/export"),search(companies,
> > >>> fl="id,companyName", q=*:*, sort="id
> > >>> asc",zkHost="localhost:2181",qt="/export"),on="companyId=id")
> > >>>
> > >>> This example will return a join of career records with the company
> > >>> name for all career records with a non-null companyId.
> > >>>
> > >>> And the pieces have the following meaning:
> > >>>
> > >>> http://localhost:8983/solr/careers/stream?  - you have a collection
> > >>> called careers available on localhost:8983 and you're hitting its
> > >>> stream handler.
> > >>>
> > >>> ?stream=  - you are passing the stream parameter to the stream
> > >>> handler.
> > >>>
> > >>> zkHost="localhost:2181"  - there is a zk instance running on
> > >>> localhost:2181 where solr can get clusterstate information. Note
> > >>> that since you're sending the request to the careers collection this
> > >>> param is not required in the search(careers....) part but is
> > >>> required in the search(companies....) part. For simplicity I usually
> > >>> just provide it for all.
> > >>>
> > >>> qt="/export"  - tells solr to use the export handler. This assumes
> > >>> all your fields are in docValues. If you'd rather not use the export
> > >>> handler then you probably want to provide the rows=##### param to
> > >>> tell solr to return a large # of rows for each underlying search.
> > >>> Without it solr will default to, I believe, 10 rows.
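The pieces described above can be assembled programmatically. The following Python sketch uses only the standard library to build the request URL and percent-encode the expression as the stream parameter; the host, collection and expression are taken from the sample in this message, while the helper name `stream_url` is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def stream_url(base, collection, expression):
    """Build a /stream request URL, encoding the expression as the 'stream' parameter."""
    return f"{base}/{collection}/stream?" + urlencode({"stream": expression})


expr = (
    'innerJoin('
    'search(careers, fl="personId,companyId,title", q=companyId:*, '
    'sort="companyId asc", zkHost="localhost:2181", qt="/export"),'
    'search(companies, fl="id,companyName", q=*:*, '
    'sort="id asc", zkHost="localhost:2181", qt="/export"),'
    'on="companyId=id")'
)

url = stream_url("http://localhost:8983/solr", "careers", expr)

# Decoding the query string recovers the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)["stream"][0]
```

Encoding the expression this way avoids problems with the spaces, quotes and parentheses it contains, and makes it hard to send a request with no stream parameter at all.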
> > >>>
> > >>> CCing the user list so others can see this as well.
> > >>>
> > >>> We're working on additional documentation for Streaming Aggregation
> > >>> and Expressions. The page can be found at
> > >>> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> > >>> but it's missing a lot of things we've added recently.
> > >>>
> > >>> - Dennis
> > >>>
> > >>> On Fri, Dec 11, 2015 at 9:51 AM, Akiel Ahmed <AHMEDAKI@uk.ibm.com>
> > >>> wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> > Sorry, this is out of the blue - I have joined the Solr mailing
> > >>> > list, but I don't know if that is the correct place to ask my
> > >>> > question. If you are not the best person to talk to, can you please
> > >>> > point me in the right direction?
> > >>> >
> > >>> > I want to try using the Solr 6 distributed joins but can't find
> > >>> > enough material on the web to make it work. I have added the stream
> > >>> > handler to my solrconfig.xml (see below) and when issuing an inner
> > >>> > join query (see below) I get an error - the localparam named stream
> > >>> > is missing so I get a NullPointerException. Is there a way to play
> > >>> > with the join via the Solr web UI, or if not do you have a code
> > >>> > snippet via a SolrJ client that performs a join?
> > >>> >
> > >>> > solrconfig.xml
> > >>> >
> > >>> > <requestHandler name="/stream" class="solr.StreamHandler">
> > >>> >         <lst name="invariants">
> > >>> >                 <str name="wt">json</str>
> > >>> >                 <str name="distrib">false</str>
> > >>> >         </lst>
> > >>> > </requestHandler>
> > >>> >
> > >>> > query
> > >>> > innerJoin(
> > >>> >         search(getting_started, _search_field:john),
> > >>> >         search(getting_started, _search_field:friends),
> > >>> >         on="id=_link_from_id")
> > >>> >
> > >>> > Cheers
> > >>> >
> > >>> > Akiel
> > >>> > Unless stated otherwise above:
> > >>> > IBM United Kingdom Limited - Registered in England and Wales with
> > >>> > number 741598.
> > >>> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
> > >>> > PO6 3AU
> > >>> >
> > >>>
> > >>>
> > >>
> > >
> >
> >
>
>
