lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Join support across multiple document types in Lucene
Date Fri, 29 Jun 2012 20:47:29 GMT
Performance here is an issue. The performance of Solr/Lucene's query joins is a
function of the number of _unique_ values in the fields being joined, and can
be unacceptably slow in the wild, it depends on your use-case.

denormalization is the usual  approach, but that gives DB folks the hives. But
how many documents would that amount to? You haven't said how big
things are, and
Solr handles very large numbers of documents reasonably well. There
are some other
techniques, but it largely depends on _why_ you're doing joins. What's
the use-case
here? There may be clever indexing you can do that will ameliorate the problem.

For instance, some people have used a hybrid approach, using Solr to do the
searching part on the textual data and pulling the joined bits from the DB. This
can make sense and it can be totally idiotic as a proposal It Depends (tm)....

Best
Erick

On Fri, Jun 29, 2012 at 3:13 PM, Frank DeRose <fderose@guidewire.com> wrote:
> All,
>
> I have a question about join support across multiple document types in Solr/Lucene. Let
me lay out the use case.
>
> Suppose I have 3 tables:
>
>
> *         Table A has 3 columns, id, a1, a2.
>
> *         Table B has 4 columns, id, b1, b2, and aid, which is a foreign key referencing
A.id.
>
> *         Table C has 4 columns, id, c1, c2, and aid, which is a foreign key referencing
A.id.
>
> I want to be able to perform the following searches:
>
>
> *         Search for rows in A by specifying just values for columns in A. For example,
>
> select * from A where A.a1 = 'value'
>
>
> *         Search for rows in A by specifying just values for columns in B or C or
both. For example,
>
> select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id
> select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id
> select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and C.c1 = 'value'
and C.aid = A.id
>
> Suppose that I want to store the data from A, B, and C in Solr/Lucene. How would I perform
these searches in a Solr/Lucene environment?
>
> It seems that there are two possible approaches:
>
>
> 1.)     Denormalize all data into one document. That is, my query in data-config.xml
for doing a full-import would be:
>
> select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A inner join B on B.aid = A.id inner
join C on C.aid = A.id
>
> I believe this means that the number of documents in my Lucene index will be on the order
of the product:
>
> cardinality(A) * cardinality(B) * cardinality(C)
>
> This will result in a large amount of redundant data in my index.
>
>
> 2.)    Store the data from each table into a separate document, say, docA, docB, docC.
This would require me to perform three separate searches and to join the result based on the
A.id, B.aid, C.aid columns.
>
> I am (dimly?) aware that the Solr/Lucene community is working on various solutions to
this problem. For example, I've read Mike McCandless' description of the BlockJoinQuery<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>.
This approach does not seem to solve our problem since (unless I am mistaken) the query requires
at least one predicate to be specified for parent entity (A in my example). We, on the other
hand, want to be able to have the ability to perform searches where only predicates for the
child entities (B and C in my example) are specified. To give a concrete example, Table A
might be a Claim table and Table B might be a Contact table and we want to search for Claims
based on Contact info, for example: search for all claims where the lastName of a Contact
matches 'DeRose'. Is my analysis correct? That is, is BlockJoinQuery only unidirectional from
parent to child?
>
> On the other hand, Lucene "query time joining" discussed here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene>
seems to address our problem. The following paragraph seems to imply that queries can be specified
in terms of data contained in the child documents:
>
> You could also change the example and give all articles that match with a certain comment
query. In this example the multipleValuesPerDocument is set to false and the fromField  (the
id field) only contains one value per document. However, the example would still work if multipleValuesPerDocument
 variable were set to true, but it would then work in a less efficient manner.
>
> That is, Lucene "query time joining" is bidirectional. Of course, this begs the question:
How efficient are these queries. The reason why we thought about moving these queries from
our RDBMS to Solr/Lucene is because executing equivalent queries in the RDBMS sometimes produced
pathological worst-case behavior (queries taking 10's of minutes). Are we going to encounter
the same problems in Solr/Lucene?
>
> So, any comments on the correctness of my analysis and any pointers to applicable resources
that discuss this problem are appreciated.
>
> F
>
> _________________________________________
> Frank DeRose
> Guidewire Software | Senior Software Engineer
> Cell: 510 -589-0752
> fderose@guidewire.com<mailto:fderose@guidewire.com> | www.guidewire.com<http://www.guidewire.com/>
> Deliver insurance your way with flexible core systems from Guidewire.
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message