lucene-java-user mailing list archives

From Frank DeRose <fder...@guidewire.com>
Subject Join support across multiple document types in Lucene
Date Fri, 29 Jun 2012 19:13:59 GMT
All,

I have a question about join support across multiple document types in Solr/Lucene. Let me
lay out the use case.

Suppose I have 3 tables:


* Table A has 3 columns: id, a1, a2.

* Table B has 4 columns: id, b1, b2, and aid, which is a foreign key referencing A.id.

* Table C has 4 columns: id, c1, c2, and aid, which is a foreign key referencing A.id.

I want to be able to perform the following searches:


* Search for rows in A by specifying just values for columns in A. For example:

select * from A where A.a1 = 'value'


* Search for rows in A by specifying just values for columns in B or C or both. For example:

select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id
select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id
select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and C.c1 = 'value' and C.aid = A.id

Suppose that I want to store the data from A, B, and C in Solr/Lucene. How would I perform
these searches in a Solr/Lucene environment?

It seems that there are two possible approaches:


1.) Denormalize all data into one document. That is, my query in data-config.xml for doing
a full-import would be:

select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2
from A inner join B on B.aid = A.id inner join C on C.aid = A.id

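I believe this means that each row in A will produce one document for every combination of
its B rows and C rows, so the number of documents in my Lucene index will be on the order of:

sum over rows a in A of cardinality(B rows with aid = a.id) * cardinality(C rows with aid = a.id)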

This will result in a large amount of redundant data in my index.
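
A minimal sketch of how the full-import query above multiplies rows, using Python's sqlite3 module with a small made-up data set (the row values and the names `denormalized_rows` and `rows_per_a` are hypothetical, chosen only for illustration). Each row in A contributes one result row per combination of its B rows and C rows:

```python
import sqlite3

# Hypothetical data: A.id=1 has 2 B rows and 3 C rows; A.id=2 has 1 B row and 1 C row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table A (id integer primary key, a1 text, a2 text);
    create table B (id integer primary key, b1 text, b2 text, aid integer references A(id));
    create table C (id integer primary key, c1 text, c2 text, aid integer references A(id));
""")
conn.executemany("insert into A (id, a1, a2) values (?, ?, ?)",
                 [(1, "x", "y"), (2, "x", "z")])
conn.executemany("insert into B (b1, b2, aid) values (?, ?, ?)",
                 [("b", "1", 1), ("b", "2", 1), ("b", "3", 2)])
conn.executemany("insert into C (c1, c2, aid) values (?, ?, ?)",
                 [("c", "1", 1), ("c", "2", 1), ("c", "3", 1), ("c", "4", 2)])

# The same join as the full-import query: one result row per future document.
denormalized_rows = conn.execute("""
    select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2
    from A inner join B on B.aid = A.id inner join C on C.aid = A.id
""").fetchall()

# Count how many denormalized rows each A.id produced.
rows_per_a = {}
for row in denormalized_rows:
    rows_per_a[row[0]] = rows_per_a.get(row[0], 0) + 1

print(len(denormalized_rows))  # 7 for this data: 2*3 for A.id=1 plus 1*1 for A.id=2
print(rows_per_a)
```

Note that the inner joins also drop any row in A that has no matching row in B or in C, so such rows would not be indexed at all under this query.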


2.) Store the data from each table in a separate document type, say, docA, docB, docC. This
would require me to perform three separate searches and to join the results based on the A.id,
B.aid, C.aid columns.
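
A minimal sketch of what the second approach implies on the client side, with plain Python lists standing in for the three document types. The `search` helper, the function `find_a_by_children`, and all field values are hypothetical; a real implementation would issue Solr/Lucene queries instead of scanning lists:

```python
# Hypothetical stand-ins for three separately indexed document types.
doc_a = [{"id": 1, "a1": "open"}, {"id": 2, "a1": "closed"}, {"id": 3, "a1": "open"}]
doc_b = [{"id": 10, "b1": "DeRose", "aid": 1},
         {"id": 11, "b1": "Smith", "aid": 2},
         {"id": 12, "b1": "DeRose", "aid": 3}]
doc_c = [{"id": 20, "c1": "auto", "aid": 1},
         {"id": 21, "c1": "home", "aid": 3}]


def search(docs, field, value):
    """Stand-in for one index search: return documents whose field equals value."""
    return [d for d in docs if d.get(field) == value]


def find_a_by_children(b_value, c_value):
    """Find docA entries that have a matching docB and a matching docC."""
    b_hits = search(doc_b, "b1", b_value)   # search 1: child type B
    c_hits = search(doc_c, "c1", c_value)   # search 2: child type C
    # Client-side join: keep only the A ids referenced by both result sets.
    a_ids = {d["aid"] for d in b_hits} & {d["aid"] for d in c_hits}
    # Search 3: fetch the parent documents by id.
    return [a for a in doc_a if a["id"] in a_ids]


matches = find_a_by_children("DeRose", "auto")
print([a["id"] for a in matches])
```

The intermediate id sets are held in application memory, so the cost of this approach grows with the number of child hits rather than with the number of final results.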

I am (dimly?) aware that the Solr/Lucene community is working on various solutions to this
problem. For example, I've read Mike McCandless' description of BlockJoinQuery
<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>.
This approach does not seem to solve our problem since (unless I am mistaken) the query requires
at least one predicate to be specified for the parent entity (A in my example). We, on the other
hand, want to be able to perform searches where only predicates for the child entities (B and C
in my example) are specified. To give a concrete example, Table A might be a Claim table and
Table B might be a Contact table, and we want to search for Claims based on Contact info, for
example: search for all Claims where the lastName of a Contact matches 'DeRose'. Is my analysis
correct? That is, is BlockJoinQuery only unidirectional from parent to child?

On the other hand, Lucene "query time joining" discussed here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene>
seems to address our problem. The following paragraph seems to imply that queries can be specified
in terms of data contained in the child documents:

You could also change the example and give all articles that match with a certain comment
query. In this example the multipleValuesPerDocument is set to false and the fromField (the
id field) only contains one value per document. However, the example would still work if
multipleValuesPerDocument variable were set to true, but it would then work in a less efficient
manner.

That is, Lucene "query time joining" is bidirectional. Of course, this raises the question:
how efficient are these queries? We thought about moving these queries from our RDBMS to
Solr/Lucene because executing equivalent queries in the RDBMS sometimes produced pathological
worst-case behavior (queries taking tens of minutes). Are we going to encounter the same
problems in Solr/Lucene?
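
A minimal conceptual sketch of the two-pass idea behind query time joining as the quoted paragraph describes it: first collect the join values from the documents that match a query on one side, then select the documents on the other side whose join field holds one of those values. This is not Lucene's API. The function `query_time_join`, the Claim and Contact records, and the field name `claimId` are all hypothetical, and the sketch says nothing about how the real implementation performs:

```python
def query_time_join(from_docs, from_field, from_query, to_docs, to_field):
    """Return to_docs whose to_field matches a from_field value of a from_query hit."""
    # Pass 1: collect join values from the "from" documents that match the query.
    join_values = set()
    for doc in from_docs:
        if from_query(doc):
            value = doc.get(from_field)
            if isinstance(value, (list, tuple, set)):
                join_values.update(value)   # field holding multiple values per document
            elif value is not None:
                join_values.add(value)      # field holding one value per document
    # Pass 2: select the "to" documents whose join field is among the collected values.
    return [doc for doc in to_docs if doc.get(to_field) in join_values]


# Hypothetical Claim (parent) and Contact (child) documents.
claims = [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}]
contacts = [{"id": 10, "lastName": "DeRose", "claimId": 1},
            {"id": 11, "lastName": "Smith", "claimId": 2}]

# Child to parent: Claims that have a Contact whose lastName is 'DeRose'.
claims_for_derose = query_time_join(
    contacts, "claimId", lambda d: d["lastName"] == "DeRose", claims, "id")

# Parent to child: Contacts attached to open Claims.
contacts_on_open_claims = query_time_join(
    claims, "id", lambda d: d["status"] == "open", contacts, "claimId")

print([c["id"] for c in claims_for_derose])
print([c["id"] for c in contacts_on_open_claims])
```

In this sketch the same function serves both directions simply by swapping which side supplies the query and which field is treated as the join key.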

So, any comments on the correctness of my analysis and any pointers to applicable resources
that discuss this problem are appreciated.

F

_________________________________________
Frank DeRose
Guidewire Software | Senior Software Engineer
Cell: 510 -589-0752
fderose@guidewire.com | www.guidewire.com
Deliver insurance your way with flexible core systems from Guidewire.


