Return-Path: X-Original-To: apmail-lucene-java-user-archive@www.apache.org Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A1D8ED13B for ; Fri, 29 Jun 2012 19:14:33 +0000 (UTC) Received: (qmail 33196 invoked by uid 500); 29 Jun 2012 19:14:31 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 32672 invoked by uid 500); 29 Jun 2012 19:14:30 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 32283 invoked by uid 99); 29 Jun 2012 19:14:30 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 19:14:30 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,SPF_HELO_PASS,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of fderose@guidewire.com designates 204.154.121.39 as permitted sender) Received: from [204.154.121.39] (HELO mx-1.guidewire.com) (204.154.121.39) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Jun 2012 19:14:24 +0000 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArcEAM397U/AqAeF/2dsb2JhbABCA4JFqxoBiX2CGgUtHCkZAQ8GFRYBBjkUEwQTCAELB4d7vCmOIIJBYAOWRYRmjGo X-IronPort-AV: E=Sophos;i="4.77,499,1336374000"; d="scan'208,217";a="689665" From: Frank DeRose To: "java-user@lucene.apache.org" Subject: Join support across multiple document types in Lucene Thread-Topic: Join support across multiple document types in Lucene Thread-Index: AQHNVitWZjX24/9NfESbW/Ye62+ZTA== Date: Fri, 29 Jun 2012 19:13:59 +0000 Message-ID: <69C3E7DDF0760840807F5FBADDD40C3E171A2D75@sm-ex-02-vm.guidewire.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: Content-Type: multipart/alternative; boundary="_000_69C3E7DDF0760840807F5FBADDD40C3E171A2D75smex02vmguidewi_" MIME-Version: 1.0 --_000_69C3E7DDF0760840807F5FBADDD40C3E171A2D75smex02vmguidewi_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable All, I have a question about join support across multiple document types in Solr= /Lucene. Let me lay out the use case. Suppose I have 3 tables: * Table A has 3 columns, id, a1, a2. * Table B has 4 columns, id, b1, b2, and aid, which is a foreign ke= y referencing A.id. * Table C has 4 columns, id, c1, c2, and aid, which is a foreign ke= y referencing A.id. I want to be able to perform the following searches: * Search for rows in A by specifying just values for columns in A. = For example, select * from A where A.a1 =3D 'value' * Search for rows in A by specifying just values for columns in B o= r C or both. For example, select A.*, B.* from A, B where B.b1 =3D 'value' and B.aid =3D A.id select A.*, C.* from A, C where C.c1 =3D 'value' and C.aid =3D A.id select A.*, B.*, C.* from A, B, C where B.b1 =3D 'value' and B.aid =3D A.id= and C.c1 =3D 'value' and C.aid =3D A.id Suppose that I want to store the data from A, B, and C in Solr/Lucene. How = would I perform these searches in a Solr/Lucene environment? It seems that there are two possible approaches: 1.) Denormalize all data into one document. That is, my query in data-c= onfig.xml for doing a full-import would be: select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A inner join B on B.ai= d =3D A.id inner join C on C.aid =3D A.id I believe this means that the number of documents in my Lucene index will b= e on the order of the product: cardinality(A) * cardinality(B) * cardinality(C) This will result in a large amount of redundant data in my index. 2.) Store the data from each table into a separate document, say, docA, = docB, docC. This would require me to perform three separate searches and to= join the result based on the A.id, B.aid, C.aid columns. I am (dimly?) aware that the Solr/Lucene community is working on various so= lutions to this problem. For example, I've read Mike McCandless' descriptio= n of the BlockJoinQuery. This approach does not seem to solve our probl= em since (unless I am mistaken) the query requires at least one predicate t= o be specified for parent entity (A in my example). We, on the other hand, = want to be able to have the ability to perform searches where only predicat= es for the child entities (B and C in my example) are specified. To give a = concrete example, Table A might be a Claim table and Table B might be a Con= tact table and we want to search for Claims based on Contact info, for exam= ple: search for all claims where the lastName of a Contact matches 'DeRose'= . Is my analysis correct? That is, is BlockJoinQuery only unidirectional fr= om parent to child? On the other hand, Lucene "query time joining" discussed here seems to addres= s our problem. The following paragraph seems to imply that queries can be s= pecified in terms of data contained in the child documents: You could also change the example and give all articles that match with a c= ertain comment query. In this example the multipleValuesPerDocument is set = to false and the fromField (the id field) only contains one value per docu= ment. However, the example would still work if multipleValuesPerDocument v= ariable were set to true, but it would then work in a less efficient manner= . That is, Lucene "query time joining" is bidirectional. Of course, this begs= the question: How efficient are these queries. The reason why we thought a= bout moving these queries from our RDBMS to Solr/Lucene is because executin= g equivalent queries in the RDBMS sometimes produced pathological worst-cas= e behavior (queries taking 10's of minutes). Are we going to encounter the = same problems in Solr/Lucene? So, any comments on the correctness of my analysis and any pointers to appl= icable resources that discuss this problem are appreciated. F _________________________________________ Frank DeRose Guidewire Software | Senior Software Engineer Cell: 510 -589-0752 fderose@guidewire.com | www.guidewire.com Deliver insurance your way with flexible core systems from Guidewire. --_000_69C3E7DDF0760840807F5FBADDD40C3E171A2D75smex02vmguidewi_--