lucene-java-user mailing list archives

From Subodh Damle <...@damle.name>
Subject Searching repeating fields
Date Thu, 22 Jun 2006 19:18:11 GMT
Hi all.

We've been using Lucene to index our dynamic data structure and so far 
Lucene has been flexible enough to accommodate our requirements.

We now have a requirement to search repeating fields, and it is not 
clear how to implement it.

Our data records have a dynamic, tree-like structure.
For example, a portion of a record looks like this:
-company-data
---financial-data
------revenue-info
--------year
--------amount

For the record portion above, we create the Lucene fields 
"/company-data/financial-data/revenue-info/year" and 
"/company-data/financial-data/revenue-info/amount".

Here, 'revenue-info' is a repeating node, so we can have records like:
Record 1
---financial-data
------revenue-info
--------year = 2000
--------amount = 1000000
------revenue-info
--------year = 2001
--------amount = 2000000

Record 2
---financial-data
------revenue-info
--------year = 2000
--------amount = 2000000


Now we need to find records where 'year=2000' and 'amount=2000000', but 
only when both values **belong to the same revenue-info node**.
So the search should match Record 2 above, but NOT Record 1.
A simple BooleanQuery combining two terms 
(/company-data/financial-data/revenue-info/year:2000 AND 
/company-data/financial-data/revenue-info/amount:2000000) cannot do 
this, since it will also match Record 1: that record contains both 
values, only in different revenue-info nodes.
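
The false match can be reproduced with a small plain-Java model of the flattened fields. This is only a sketch and does not use the Lucene API: the class FlatFieldSketch and its methods are hypothetical names, and a document is modelled as a map from field name to the set of values stored under it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of a flattened document; this is NOT the Lucene API.
class FlatFieldSketch {
    static final String YEAR = "/company-data/financial-data/revenue-info/year";
    static final String AMOUNT = "/company-data/financial-data/revenue-info/amount";

    // Flattens {year, amount} pairs into one multi-valued field per path.
    // The pairing between a year and its amount is lost at this step.
    static Map<String, Set<String>> flatten(String[][] revenueInfos) {
        Map<String, Set<String>> doc = new HashMap<>();
        for (String[] node : revenueInfos) {
            doc.computeIfAbsent(YEAR, k -> new HashSet<>()).add(node[0]);
            doc.computeIfAbsent(AMOUNT, k -> new HashSet<>()).add(node[1]);
        }
        return doc;
    }

    static boolean hasTerm(Map<String, Set<String>> doc, String field, String value) {
        Set<String> values = doc.get(field);
        return values != null && values.contains(value);
    }

    // Models a boolean AND of two independent term clauses.
    static boolean matches(Map<String, Set<String>> doc, String year, String amount) {
        return hasTerm(doc, YEAR, year) && hasTerm(doc, AMOUNT, amount);
    }

    static Map<String, Set<String>> record1() {
        return flatten(new String[][] {{"2000", "1000000"}, {"2001", "2000000"}});
    }

    static Map<String, Set<String>> record2() {
        return flatten(new String[][] {{"2000", "2000000"}});
    }

    public static void main(String[] args) {
        System.out.println(matches(record1(), "2000", "2000000")); // true (unwanted match)
        System.out.println(matches(record2(), "2000", "2000000")); // true (wanted match)
    }
}
```

Record 1 matches because each clause is checked against the whole document, not against a single revenue-info node.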

A couple of clarifications: each of the nodes (company-data, 
financial-data, revenue-info, etc.) has a physical record in the DB and 
is individually accessible.
Also, we generate Lucene queries programmatically; we don't use the 
query parser at all.

We thought of a couple of approaches to implement this, but both have 
some limitations:

Approach 1: Generate a separate dummy Document for each 'revenue-info' 
node. That way, we can ensure that both matched fields belong to the same 
revenue-info node. The 'revenue-info' Document can then be linked to the 
actual record Document using an additional 'id' field.
This approach would, however, increase the index size (one additional 
Document per repeating node). [We would have a few dozen repeating nodes 
in each record.]
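
A minimal plain-Java sketch of Approach 1, again as a model and not Lucene code. The names NodeDocSketch, RevenueDoc and recordId are hypothetical; recordId stands in for the 'id' field that links a revenue-info Document back to its record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.LongPredicate;

// Plain-Java model of Approach 1; this is NOT the Lucene API.
class NodeDocSketch {
    // One entry per 'revenue-info' node. "recordId" is a hypothetical name
    // for the 'id' field that links back to the parent record.
    static final class RevenueDoc {
        final String recordId;
        final String year;
        final long amount;

        RevenueDoc(String recordId, String year, long amount) {
            this.recordId = recordId;
            this.year = year;
            this.amount = amount;
        }
    }

    // Three small documents instead of two flattened ones.
    static List<RevenueDoc> sampleIndex() {
        List<RevenueDoc> index = new ArrayList<>();
        index.add(new RevenueDoc("record1", "2000", 1_000_000L));
        index.add(new RevenueDoc("record1", "2001", 2_000_000L));
        index.add(new RevenueDoc("record2", "2000", 2_000_000L));
        return index;
    }

    // Both conditions are tested against the same node document,
    // then the match is mapped back to the parent record id.
    static Set<String> search(List<RevenueDoc> index, String year, LongPredicate amountTest) {
        Set<String> recordIds = new TreeSet<>();
        for (RevenueDoc doc : index) {
            if (doc.year.equals(year) && amountTest.test(doc.amount)) {
                recordIds.add(doc.recordId);
            }
        }
        return recordIds;
    }

    public static void main(String[] args) {
        List<RevenueDoc> index = sampleIndex();
        System.out.println(search(index, "2000", a -> a == 2_000_000L)); // [record2]
        System.out.println(search(index, "2000", a -> a > 1_000_000L));  // [record2]
    }
}
```

Because year and amount stay in separate fields, a range condition on amount remains possible; the cost is the extra document per repeating node.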

Approach 2: Merge the values of 'year' and 'amount' into a single field, 
e.g. /company-data/financial-data/revenue-info-dummy-field = 2000#2000000
However, the problem is that we may need to do range queries on some 
fields. For queries like year=2000 AND amount>1000000, we can't 
search this composite field value, so this approach may be unusable for us.
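
A small plain-Java sketch of Approach 2, under the assumption that composite values are stored and compared as plain strings (this models the idea only and is not Lucene code; CompositeFieldSketch and its methods are hypothetical names). It shows that an exact lookup separates the two records, while string ordering of the composite values does not follow the numeric ordering of the amount.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java model of Approach 2; this is NOT the Lucene API.
class CompositeFieldSketch {
    // Joins year and amount with '#', as in the example value 2000#2000000.
    static String composite(String year, String amount) {
        return year + "#" + amount;
    }

    // Composite values stored for each record's revenue-info nodes.
    static Set<String> record1() {
        return new HashSet<>(Arrays.asList(
                composite("2000", "1000000"),
                composite("2001", "2000000")));
    }

    static Set<String> record2() {
        return new HashSet<>(Arrays.asList(composite("2000", "2000000")));
    }

    // Exact match on the composite value keeps year and amount paired.
    static boolean matchesExact(Set<String> terms, String year, String amount) {
        return terms.contains(composite(year, amount));
    }

    public static void main(String[] args) {
        System.out.println(matchesExact(record1(), "2000", "2000000")); // false
        System.out.println(matchesExact(record2(), "2000", "2000000")); // true

        // As strings, "2000#999999" sorts after "2000#1000000",
        // although 999999 is numerically smaller than 1000000.
        String small = composite("2000", "999999");
        String large = composite("2000", "1000000");
        System.out.println(small.compareTo(large) > 0); // true
    }
}
```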

Has anyone encountered something similar? Can it be implemented some 
other way?

Any suggestions would be greatly appreciated.

-Subodh


