lucene-java-user mailing list archives

From "Mark Ferguson" <>
Subject Re: Searching repeating fields
Date Tue, 18 Nov 2008 21:49:21 GMT
I'll provide a better example; perhaps it will help in formulating a
solution.

Suppose I am designing an index that stores invoices. One document
corresponds to one invoice, which has a unique id. Any number of employees
can make comments on the invoices, and comments have different
classifications (request_for_approval, redirection, approval,
miscellaneous). Each comment is timestamped. An invoice also contains a long
description that is both indexed and stored.

So an example document may look like this:

invoice_id: 1234
invoice_description: (some text)
employee_id: 5
employee_id: 8
employee_id: 12
comment_type: request_for_approval
comment_type: redirection
comment_type: approval
comment: please approve invoice
comment: sending invoice to sales
comment: invoice approved

I want to be able to search by any combination of these fields. For example,
I may want all of employee 5's requests for approval from today.
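
To make the difficulty concrete, here is a minimal sketch in plain Java (it
does not use any Lucene API) that models the example document the way a
match on repeated fields effectively sees it: each field name maps to the
set of all its values. The class and method names are hypothetical, and the
sketch assumes the repeated values are parallel by position, so employee 12
is the one who made the approval.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of this is Lucene API.
final class ParallelFieldsDemo {

    // The example invoice as (field, value) pairs, in the order listed above.
    static String[][] exampleInvoice() {
        return new String[][] {
            {"invoice_id", "1234"},
            {"employee_id", "5"},
            {"employee_id", "8"},
            {"employee_id", "12"},
            {"comment_type", "request_for_approval"},
            {"comment_type", "redirection"},
            {"comment_type", "approval"},
        };
    }

    // Flattened view of one document: field name -> all values seen for it.
    // The position of each value within its field is lost here.
    static Map<String, Set<String>> flatten(String[][] fields) {
        Map<String, Set<String>> byField = new HashMap<>();
        for (String[] field : fields) {
            byField.computeIfAbsent(field[0], k -> new HashSet<>()).add(field[1]);
        }
        return byField;
    }

    // Conjunction of required field:value clauses, each checked independently.
    static boolean matchesAll(Map<String, Set<String>> doc, String[][] required) {
        for (String[] clause : required) {
            Set<String> values = doc.get(clause[0]);
            if (values == null || !values.contains(clause[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = flatten(exampleInvoice());

        // Intended match: employee 5 did make the request for approval.
        System.out.println(matchesAll(doc, new String[][] {
            {"employee_id", "5"}, {"comment_type", "request_for_approval"}})); // true

        // Unwanted match: employee 5 and "approval" both occur in the
        // document, but not in the same comment.
        System.out.println(matchesAll(doc, new String[][] {
            {"employee_id", "5"}, {"comment_type", "approval"}})); // true
    }
}
```

Both queries match because each clause is satisfied somewhere in the
document; nothing ties a given employee_id to the comment_type that was
added alongside it.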

It may seem like it would be simpler to just have two separate indexes: a
comments index and an invoice index. But I also want to be able to search
the invoice description along with the comments. I could set the granularity
of the index to the comments level, but then I am duplicating a lot of text
in the invoice description. Also, I only care about returning the invoice,
so I will have to merge results if the granularity is set to the comments
level, which will ruin Lucene's scoring (?).

This is a made-up example, but I think it describes pretty thoroughly the
problem I'm trying to solve. In my real world problem, I'm storing the
full-text of web pages, and I really don't want to be duplicating that much
text to set the granularity lower.
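
One possible direction that keeps a single document per invoice, without
repeating the description, is the combined-value idea from my earlier
message quoted below (page_user_title: ajax|news), applied to the invoice
example. The following is only a rough sketch in plain Java, not Lucene
API: the '|' separator and all class and method names are assumptions, and
it presumes each combined value would be indexed as one untokenized term
and that '|' never occurs in the data.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of this is Lucene API.
final class CombinedTokenDemo {

    // Assumed separator; it must never appear inside a part.
    static final String SEPARATOR = "|";

    // Join the parts of one comment into a single token, e.g. "5|approval".
    static String token(String... parts) {
        return String.join(SEPARATOR, parts);
    }

    // Build one combined token per comment from two parallel lists.
    static Set<String> combinedTokens(List<String> employeeIds, List<String> commentTypes) {
        if (employeeIds.size() != commentTypes.size()) {
            throw new IllegalArgumentException("parallel lists must have equal length");
        }
        Set<String> tokens = new LinkedHashSet<>();
        for (int i = 0; i < employeeIds.size(); i++) {
            tokens.add(token(employeeIds.get(i), commentTypes.get(i)));
        }
        return tokens;
    }

    // A pair query becomes an exact match on a single combined token.
    static boolean matchesPair(Set<String> tokens, String employeeId, String commentType) {
        return tokens.contains(token(employeeId, commentType));
    }

    public static void main(String[] args) {
        Set<String> tokens = combinedTokens(
            Arrays.asList("5", "8", "12"),
            Arrays.asList("request_for_approval", "redirection", "approval"));

        System.out.println(matchesPair(tokens, "5", "request_for_approval")); // true
        System.out.println(matchesPair(tokens, "5", "approval"));             // false
    }
}
```

The drawback is that this only answers queries on the exact combination
that was joined. Searching on a subset of the parts (say, employee and date
but not comment type) would need its own combined field, so it does not
obviously generalize to any number of parallel fields.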

Mark Ferguson

On Tue, Nov 18, 2008 at 2:29 PM, Mark Ferguson <> wrote:

> Thanks for the suggestion, but I think I will need a more robust solution,
> because this will only work with pairs of fields. I should have specified
> that the example I gave was somewhat contrived, but in practice there could
> be more than two parallel fields. I'm trying to find a general solution that
> I can apply to any number of parallel fields holding any kind of data.
>
> I was thinking of trying something along the lines of a multi-value field.
> So for example, I could have:
>
> page_user_title: ajax|news (where | is a field separator)
>
> The problem is I don't know how to formulate the query that would be
> equivalent to +username:ajax +page_title:news, or if it's even possible. (I
> should also mention that I am creating the queries programmatically, not
> using the query parser, so anything goes).
>
> Any other ideas?
>
> Mark Ferguson
>
> On Tue, Nov 18, 2008 at 1:06 PM, Ian Lea <> wrote:
>> How about using variable field names?
>>  url:
>>  page_description: cnn breaking news
>>  page_title_ajax: news
>>  page_title_paris: cnn news
>>  page_title_daniel: homepage
>>  username: ajax
>>  username: paris
>>  username: daniel
>> and search for +username:ajax +page_title_ajax:news or maybe just
>> page_title_ajax:news.  Might not even need to store user.
>> --
>> Ian.
>> On Tue, Nov 18, 2008 at 5:48 PM, Mark Ferguson
>> <> wrote:
>> > Hello,
>> >
>> > I am designing an index in which one url corresponds to one document.
>> > Each document also contains multiple parallel repeating fields. For
>> > example:
>> >
>> > Document 1:
>> >  url:
>> >  page_description: cnn breaking news
>> >  page_title: news
>> >  page_title: cnn news
>> >  page_title: homepage
>> >  username: ajax
>> >  username: paris
>> >  username: daniel
>> >
>> > In this contrived example, user 'ajax' has saved the URL with the page
>> > title 'news', 'paris' has saved it with 'cnn news', and 'daniel' has
>> > saved it with 'homepage'.
>> >
>> > What I need to be able to do is perform a search for a particular user
>> > and a particular title, but they must occur together. For example,
>> > +username:ajax +page_title:news would return this document, but
>> > +username:ajax +page_title:homepage would not.
>> >
>> > I am open to changing the design of the document (i.e. using repeating
>> > fields isn't required), but I do need to have one document per url. I am
>> > looking for suggestions for a strategy on implementing this requirement.
>> >
>> > Thanks,
>> >
>> > Mark Ferguson
>> >