lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael J. Prichard" <>
Subject Re: When to use HitCollector?
Date Sun, 07 Jan 2007 23:18:45 GMT
Hey Erick (and List).

Yeah you have it pretty correct.  I am totally kicking myself in the ass 
because I wrote all the indexing stuff! of our indexes is 
about 8GB so it will take a while to rebuild. I figure I can put a 
temporary fix out and then refactor the index with the right info and 
make it more efficient.

I basically wrote the logic to search email with a FilteredQuery so it 
can remove what I don't want and then a seperate document query.  This 
is all in a BooleanQuery.   I then wrote my own code to parse through 
the hits.  It is not the best thing in the world but it seems to work ok.

I appreciate your help!


Erick Erickson wrote:

> I'm a little fuzzy on the structure of your index, but here's a stab....
> First, let me see if I understand your problem...
> For an e-mail, you have a body and an attachment that are indexed as
> separate lucene documents.
> For the body, you include from, to, cc (in other words, meta-data)
> For the attachment, you do NOT include the meta-data.
> For both the body and the attachment, you have an ID for the parent 
> e-mail
> that is the same for a body and attachment if they are from the same 
> e-mail
> (otherwise I don't see how you "determine the email to display" for an
> attachment).
> You've got a couple of problems here. Anything you do to break up the
> clauses into separate queries will do bad things for your relevance 
> scoring.
> That is, one query on the body and one on the attachment will give you 
> two
> lists that you'll then have to manually reconcile if relevancy matters.
> Depending upon how many emails and attachments you get hits for, you 
> could
> do something like
> 1> search for the body elements with the to/from/cc. Use the return 
> (perhaps
> with a HitCollector (definitely NOT a Hits object)) to assemble a clause
> like ID=52343 or ID=985 or ID=8910 .... and the re-submit some query like
> "text contains "search data" and (ID=52343 or ID=985 or ID=8910 ....)"
> BEWARE that, depending upon how many e-mails you get, you'll run afoul of
> TooManyClauses exceptions. The default is 1,024 but you can make it as 
> big
> as memory/time allows. And, as you say, this is temporary until you
> reconstruct your index.
> If this is totally irrelevant, perhaps you could add some more detail....
> Best
> Erick
> On 1/7/07, Michael J. Prichard <> wrote:
>> I have an index which has email and their attachments indexed.  This is
>> ok but the issue I am having it when I am trying to filter the
>> searches.  For example I can search the content of the email and the
>> document (i.e. the attachment) and return the right results.  Basically,
>> if it is a document I check the DB to see its parent and determine the
>> email to display.  The problem comes in when I try to use to, from
>> and/or cc in my searches.  It will only return emails since we did not
>> index those fields along with the attachments.  Ideally we would reindex
>> and add those but I need a temporary fix until we can do that.  SO...I
>> tried a few various things including a basic search and then filtering
>> on my own but that seriously slowed our interface since I had to check
>> each result, etc.  SO...I broke the query into the docs and
>> emails seperately and only check the documents on return.  That is ok.
>> I was wondering...would HitCollector be something i should use.
>> Basically have the searcher check documents to make sure they are ok to
>> go (i.e. to, from. etc is correct)?
>> Make sense?
>> Thanks!
>> Michael
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message