lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Murzaku" <>
Subject RE: Very large queries?
Date Thu, 27 Mar 2003 20:59:36 GMT
Is this some kind of topic categorization in which you "query" the
taxonomy with documents? If yes, I have done something similar with
e-mail categorization and it worked fine (even though it sounds like
abuse). Also, you could decrease the number of query terms by submitting
only the list of unique terms and corresponding frequencies as weights.

-----Original Message-----
From: [] 
Sent: Thursday, March 27, 2003 3:32 PM
Subject: Very large queries?

Caveat:  I have not yet installed Lucerne or begun to experiment with it
yet.  I have scanned the FAQ, but don't see anything that addresses this
question.  Pardon the somewhat slow buildup to the question below, but I
want to set the context.

I am developing an application for 'text mining' adverse event reports
in the pharmaceutical industry.  The querying will be driven by
'dictionaries', 'thesauri',  'taxonomies' or 'ontologies' (pick your
favorite) of drug names, compounds, and medical conditions.  These
thesauri are quite large.  For example, our drug name thesaurus is on
the order of 60,000+ terms.

I was planning on using Verity to accomplish my first approach to
shallow text mining since Verity is our corproate-wide search engine
technology and it supports a number of relevant features (including
'topic sets' for representing the taxonomies).  However, Verity imposes
restrictions on the size of topic sets that currently prohibit me from
using it with our large taxonomies.  It is not obvious that they will be
able to fix this problem in the timeframe I need.  Thus I am turning to
other alternatives, and Lucerne appears to be one.

So given that context, my question is this:  Does anyone on this list
have experience attempting to use very large queries (potentially
thousands or tens of thousands of terms) in Lucerne?  Does anyone have
any knowledge of design or implementation details that would inhibit the
use of such queries?  Does anyone have any idea of what the performance
would be like in retrieving via such queries?

Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message