lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering
Date Mon, 31 May 2004 13:14:13 GMT
On May 30, 2004, at 10:34 PM, Sasha Haghani wrote:
> I am a newbie to Lucene and I'm considering using it in an upcoming 
> project.
> I've read through the documentation but I still have a number of 
> questions:

I'll do my best with some pointers below...

> 1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE
> In my use case, I have a number of logical websites backed by the same
> underlying content store.  A Document may be ultimately end up 
> "belonging"
> to one or more logical sites, but at a distinct URL for each.  The
> simplistic solution is to maintain indices for each logical site, but 
> this
> will result in some unwanted duplication and the need to update 
> multiple
> indices on "shared" content changes.  Other than that, can anyone 
> suggest
> approaches for how to segment a single index to accomodate multiple 
> logical
> sites and allow queries within a particlar site's scope?  Are fields 
> the
> solution?  How should the distinct per-site URLs be managed?

I don't think there is a definitive "best" way to do this.  Per-site 
indexes is one option.  Using a "site" field is another.  Queries for a 
particular site could be done either by using QueryFilter or by 
wrapping all queries in a BooleanQuery with a required TermQuery for 
the "site".

Sites could share documents by simply adding multiple 
Field.Keyword("site", site) to the documents.

> 2. LOCALIZED CONTENT
> I understand that at its core, Lucene can support content from any 
> locale
> and character set supported by Java.  What is the best way of 
> implementing
> Lucene to handle a content base which includes numerous locales.  One 
> index
> per locale or should all Documents be placed in a single index and 
> tagged
> with a "locale" field?  Or is there another approach altogether?

Again, there isn't really a "best" way, I don't think.  How does the 
locale situation relate to the previously mentioned site separation?  A 
"locale" field is a perfectly reasonable way to go also.  I don't know 
of any other approach.

> 3. DOCUMENT URLS
> Is the URL at which the original document can be retrieved generally 
> (i.e.,
> for linking search results to the original doc) stored as a non-index,
> non-tokenized, stored Field in the Document?

It depends on whether you want to query for it or not.  Field.Keyword 
if you want to be able to query for it.  Field.UnIndexed if you want it 
with the attributes you specified.

> 4. QUERY FILTERING & SORTING BY FIELD VALUE
> In my application I have a pretty typical need to distinguish between
> different document types (e.g., FAQs, Articles, Reviews, etc.) in 
> order to
> allow the user to restrict their results to particular types of 
> documents or
> to sort results by type.  Are fields again the solution for this?  Can
> Queries filter or sort results/hits on exact field values (i.e.,
> non-tokenized field values).

Fields are generally the solution :)  What else is there?  Documents 
have Fields.  Fields are where you put metadata about documents.  A 
document type makes perfect sense to put in a field.

QueryFilter or the BooleanQuery AND trick mentioned above would allow 
you to narrow results down to a particular set of types.  Sorting works 
on exact values, yes, and you can write your own sorting implementation 
if lexicographic or numeric sorting are not sufficient which could key 
off external information if needed.  To sort on a field, it needs to be 
indexed and non-tokenized (stored is irrelevant).  There must be only a 
single term for that field in a document.  Check the Javadocs for the 
Sort class for more details on the sorting requirements.

> 5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT
> How is Lucene to be deployed in a clustered web-app environment?  Do 
> all
> cluster nodes require access to a networked filesystem containing the 
> index
> files or is there another solution?  How is concurrency managed when 
> the
> index is being incrementally updated?

This is entirely up to you to manage.  I'm sure developers building 
solutions with Lucene have employed all sorts of various architectures.

Concurrency is managed via lock files that need to be shared among apps 
interacting with the index.  The short answer is only a single process 
(but multiple threads sharing an IndexWriter) can index at a time.  You 
would probably want to build some sort of queuing infrastructure and 
have a single indexer, or index into separate indexes and merge them.

> Any answers and suggestions are much appreciated.  Thanks.

I hope this helps some.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message