lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Implementing CMS search function using Lucene
Date Sun, 13 Apr 2008 10:20:01 GMT
For starters, you might have a look at Jackrabbit (Content Repo. built  
on Lucene) as I know it powers several CMS systems.

More below.

On Apr 3, 2008, at 8:24 AM, Илья Казначеев wrote:

> Hello.
> We've designing a CMS in Java, and I've trying to implement site  
> search
> function using lucene.
> The basic conception is that:
> - Site features numerous objects that we'd like to throw into index:  
> pages,
> various text blocks on those pages, descriptions and keyword lists  
> of those
> pages, static bits of html, goods sections with goods inside them,  
> etc, etc.
> - There would be a search form that would be occasionally used by site
> visitors.
> - Visitors are highly unlikely to use advanced queries. I assume 95%  
> queries
> would be either a few keywords or a phrase to search. We have to  
> find the
> best matches for such queries.
> - The thing I want to introduce is "phrase in quotes" to search for  
> exact
> phrase. Also, most our sites are in Russian, so some, even if  
> rudimentary,
> support for Russian morphology is a plus.
> I've dug into examples and have a following set of questions:
> - Our objects are fairly structured, so I would like to introduce a  
> lot of
> fields, something like five different for each object type.
> But, as far as I see, all Queries are going to search only one field.
> This is certainly bad because users surely want to search *all* the  
> fields at
> once. The aren't going to bother with queries.

You can search whatever fields you want.  It's all a question of how  
you generate your queries.  Typically, one has an "all" field as well.

> Maybe I can add queries over every field joined by 'or' operation, but
> wouldn't that be too slow?

Probably not.  I suppose if you are talking hundreds of fields

> I don't want it to work more than half second on
> reasonable sized index. Also, I don't want to hard-code exact list  
> of fields,
> I might add them as I develop the system. Is this doable, would that  
> work? Or
> I'll have to stuff all text content from object into one blob-field  
> and query
> that? Which way is more reasonable?

I'd probably do both, then you can handle generic queries as well as  
field specific ones.

> - Our objects have their hierarchy, e.g., blocks belong to page. Is  
> there a
> way to make Lucene govern parent-child relation, somehow summing  
> hits in all
> childs to find the best-matching parent? I assume, no, then is there  
> a way
> for me to go thru matching documents list, reducing it by 'adding'  
> blocks'
> scores to find the best matching page?

You have to manage parent-child yourself.  I am pretty sure Jackrabbit  
does this.

> - Is there a way to set weights for different fields? Let's say,  
> content have
> a weight of 1, title have a weight of 5 and picture subscribe have a  
> weight
> of 0.5. If no, can I do that by hand?


> - Is there something to support Russian morphology (it's all like  
> "the last n
> letters of a word might change, we should match all forms") for either
> indexer or searcher?

Check contrib/analyzers, I see some Russian analyzers in there, but I  
can't speak to the quality.

> Maybe "inexact match", QueryParser's ~ operator, would
> be enough? I heard Nutch project have something like that, but I  
> wonder if I
> would be able to reuse parts of Nutch, and I surely can't use Nutch  
> as a
> whole.

Not sure.

> If there are another considerations, they're welcome.
> Thanks for your probable replies.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

Lucene Helpful Hints:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message