lucene-java-user mailing list archives

From Ian Lea <ian....@gmail.com>
Subject Re: Analysers for newspaper pages...
Date Mon, 28 Nov 2011 20:51:13 GMT
You can easily use just the CommonGrams stuff from Solr in your pure
Lucene project.

There are a couple of useful docs on stop words and common grams et al at

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
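
For anyone unfamiliar with the idea, here is a rough sketch of what
index-time common grams do: every token is kept, and an extra bigram is
emitted whenever either word of an adjacent pair is a common word. This is
plain Java and does not use the Solr classes; the class name, the
commonGrams helper and the "_" separator are assumptions made for
illustration, so check the real filters for their exact behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {

    // Hypothetical helper, not a Lucene or Solr API.
    // Keeps every token and adds a joined bigram for each adjacent pair
    // in which at least one of the two tokens is a common word.
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    // "_" as the joiner is an assumption of this sketch
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<String>(Arrays.asList("the", "of"));
        List<String> tokens = Arrays.asList("the", "times", "of", "london");
        // prints [the, the_times, times, times_of, of, of_london, london]
        System.out.println(commonGrams(tokens, common));
    }
}
```

A phrase query can then match on the much rarer bigram terms such as
"times_of" instead of intersecting the huge posting lists for "the" and
"of", which is where the smaller, more accurate result set comes from.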

--
Ian.

On Mon, Nov 28, 2011 at 8:31 PM, Dawn Zoë Raison <dawn@digitorial.co.uk> wrote:
> Hi Steve,
>
> On 28/11/2011 19:43, Steven A Rowe wrote:
>>
>> I assume that when you refer to "the impact of stop words," you're
>> concerned about query-time performance?  You should consider the possibility
>> that performance without removing stop words is good enough that you won't
>> have to take any steps to address the issue.
>
> Not too fussed about query-time performance; certainly no-one has complained
> so far. It's more the sheer number of junk pages we get searching on phrases
> that contain stop words - it can lead to hundreds of thousands of results,
> and the pedants among our userbase insist on paging through the lot :-|
>
> I'd much rather contain the stop words using a *gram-based approach and
> offer a less populous but more accurate result set.
>
>>
>> That said, there are two filters in Solr 3.X[1] that would do the
>> equivalent of what you have outlined:
>> CommonGramsFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html>
>> and
>> CommonGramsQueryFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>.
>
> We use Lucene directly, but I'll take a look - Thanks.
>
>> You can use these filters with a Lucene 3.X application by including the
>> (same-versioned) solr-core jar as a dependency.
>>
>> Steve
>
> --
>
> Rgds.
> *Dawn Raison*
>
>
