lucene-solr-user mailing list archives

From Rob Brown <>
Subject Re: Which Tokeniser (and/or filter)
Date Wed, 08 Feb 2012 08:05:31 GMT
Apologies if things were a little vague.

Given the example snippet to index (numbered to show which searches need to match which lines):

1: i am a sales-manager in here
2: using and .net daily
3: working in design.
4: using something called sage 200. and i'm fluent
5: german sausages.
6: busy A&E dept earning £10,000 annually

... all with newlines in place.

I need to be able to match...

1. sales
1. "sales manager"
1. sales-manager
1. "sales-manager"
2. .net
3. design
4. sage 200
6. A&E
6. £10,000

But do NOT match "fluent german" from 4 + 5 since there's a newline
between them when indexed, but not when searched.

Do the filters (WDF in this case) not create multiple tokens? So
splitting on the period in "asp.net" would create tokens for all of
"asp", "asp.", "asp.net", ".net", "net".
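
For reference, something like the following fieldType might be a starting
point — an untested sketch, and the exact parameter values are guesses
that would need checking against the WordDelimiterFilterFactory docs:

```xml
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only; punctuation is handled by WDF below -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDF can emit several tokens per input token:
         generateWordParts=1 -> "sales-manager" also yields "sales", "manager"
         catenateWords=1     -> ...and the catenated "salesmanager"
         preserveOriginal=1  -> the original "sales-manager" is kept too -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With preserveOriginal set, both the original token and the split parts end
up in the index, so "sales-manager", "sales manager" and plain "sales"
should all be able to match — though I'd verify with the analysis page.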



Web Design and Online Marketing

-----Original Message-----
From: Chris Hostetter <>
Subject: Re: Which Tokeniser (and/or filter)
Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

: This all seems a bit too much work for such a real-world scenario?

You haven't really told us what your scenario is.

You said you want to split tokens on whitespace, full-stop (aka: 
period) and comma only, but then in response to some suggestions you added 
comments about other things that you never mentioned previously...

1) evidently you don't want the "." in ".net" to cause a split in tokens?
2) evidently you not only want token splits on newlines, but also 
position gaps to prevent phrases matching across newlines.
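
(For example — and this is a rough sketch with made-up field names, not a 
tested config — one common way to get position gaps is to index each line 
as a separate value of a multiValued field, and let positionIncrementGap 
keep phrase queries from matching across values:

```xml
<!-- a gap of 100 positions between values means a phrase query will not
     match across two values (i.e. lines) unless its slop is >= 100 -->
<fieldType name="text_lines" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="profile" type="text_lines" indexed="true" stored="true"
       multiValued="true"/>
```

...but whether that fits depends on exactly what you're trying to do.)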

...these are kind of important details that affect suggestions people 
might give you.

can you please provide some concrete examples of the types of data you 
have, the types of queries you want them to match, and the types of 
queries you *don't* want to match?

