lucene-solr-user mailing list archives

From Nick Martin <iamntmar...@googlemail.com>
Subject Re: Concatenate multiple tokens into one
Date Thu, 11 Nov 2010 16:27:47 GMT
Hi Robert, All,

I have a similar problem, here is my fieldType, http://paste.pocoo.org/show/289910/
I want to include stopword removal and lowercase the incoming terms. The idea is to take
"Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory.
If anyone can tell me a simple way to concatenate tokens back into one token, similar to
the KeywordTokenizer, that would be super helpful.
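[The transformation Nick describes can be sketched in plain Java string processing. This is only an illustration of the intended analysis chain, not a Solr filter; the stopword set is a hypothetical stand-in for stopwords.txt.]

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ConcatDemo {

    // hypothetical stopword set standing in for stopwords.txt
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("ltd", "the", "dj"));

    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        for (String tok : input.split("\\s+")) {      // whitespace tokenizer
            String t = tok.toLowerCase();             // lowercase filter
            if (STOPWORDS.contains(t)) continue;      // stopword filter
            t = t.replaceAll("[^a-z]", "");           // pattern-replace filter
            out.append(t);                            // concatenate the survivors
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foo Bar Baz Ltd")); // foobarbaz
    }
}
```

The result is the single token that would then feed the EdgeNGram step.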

Many thanks

Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote:

> 
> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
> 
>> Are you sure you really want to throw out stopwords for your use case?  I don't think
autocompletion will work how you want if you do. 
> 
> in our case i think it makes sense. the content is targeting the electronic music /
dj scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also searches for "the beastie boys" and "beastie
boys" should return a match in the autocompletion.
> 
>> 
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam the
tokens back together? Why not just NOT tokenize in the first place. Use the KeywordTokenizer,
which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all,
it just creates one token from the entire input string. 
> 
> I started out with the KeywordTokenizer, which worked well except for the stopword problem.
> 
> For now, i've come up with a quick-and-dirty custom "ConcatFilter", which does what i'm
after:
> 
> public class ConcatFilter extends TokenFilter {
> 
> 	private final TokenStream tstream;
> 
> 	protected ConcatFilter(TokenStream input) {
> 		super(input);
> 		this.tstream = input;
> 	}
> 
> 	@Override
> 	public Token next() throws IOException {
> 
> 		Token token = new Token();
> 		StringBuilder builder = new StringBuilder();
> 
> 		TermAttribute termAttribute = tstream.getAttribute(TermAttribute.class);
> 		TypeAttribute typeAttribute = tstream.getAttribute(TypeAttribute.class);
> 
> 		boolean incremented = false;
> 
> 		// consume the whole stream, appending only "word"-typed terms
> 		while (tstream.incrementToken()) {
> 			if (typeAttribute.type().equals("word")) {
> 				builder.append(termAttribute.term());
> 			}
> 			incremented = true;
> 		}
> 
> 		token.setTermBuffer(builder.toString());
> 
> 		// emit the single concatenated token, or null at end of stream
> 		return incremented ? token : null;
> 	}
> }
> 
> I'm not sure if this is a safe way to do this, as i'm not familiar with the solr/lucene
internals.
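[For readers unfamiliar with the attribute API, the filter's core loop amounts to the following. This is plain Java for illustration only, with a hypothetical Tok pair standing in for the term/type attributes on the stream; it is not Lucene code.]

```java
import java.util.List;

public class ConcatLogicDemo {

    // hypothetical stand-in for a (term, type) attribute pair on a token stream
    record Tok(String term, String type) {}

    // mirrors ConcatFilter.next(): append only tokens typed "word",
    // producing one concatenated term for the whole stream
    static String concatWords(List<Tok> stream) {
        StringBuilder builder = new StringBuilder();
        for (Tok t : stream) {
            if (t.type().equals("word")) {
                builder.append(t.term());
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<Tok> stream = List.of(
                new Tok("george", "word"),
                new Tok("123", "<NUM>"),     // non-"word" types are dropped
                new Tok("clooney", "word"));
        System.out.println(concatWords(stream)); // georgeclooney
    }
}
```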
> 
> 
> best
> 
> 
> -robert
> 
> 
> 
> 
>> 
>> Then lowercase, remove whitespace (or not), do whatever else you want to do to your
single token to normalize it, and then edgengram it. 
>> 
>> If you include whitespace in the token, then when making your queries for auto-complete,
be sure to use a query parser that doesn't do "pre-tokenization", the 'field' query parser
should work well for this. 
>> 
>> Jonathan
>> 
>> 
>> 
>> ________________________________________
>> From: Robert Gründler [robert@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>> 
>> Hi,
>> 
>> i've created the following filterchain in a field type, the idea is to use it for
autocompletion purposes:
>> 
>> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens
separated by whitespace -->
>> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything
-->
>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />  <!-- throw out stopwords -->
>> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""
replace="all" />  <!-- throw out everything except a-z -->
>> 
>> <!-- actually, here i would like to join multiple tokens together again, to provide
one token for the EdgeNGramFilterFactory -->
>> 
>> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
<!-- create edgeNGram tokens for autocomplete matches -->
>> 
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive multiple tokens
on input strings with whitespaces in it. This leads to the following results:
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> - "George Clooney"
>> - etc
>> 
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory, which concatenates
all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>> 
>> If not, are there examples how to implement a custom TokenFilter?
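[To make concrete why a single concatenated token fixes the matching: a plain-Java sketch of front edge n-gram generation, with parameters mirroring minGramSize=1 / maxGramSize=25 from the config above. This is an illustration, not the EdgeNGramFilterFactory implementation.]

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {

    // build prefix ("front") edge n-grams of a token, minGram..maxGram chars
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, token.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // with one concatenated token, the normalized query "georgecloo"
        // only matches entries whose concatenation starts with it
        System.out.println(edgeNGrams("georgeclooney", 1, 25).contains("georgecloo"));  // true
        System.out.println(edgeNGrams("georgeharrison", 1, 25).contains("georgecloo")); // false
        // whereas with separate tokens, the gram "george" alone would also
        // match "George Harrison", "George Smith", etc.
    }
}
```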
>> 
>> thanks!
>> 
>> -robert
>> 
>> 
>> 
>> 
> 

