lucene-java-user mailing list archives

From: Jack Krupansky
Subject: Re: Twitter analyser
Date: Tue, 05 Nov 2013 16:32:00 GMT

You can specify custom character types with the word delimiter filter, so 
you could define "@" and "#" as "digit" and set SPLIT_ON_NUMERICS. This 
would cause "@foo" to tokenize as two adjacent terms, "@" and "foo"; ditto 
for "#foo". Unfortunately, a user name or tag that starts with a digit would 
not tokenize as desired, but that seems uncommon. A query for "foo" would 
match all three forms, since the "@" or "#" is tokenized as a separate term.


public WordDelimiterFilter(TokenStream in,
                           byte[] charTypeTable,
                           int configurationFlags,
                           CharArraySet protWords)
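
To make the expected splitting concrete, here is a rough stand-alone sketch in plain Java. It does not call Lucene at all: the class name MentionSplitSketch, the type codes, and the 128-entry table are made up for illustration, and the real filter has more rules than this. It only mimics the idea of a character type table in which "@" and "#" are typed as digits and a token is split wherever the character type changes.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone simulation of the idea, NOT Lucene code. All names here are
// hypothetical and exist only for this illustration.
class MentionSplitSketch {
    // Made-up type codes standing in for the filter's character classes.
    static final byte ALPHA = 1;
    static final byte DIGIT = 2;
    static final byte DELIM = 3;

    // Build an ASCII type table; '@' and '#' are deliberately typed as DIGIT.
    static byte[] buildTable() {
        byte[] table = new byte[128];
        for (int c = 0; c < table.length; c++) {
            if (Character.isLetter(c)) {
                table[c] = ALPHA;
            } else if (Character.isDigit(c)) {
                table[c] = DIGIT;
            } else {
                table[c] = DELIM;
            }
        }
        table['@'] = DIGIT;
        table['#'] = DIGIT;
        return table;
    }

    // Split one whitespace-delimited token at every change of character type.
    // Delimiter characters are dropped; non-ASCII characters count as ALPHA.
    static List<String> split(String token) {
        byte[] table = buildTable();
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        byte previous = DELIM;
        for (int i = 0; i < token.length(); i++) {
            char ch = token.charAt(i);
            byte type = ch < table.length ? table[ch] : ALPHA;
            if (type != previous && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (type != DELIM) {
                current.append(ch);
            }
            previous = type;
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("@foo"));     // mention
        System.out.println(split("#foo"));     // hashtag
        System.out.println(split("foo"));      // plain word
        System.out.println(split("@4square")); // name starting with a digit
    }
}
```

In this sketch "@foo" comes out as "@" followed by "foo", while a name that starts with a digit, such as "@4square", keeps the "@" glued to the leading digit. That is the limitation mentioned above.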


-- Jack Krupansky
-----Original Message----- 
From: Stéphane Nicoll
Sent: Tuesday, November 05, 2013 2:40 AM
Subject: Twitter analyser


I am building an application that indexes tweets and offers some basic
search facilities on them.

I am trying to find a combination where the following would work:

* foo matches the foo word, a mention (@foo) or the hashtag (#foo)
* @foo only matches the mention
* #foo matches only the hashtag

It should match complete words, so I used the WhitespaceAnalyzer for 

Any recommendations for this use case?

Thanks!

