lucene-java-user mailing list archives

From Stephane Nicoll <stephane.nic...@gmail.com>
Subject Re: Twitter analyser
Date Sat, 09 Nov 2013 08:59:32 GMT
Replying to self: silly me. I am obviously creating the String with the
wrong length. The third constructor argument is a character count, not an
end index, and I am skipping the first character, so
final String term = new String(buffer, 1, length);

should be replaced by
final String term = new String(buffer, 1, length - 1);

and the silly trim can go away. I guess I need more coffee.

S.
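
A minimal, Lucene-free sketch of the fix above. The class and method names (`PrefixStrip`, `stripPrefix`) are hypothetical and chosen only for illustration; the oversized `buffer` is an assumption meant to mimic a reused term buffer in which only the first `length` characters are valid. It relies only on the standard `String(char[] value, int offset, int count)` constructor.

```java
// Illustration only: PrefixStrip and stripPrefix are hypothetical names,
// not part of Lucene or of the original gist.
final class PrefixStrip {

    // Returns the term without its first character (for example '#' or '@').
    // Only the first `length` chars of `buffer` are considered valid.
    static String stripPrefix(char[] buffer, int length) {
        if (length < 1) {
            return "";
        }
        // Start at offset 1 and copy length - 1 chars. Passing `length` here
        // would copy one char past the end of the term.
        return new String(buffer, 1, length - 1);
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];          // larger than the term
        "#SomeTAG".getChars(0, 8, buffer, 0);  // 8 valid chars
        System.out.println(stripPrefix(buffer, 8));
    }
}
```

With a count of `length` instead of `length - 1`, the resulting String picks up whatever character happens to follow the term in the buffer, or the constructor throws `StringIndexOutOfBoundsException` if the buffer is exactly as long as the term.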




On Sat, Nov 9, 2013 at 9:45 AM, Stephane Nicoll
<stephane.nicoll@gmail.com> wrote:

> Hi,
>
> This is what I've tried:
> https://gist.github.com/anonymous/7383104
>
> So far so good, except that something is definitely wrong in my code: the
> synonym does not seem to be emitted as a valid token. This is how my
> indexing analyzer is built:
>
>     private static final class MyIndexAnalyzer extends ReusableAnalyzerBase {
>         @Override
>         protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>             final Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, reader);
>             final TwitterFilter twitterFilter = new TwitterFilter(tokenizer);
>             final LowerCaseFilter filter = new LowerCaseFilter(Version.LUCENE_36, twitterFilter);
>             return new TokenStreamComponents(tokenizer, filter);
>         }
>     }
>
> I am expecting the lowercase filter to pick up the synonym exactly the same
> way as the original token, but it does not. Given the tweet
> "Bla Bla #SomeTAG", "#sometag" matches but "sometag" does not. All other
> use cases not involving a case mismatch work as expected.
>
> Does anyone know what's wrong in my code?
>
> Thanks for the support!
>
> S.
>
>
>
> On Tue, Nov 5, 2013 at 2:17 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> If your universe of items you want to match this way is small,
>> consider something akin to synonyms. Your indexing process
>> emits two tokens, with and without the @ or # which should
>> cover your situation.
>>
>> FWIW,
>> Erick
>>
>>
>> On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll
>> <stephane.nicoll@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I am building an application that indexes tweets and offers some basic
>> > search facilities on them.
>> >
>> > I am trying to find a combination where the following would work:
>> >
>> > * foo matches the foo word, a mention (@foo) or the hashtag (#foo)
>> > * @foo only matches the mention
>> > * #foo matches only the hashtag
>> >
>> > It should match complete words, so I used the WhitespaceAnalyzer for
>> > indexing.
>> >
>> > Any recommendation for this use case?
>> >
>> > Thanks !
>> > S.
>> >
>>
>
>
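
The idea suggested earlier in the thread (index each mention or hashtag both with and without its `@` or `#` prefix) can be sketched in plain Java, without Lucene. Everything here is an assumption made for illustration: `TweetTokenExpander` and `expand` are hypothetical names, whitespace splitting stands in for `WhitespaceTokenizer`, and lowercasing happens before the synonym is derived. Unlike a real `TokenFilter`, this sketch returns a flat list and does not model token positions; in Lucene the stripped form would normally be emitted at the same position as the original token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical helper, not the TwitterFilter from the gist.
final class TweetTokenExpander {

    // Splits on whitespace, lowercases each token, and for tokens starting
    // with '#' or '@' also emits the token without that prefix.
    static List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for blank input
            }
            String lower = token.toLowerCase(Locale.ROOT);
            out.add(lower);
            char first = lower.charAt(0);
            if ((first == '#' || first == '@') && lower.length() > 1) {
                out.add(lower.substring(1)); // the "synonym" without the prefix
            }
        }
        return out;
    }
}
```

With this expansion, a query for `sometag` can match both the plain word and the hashtag, while a query for `#sometag` matches only the hashtag, provided the query side does not strip the prefix.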
