lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Twitter analyser
Date Tue, 05 Nov 2013 15:20:12 GMT
You have to get the values _into_ the index with the special characters;
that's where the issue is. Depending on your analysis chain, the special
characters may not even be in your index to search on in the first
place.

So it's not so much how many different words follow the special characters
as how many special characters there are. What I'm thinking is that, as you
index documents, you detect #foo, #blah, #whatever and index #foo, foo,
#blah, blah, etc. If all you have to do is specially handle tokens that
start with just a few different characters, it's not very hard.
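
A minimal sketch of that idea in plain Java, not Lucene API: the class and
method names below (TweetTokenExpander, expand) are made up for illustration,
and the whitespace split only stands in for whatever tokenizer you use. In a
real analysis chain this would more likely live in a custom TokenFilter that
emits the bare word at the same position as the prefixed token, the way
synonyms are usually handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows which tokens would be indexed.
public class TweetTokenExpander {
    // The "few different chars" that get special handling; '@' and '#' come from the thread.
    private static final String SPECIAL_PREFIXES = "@#";

    /** Splits on whitespace; for each token starting with '@' or '#', also emits the bare word. */
    public static List<String> expand(String tweet) {
        List<String> tokens = new ArrayList<>();
        for (String token : tweet.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is empty or all whitespace
            }
            tokens.add(token); // always keep the original form (#foo, @foo or foo)
            if (token.length() > 1 && SPECIAL_PREFIXES.indexOf(token.charAt(0)) >= 0) {
                tokens.add(token.substring(1)); // additionally index the bare word
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [loving, #lucene, lucene, thanks, @foo, foo]
        System.out.println(expand("loving #lucene thanks @foo"));
    }
}
```

With the tokens indexed this way, a query on foo hits plain words, mentions
and hashtags, while a query on #foo or @foo only hits the prefixed form,
provided the query side also keeps the special characters.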

FWIW,
Erick


On Tue, Nov 5, 2013 at 8:33 AM, Stephane Nicoll
<stephane.nicoll@gmail.com> wrote:

> Hi,
>
> Thanks for the reply. It's an index of tweets, so really any word is a
> target for this. This would mean a significant increase in the size of the
> index. My volumes are really small so that shouldn't be a problem (but
> performance/scalability is a concern).
>
> I have control over the query. Another solution would be to translate a
> query on "foo" to "foo OR #foo OR @foo".
>
> WDYT?
>
> Thanks!
> S.
>
>
>
>
> On Tue, Nov 5, 2013 at 2:17 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > If your universe of items you want to match this way is small,
> > consider something akin to synonyms. Your indexing process
> > emits two tokens, with and without the @ or # which should
> > cover your situation.
> >
> > FWIW,
> > Erick
> >
> >
> > On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll
> > <stephane.nicoll@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am building an application that indexes tweets and offers some basic
> > > search facilities on them.
> > >
> > > I am trying to find a combination where the following would work:
> > >
> > > * foo matches the foo word, a mention (@foo) or the hashtag (#foo)
> > > * @foo only matches the mention
> > > * #foo matches only the hashtag
> > >
> > > It should match complete words, so I used the WhitespaceAnalyzer for
> > > indexing.
> > >
> > > Any recommendation for this use case?
> > >
> > > Thanks !
> > > S.
> > >
> > > Sent from my iPhone
> >
>
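
For comparison, the query-side alternative raised in the quoted thread
(translating a query on foo to "foo OR #foo OR @foo") could be sketched as
below. This is plain string handling, not Lucene API; the names
TweetQueryRewriter and rewrite are hypothetical, and it assumes the index
keeps @ and # in its tokens (e.g. whitespace tokenization) and that the query
side does not strip them either.

```java
// Hypothetical helper, not part of Lucene: rewrites a single user-entered term.
public class TweetQueryRewriter {

    /**
     * A bare word is expanded to match the word, the hashtag and the mention.
     * A term that already starts with '@' or '#' is left alone so that it
     * only matches the mention or the hashtag.
     */
    public static String rewrite(String term) {
        String t = term.trim();
        if (t.isEmpty() || t.startsWith("@") || t.startsWith("#")) {
            return t;
        }
        return t + " OR #" + t + " OR @" + t;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("foo"));  // prints foo OR #foo OR @foo
        System.out.println(rewrite("#foo")); // prints #foo
    }
}
```

The trade-off is the one discussed above: this keeps the index at its
original size at the cost of a three-clause query per bare word, whereas
indexing both forms grows the index but leaves queries untouched.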
