lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Were <chris.w...@gmail.com>
Subject Re: Indexing domain names?
Date Mon, 09 Nov 2009 01:11:47 GMT
Thanks for the tips guys, got it working now.

Cheers,
Chris

On Sun, Nov 8, 2009 at 6:43 AM, Erick Erickson <erickerickson@gmail.com>wrote:

> <<<I am using the StandardAnalyzer as most of the other fields being
> indexed
> are free form text. >>>
>
> If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The
> snippet
> above makes me wonder if you've seen this class......
>
> Best
> Erick
>
> On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx@yahoo.com> wrote:
>
> > > Hi,
> > >
> > > How do I go about indexing domain names? I currently index
> > > the domain, but
> > > it only works if I put the exact full domain in. For
> > > example:
> > >
> > > site:www.youtube.com (this works)
> > > site:youtube.com (this doesn't work)
> > >
> > > I am using the StandardAnalyzer as most of the other fields
> > > being indexed
> > > are free form text. Currently the "site" field is stored
> > > and tokenized.
> >
> > StandardTokenizer recognizes www.youtube.com and youtube.com as singe
> > token. Therefore they do not match. You can use SimpleAnalyzer which uses
> > LetterTokenizer. So
> >
> > www.youtube.com will be broken into three tokens: www youtube com
> > youtube.com     will be boreken into two tokens : youtube com
> >
> > By doing so site:youtube.com will bring you www.youtube.com
> >
> > But query site:youtube.com will also match a document like
> > www.foo.com/youtube.com
> >
> > Note that LetterTokenizer uses Character.isLetter() method to break text.
> > If your input has numbers like www.645cafe.com it will cause you
> problems.
> >
> > In your case it is better to extend CharTonizer and override protected
> > boolean isTokenChar(char c) method according to your needs.
> >
> > > As an additional improvement it would be even better if
> > > something like this
> > > worked:
> > >
> > > site:youtube.com/foo
> >
> > To accomplish this, you can pre-process your queries to strip from first
> > '/' char to the end. You need to convert youtube.com/foo/bla/bla to
> > youtube.com.
> > You can do it in a TokenFilter along with KeywordTokenizer with  writing
> > custom code.
> >
> > Hope this helps.
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message