lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing domain names?
Date Sun, 08 Nov 2009 14:43:08 GMT
<<<I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. >>>

If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The
snippet
above makes me wonder if you've seen this class......

Best
Erick

On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx@yahoo.com> wrote:

> > Hi,
> >
> > How do I go about indexing domain names? I currently index
> > the domain, but
> > it only works if I put the exact full domain in. For
> > example:
> >
> > site:www.youtube.com (this works)
> > site:youtube.com (this doesn't work)
> >
> > I am using the StandardAnalyzer as most of the other fields
> > being indexed
> > are free form text. Currently the "site" field is stored
> > and tokenized.
>
> StandardTokenizer recognizes www.youtube.com and youtube.com as singe
> token. Therefore they do not match. You can use SimpleAnalyzer which uses
> LetterTokenizer. So
>
> www.youtube.com will be broken into three tokens: www youtube com
> youtube.com     will be boreken into two tokens : youtube com
>
> By doing so site:youtube.com will bring you www.youtube.com
>
> But query site:youtube.com will also match a document like
> www.foo.com/youtube.com
>
> Note that LetterTokenizer uses Character.isLetter() method to break text.
> If your input has numbers like www.645cafe.com it will cause you problems.
>
> In your case it is better to extend CharTonizer and override protected
> boolean isTokenChar(char c) method according to your needs.
>
> > As an additional improvement it would be even better if
> > something like this
> > worked:
> >
> > site:youtube.com/foo
>
> To accomplish this, you can pre-process your queries to strip from first
> '/' char to the end. You need to convert youtube.com/foo/bla/bla to
> youtube.com.
> You can do it in a TokenFilter along with KeywordTokenizer with  writing
> custom code.
>
> Hope this helps.
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message