lucene-java-user mailing list archives

From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Indexing domain names?
Date Sun, 08 Nov 2009 10:54:16 GMT
> Hi,
> 
> How do I go about indexing domain names? I currently index
> the domain, but
> it only works if I put the exact full domain in. For
> example:
> 
> site:www.youtube.com (this works)
> site:youtube.com (this doesn't work)
> 
> I am using the StandardAnalyzer as most of the other fields
> being indexed
> are free form text. Currently the "site" field is stored
> and tokenized.

StandardTokenizer recognizes www.youtube.com and youtube.com each as a single token. Since the
two tokens are different, they do not match. You can use SimpleAnalyzer, which uses
LowerCaseTokenizer (a LetterTokenizer that also lowercases). So

www.youtube.com will be broken into three tokens: www youtube com
youtube.com     will be broken into two tokens:   youtube com

With this change, the query site:youtube.com will also find documents indexed as www.youtube.com.

However, the query site:youtube.com will also match a document like www.foo.com/youtube.com,
because the tokens youtube and com appear next to each other there as well.
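
To make the false positive concrete, here is a rough sketch in plain Java (no Lucene classes involved). It assumes the query ends up being matched as a run of consecutive tokens, which is a simplification of what the query parser does with a multi-token term. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: models a phrase match as "query tokens appear consecutively in the document tokens".
class PhraseMatchDemo {

    // True if queryTokens occur as a consecutive run inside docTokens.
    static boolean containsPhrase(List<String> docTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("youtube", "com");

        // Tokens a letter-based tokenizer would produce for www.youtube.com
        List<String> wanted = Arrays.asList("www", "youtube", "com");
        // Tokens for www.foo.com/youtube.com
        List<String> unwanted = Arrays.asList("www", "foo", "com", "youtube", "com");

        System.out.println(containsPhrase(wanted, query));   // true
        System.out.println(containsPhrase(unwanted, query)); // true as well: the false positive
    }
}
```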

Note that LetterTokenizer uses the Character.isLetter() method to decide where to break text.
If your input contains digits, as in www.645cafe.com, the digits are treated as separators and
dropped, which will cause you problems.
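
A small plain-Java sketch of the letter-only rule may help here. It is a stand-in that only imitates the Character.isLetter() splitting, not Lucene's tokenizer itself, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for letter-based tokenization; not Lucene's actual class.
class LetterSplitDemo {

    // Splits text into maximal runs of characters accepted by Character.isLetter().
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("www.youtube.com")); // [www, youtube, com]
        System.out.println(letterTokens("youtube.com"));     // [youtube, com]
        System.out.println(letterTokens("www.645cafe.com")); // [www, cafe, com] (the digits are lost)
    }
}
```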

In your case it is better to extend CharTokenizer and override the protected boolean
isTokenChar(char c) method according to your needs.
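
As a rough idea of what such an override could look like, here is a plain-Java imitation of the isTokenChar hook. It does not extend Lucene's CharTokenizer, so the base class below is a hypothetical stand-in, and treating '-' as a token character is my own assumption about what domain names need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subclass: keeps letters, digits and '-' inside a token, so dots and slashes split.
class HostnameSplitter extends SimpleCharSplitter {
    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-';
    }

    public static void main(String[] args) {
        HostnameSplitter splitter = new HostnameSplitter();
        System.out.println(splitter.tokenize("www.645cafe.com"));     // [www, 645cafe, com]
        System.out.println(splitter.tokenize("my-site.example.org")); // [my-site, example, org]
    }
}

// Stand-in base class that mirrors the idea of an overridable isTokenChar(char) hook.
abstract class SimpleCharSplitter {

    // Subclasses decide which characters belong inside a token.
    protected abstract boolean isTokenChar(char c);

    // Collects maximal runs of token characters.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

In a real analyzer the same predicate would go into the isTokenChar(char c) override of your CharTokenizer subclass.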

> As an additional improvement it would be even better if
> something like this
> worked:
> 
> site:youtube.com/foo

To accomplish this, you can pre-process your queries to strip everything from the first '/'
character to the end. That is, you need to convert youtube.com/foo/bla/bla to youtube.com.
You can do this with custom code in a TokenFilter combined with KeywordTokenizer.
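
The stripping step itself could be as small as the sketch below. It is plain Java with made-up names, and it assumes the site value has no scheme prefix such as http://, since the first '/' would then be part of the scheme.

```java
// Illustration only: normalizes a site value before it is used in a query.
class SiteQueryNormalizer {

    // Returns the part before the first '/', or the value unchanged if there is no '/'.
    // Assumes no scheme prefix (e.g. "http://") is present.
    static String stripPath(String site) {
        int slash = site.indexOf('/');
        return slash < 0 ? site : site.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(stripPath("youtube.com/foo/bla/bla")); // youtube.com
        System.out.println(stripPath("youtube.com"));             // youtube.com
    }
}
```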

Hope this helps.


      
