lucene-java-user mailing list archives

Subject Re: Indexing domain names?
Date Sun, 08 Nov 2009 10:54:16 GMT
> Hi,
> How do I go about indexing domain names? I currently index
> the domain, but
> it only works if I put the exact full domain in. For
> example:
> (this works)
> (this doesn't work)
> I am using the StandardAnalyzer as most of the other fields
> being indexed
> are free form text. Currently the "site" field is stored
> and tokenized.

StandardTokenizer recognizes each full host name as a single token. Therefore the two forms
do not match. You can use SimpleAnalyzer, which uses LetterTokenizer. The full name will then be broken into three tokens (www youtube com), and the shorter one into two tokens (youtube com).

By doing so, a query on the shorter form will bring you the document indexed under the full name.

But query will also match a document like

Note that LetterTokenizer uses the Character.isLetter() method to break text. If your input contains
numbers, it will cause you problems, because digits are not letters and therefore split tokens.
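
To make the splitting concrete, here is a rough plain-Java sketch that imitates letter-based tokenization with Character.isLetter() and lowercasing. It does not use Lucene at all; the class name LetterSplitSketch and the host name web2.example.com are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: imitates splitting text into
// maximal runs of letters, lowercased as SimpleAnalyzer would do.
final class LetterSplitSketch {

    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (dot, digit, slash, ...) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("www.youtube.com"));  // [www, youtube, com]
        System.out.println(letterTokens("youtube.com"));      // [youtube, com]
        // Made-up host with a digit: the '2' is dropped and ends the token.
        System.out.println(letterTokens("web2.example.com")); // [web, example, com]
    }
}
```

The last line shows the problem with numbers: the digit disappears from the token, so a search for the original label would not find it.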

In your case it is better to extend CharTokenizer and override the protected boolean isTokenChar(char
c) method according to your needs.
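
As a rough idea of what such a rule could look like, the sketch below keeps letters, digits and hyphens inside a token and treats everything else as a separator. It is plain Java so it runs without Lucene; in a real CharTokenizer subclass only the body of isTokenChar would be needed. The choice of characters, the class name DomainTokenCharSketch and the sample host name are assumptions, so adjust them to your data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of a per-character token rule.
final class DomainTokenCharSketch {

    // Assumed rule: letters, digits and '-' belong to a token; '.' and
    // everything else separates tokens.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-';
    }

    // Splits text into lowercased tokens using the rule above.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up host name containing a digit and a hyphen.
        System.out.println(tokens("web2.my-site.example.com"));
        // [web2, my-site, example, com]
    }
}
```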

> As an additional improvement it would be even better if
> something like this
> worked:

To accomplish this, you can pre-process your queries to strip everything from the first '/' character to the end.
You need to convert to
You can do it with custom code in a TokenFilter used together with KeywordTokenizer.
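
The stripping step itself is small. Below is a minimal plain-Java sketch of it, applied to the query string before it reaches the analyzer; the class name QueryHostSketch and the sample path are made up. It assumes the input has no scheme prefix such as http://, because the first '/' would then sit inside the scheme.

```java
// Hypothetical pre-processing helper, not part of Lucene.
final class QueryHostSketch {

    // Returns the part of the query before the first '/', or the whole
    // query if it contains no '/'.
    static String stripPath(String query) {
        int slash = query.indexOf('/');
        return slash < 0 ? query : query.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(stripPath("youtube.com/watch")); // youtube.com
        System.out.println(stripPath("youtube.com"));       // youtube.com
    }
}
```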

Hope this helps.

