lucene-java-user mailing list archives

From Sudha Verma <>
Subject URL Tokenization
Date Wed, 23 Jun 2010 18:06:30 GMT

I am new to Lucene and am using version 3.0.2.

I am using Lucene to parse text that may contain URLs. I noticed that
StandardTokenizer keeps email addresses as a single token, but it does not
do the same for URLs. I also looked at the Solr wiki, and although the page
for solr.StandardTokenizerFactory says it tracks a URL token type, that does
not seem to be the case in practice.

Is there an Analyzer implementation that keeps each URL intact as a single
token? Or does anyone have an example of that for Solr or Lucene?
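
To make the desired behavior concrete, here is a minimal sketch in plain Java
of the tokenization I am after. It is not a Lucene Analyzer and uses no Lucene
API. The class name UrlTokenSketch, the method tokenize, and the regular
expressions are all hypothetical and only illustrative: they handle http and
https URLs and simple email addresses, and treat everything else as word
tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain java.util.regex, no Lucene classes.
class UrlTokenSketch {

    // Alternation order matters: a URL is tried first, then an email
    // address, and only then a plain run of word characters.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<url>https?://[^\\s<>\"]+)"
        + "|(?<email>[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+)"
        + "|(?<word>\\w+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (m.group("url") != null) {
                // Drop sentence punctuation that was swept up at the end of the URL.
                token = token.replaceAll("[.,;:!?)]+$", "");
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "See http://lucene.apache.org/java/docs/index.html, "
                    + "or mail user@example.com today.";
        // Each URL and each email address comes out as one token.
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

With this sketch the URL and the email address each come out as one token,
which is what I would like an Analyzer to do.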

Thanks much,
