lucene-java-user mailing list archives

From Steven A Rowe
Subject RE: URL Tokenization
Date Wed, 23 Jun 2010 18:21:13 GMT
Hi Sudha,

There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following
JIRA issue:

It keeps HTTP(S), FTP, and FILE URLs together as single tokens, and e-mail addresses too,
in accordance with the relevant IETF RFCs.
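
To illustrate the behavior described above (this is not the code from the patch), here is a
minimal standalone sketch in plain Java that keeps URLs and e-mail addresses as single tokens.
The class name UrlAwareTokenizerSketch, the tokenize method, and the regular expressions are
all hypothetical; they only approximate the RFC grammars and do not use any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a regex-based approximation, not Lucene's tokenizer.
class UrlAwareTokenizerSketch {

    // http(s), ftp, and file URLs; the last character may not be trailing
    // sentence punctuation, so "http://a.b/c." yields "http://a.b/c".
    private static final String URL =
        "(?:https?|ftp|file)://[^\\s<>\"]*[^\\s<>\".,;:!?)]";

    // Simplified e-mail address: local part, '@', dotted domain.
    private static final String EMAIL =
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+";

    // Fallback: a run of letters or digits.
    private static final String WORD = "[\\p{L}\\p{N}]+";

    // Alternatives are tried in order at each position, so URLs and
    // e-mail addresses win over plain words.
    private static final Pattern TOKEN =
        Pattern.compile(URL + "|" + EMAIL + "|" + WORD);

    // Returns the tokens of the input in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this sketch, calling UrlAwareTokenizerSketch.tokenize on a sentence containing
http://lucene.apache.org/java/docs/ returns that URL as one token rather than splitting it
at the punctuation, which is the effect Sudha is asking for.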


> -----Original Message-----
> From: Sudha Verma
> Sent: Wednesday, June 23, 2010 2:07 PM
> To:
> Subject: URL Tokenization
> Hi,
> I am new to Lucene and I am using Lucene 3.0.2.
> I am using Lucene to parse text which may contain URLs. I noticed the
> StandardTokenizer keeps the email addresses in one token, but not the
> URLs.
> I also looked at Solr wiki pages, and even though the wiki page for
> solr.StandardTokenizerFactory says it keeps track of the URL token type -
> it does not seem to be the case.
> Is there an Analyzer implementation that can keep the URLs intact as one
> token? Or does anyone have an example of that for Solr or Lucene?
> Thanks much,
> Sudha