lucene-java-user mailing list archives

From: Steven A Rowe
Subject: RE: Keep URLs intact and not tokenized by the StandardTokenizer
Date: Thu, 19 Nov 2009 19:15:34 GMT

Hi Sudha,

In the past, I've built regexes to recognize URLs using the information here:

The above, however, is currently a dead link.

Here's the Internet Archive's WayBack Machine's cache of this page from August 2007:


Here's the same content, of unknown vintage, as a text file (even though it has a .html extension):

Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition (but not the 1st
edition), has a section on recognizing URLs in Chapter 5.
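
As a rough illustration of the regex approach, here is a minimal Java sketch. The pattern is a deliberately simplified one written for this example; it is not the pattern from the page mentioned above or from Friedl's book, and it will miss valid URLs (for instance, ones with user info or IPv6 hosts). The class name UrlFinder and its methods are hypothetical, and the sketch only finds URLs in a string; hooking the result into a Lucene tokenizer is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class UrlFinder {

    // Simplified pattern (an assumption, not a full RFC 3986 grammar):
    // scheme, "://", host, optional port, optional path/query/fragment.
    private static final Pattern URL = Pattern.compile(
        "\\b(?:https?|ftp)://"
        + "[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?"   // host, must end in a letter or digit
        + "(?::\\d+)?"                                   // optional port
        + "(?:/[^\\s<>\"]*)?",                           // optional path, up to whitespace or a delimiter
        Pattern.CASE_INSENSITIVE);

    // Punctuation that usually belongs to the sentence, not the URL.
    private static final String TRAILING = ".,;:!?)";

    // Returns every URL-like substring of text, in order of appearance.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(stripTrailingPunctuation(m.group()));
        }
        return urls;
    }

    // Drops sentence punctuation that the greedy path match picked up.
    private static String stripTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && TRAILING.indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "See http://lucene.apache.org/java/docs/ for details.";
        for (String url : findUrls(text)) {
            System.out.println(url);
        }
    }
}
```

Stripping trailing punctuation is a heuristic: a URL that really ends in "." or ")" will be shortened, so adjust that set to suit your documents.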


On 11/19/2009 at 12:58 AM, Sudha Verma wrote:
> Hi,
> I am using Lucene 2.9.1.
> I am reading in free text documents which I index using lucene and the
> StandardAnalyzer at the moment.
> The StandardAnalyzer keeps email addresses intact and does not tokenize
> them. Is there something similar for
> URLs? This seems like a common need. So, I thought I'd check if there
> is anything out there that does it already.
> I'd appreciate any help.
> Thanks,
> sudha
