lucene-java-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: URL Tokenization
Date Wed, 23 Jun 2010 18:21:13 GMT
Hi Sudha,

There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following
JIRA issue: 

   https://issues.apache.org/jira/browse/LUCENE-2167

It keeps URLs (HTTP(S), FTP, and FILE) together as single tokens, and e-mail addresses too,
in accordance with the relevant IETF RFCs.
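
To illustrate the behaviour described above (URLs and e-mail addresses surviving as single tokens), here is a minimal stand-alone sketch in plain Java. It is not the NewStandardTokenizer from the patch and does not use any Lucene API; the class name UrlAwareTokenizerSketch, the tokenize method, and the regular expressions are all hypothetical and far looser than the RFC grammars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a regex-based splitter, not Lucene's tokenizer.
public class UrlAwareTokenizerSketch {

    // Alternatives are tried in order at each position: URL, e-mail, plain word.
    private static final Pattern TOKEN = Pattern.compile(
        // URL with one of the four schemes; must not end in common punctuation
        "(?:https?|ftp|file)://[^\\s<>\"]*[^\\s<>\".,;:!?)]"
        // simplified e-mail address: local part, '@', dotted domain
        + "|[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+"
        // fallback: a run of letters and digits
        + "|[A-Za-z0-9]+");

    // Returns the tokens in order of appearance; unmatched characters are skipped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "See https://issues.apache.org/jira/browse/LUCENE-2167, "
                    + "or mail java-user@lucene.apache.org.";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

The URL alternative is listed first so that a URL is consumed whole before the word rule can split it at "://", "/" or "."; a grammar-based tokenizer achieves the same effect with proper rules rather than this ordering trick.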

Steve

> -----Original Message-----
> From: Sudha Verma [mailto:verma.sudha@gmail.com]
> Sent: Wednesday, June 23, 2010 2:07 PM
> To: java-user@lucene.apache.org
> Subject: URL Tokenization
> 
> Hi,
> 
> I am new to lucene and I am using Lucene 3.0.2.
> 
> I am using Lucene to parse text which may contain URLs. I noticed the
> StandardTokenizer keeps the email addresses in one token, but not the
> URLs.
> I also looked at Solr wiki pages, and even though the wiki page for
> solr.StandardTokenizerFactory says it keeps track of the URL token type -
> it does not seem to be the case.
> 
> Is there an Analyzer implementation that can keep the URLs intact into one
> token? or does anyone have an example of that for Solr or Lucene?
> 
> Thanks much,
> Sudha