lucene-java-user mailing list archives

From Sudha Verma <verma.su...@gmail.com>
Subject Re: URL Tokenization
Date Thu, 24 Jun 2010 16:57:21 GMT
Hi Steve,

Thanks for the quick reply and for implementing support for URL
tokenization. I have another newbie question, this time about applying the
patch.

I have the Lucene 3.0.2 source and I downloaded the patch and tried to apply
it:

lucene-3.0.2> patch -p0 < LUCENE-2167.patch

It comes back with this error message:

....(output truncated)
can't find file to patch at input line 13106
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:


After looking at that line, it seems patch is trying to find
modules/analysis/common/build.xml, which is not part of the official 3.0.2
source release. Thinking about it, maybe I need to use the latest source
(or a nightly build), but I couldn't figure out how to get that. The Hudson
link for nightly builds on the Apache Lucene site seems to be broken. Or
maybe I have a different problem.
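
A rough way to check for this kind of mismatch before running patch is to
list the directories a patch expects and compare them with the source tree.
The sketch below is only an illustration: the function names are
hypothetical, and it assumes an svn-style unified diff applied with -p0, so
each "+++ " header carries a path relative to the checkout root. It is a
heuristic, since an added line whose content itself begins with "++ " would
also look like a header.

```python
import os
import re

# Header line of a unified diff naming the file to be patched, e.g.
# "+++ modules/analysis/common/build.xml\t(working copy)".
TARGET_LINE = re.compile(r"^\+\+\+ ([^\t\n]+)")


def patch_targets(patch_text):
    """Return the distinct target paths named by '+++ ' header lines, in order."""
    targets = []
    for line in patch_text.splitlines():
        match = TARGET_LINE.match(line)
        if not match:
            continue
        path = match.group(1).strip()
        if path != "/dev/null" and path not in targets:
            targets.append(path)
    return targets


def missing_top_level_dirs(patch_text, root="."):
    """Return top-level directories the patch refers to that are absent under root."""
    missing = []
    for path in patch_targets(patch_text):
        if "/" not in path:
            continue  # a file directly under root, so there is no directory to check
        top = path.split("/", 1)[0]
        if top not in missing and not os.path.isdir(os.path.join(root, top)):
            missing.append(top)
    return missing
```

Calling missing_top_level_dirs(open("LUCENE-2167.patch").read(), ".") from
the source root would list any such directories. A directory reported as
missing suggests the patch was made against a different source layout than
the one it is being applied to.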

I'd appreciate any help.

Thanks,
Sudha



On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe <sarowe@syr.edu> wrote:

> Hi Sudha,
>
> There is such a tokenizer, named NewStandardTokenizer, in the most recent
> patch on the following JIRA issue:
>
>   https://issues.apache.org/jira/browse/LUCENE-2167
>
> It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> e-mails too, in accordance with the relevant IETF RFCs.
>
> Steve
>
> > -----Original Message-----
> > From: Sudha Verma [mailto:verma.sudha@gmail.com]
> > Sent: Wednesday, June 23, 2010 2:07 PM
> > To: java-user@lucene.apache.org
> > Subject: URL Tokenization
> >
> > Hi,
> >
> > I am new to Lucene and I am using Lucene 3.0.2.
> >
> > I am using Lucene to parse text which may contain URLs. I noticed the
> > StandardTokenizer keeps the email addresses in one token, but not the
> > URLs.
> > I also looked at Solr wiki pages, and even though the wiki page for
> > solr.StandardTokenizerFactory says it keeps track of the URL token type -
> > it does not seem to be the case.
> >
> > Is there an Analyzer implementation that can keep the URLs intact as one
> > token? Or does anyone have an example of that for Solr or Lucene?
> >
> > Thanks much,
> > Sudha
>
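
The behaviour discussed in this thread, keeping URLs and e-mail addresses
together as single tokens, can be illustrated outside Lucene with a small
regular-expression sketch. This is not the NewStandardTokenizer from
LUCENE-2167 and it does not follow the IETF RFCs; the pattern and the
tokenize function are simplified assumptions meant only to show the idea of
trying the URL and e-mail alternatives before falling back to plain words.

```python
import re

# Alternatives are tried left to right at each position:
#   1. an http(s), ftp, or file URL, running up to whitespace or a quote/bracket
#   2. a simple e-mail address (local part, "@", dotted domain)
#   3. a plain run of word characters
TOKEN = re.compile(
    r"""(?:https?|ftp|file)://[^\s<>"]+"""
    r"""|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"""
    r"""|\w+""",
    re.IGNORECASE,
)


def tokenize(text):
    """Split text into tokens, keeping URLs and e-mail addresses whole."""
    return TOKEN.findall(text)
```

For example, tokenize("See http://lucene.apache.org/java/docs/ or mail
java-user@lucene.apache.org today") keeps the URL and the address as one
token each. Punctuation directly after a URL, such as a sentence-ending
period, would be swallowed into the token by this simplified pattern.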
