lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ahmet Arslan <iori...@yahoo.com>
Subject Re: FW: Use of hyphens in StandardAnalyzer
Date Mon, 25 Oct 2010 00:24:15 GMT
How about replacing "-" with some arbitrary character sequence with MappingCharFilter before
tokenizer and then restoring that '-' with PatternReplaceFilter after the tokenizer?

May be you can just eat '-' with charFilter so that Lawton-Browne becomes LawtonBrowne.


--- On Mon, 10/25/10, Martin O'Shea <appy74@dsl.pipex.com> wrote:

> From: Martin O'Shea <appy74@dsl.pipex.com>
> Subject: FW: Use of hyphens in StandardAnalyzer
> To: java-user@lucene.apache.org
> Date: Monday, October 25, 2010, 12:28 AM
> A good suggestion. But I'm using
> Lucene 3.0.2 and the constructor for a StandardAnalyzer has
> Version_30 as its highest value. Do you know when 3.1 is
> due?
> 
> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe@syr.edu] 
> Sent: 24 Oct 2010 21 31
> To: java-user@lucene.apache.org
> Subject: RE: Use of hyphens in StandardAnalyzer
> 
> Hi Martin,
> 
> StandardTokenizer and -Analyzer have been changed, as of
> future version 3.1 (the next release) to support the Unicode
> segmentation rules in UAX#29.  My (untested) guess is
> that your hyphenated word will be kept as a single token if
> you set the version to 3.1 or higher in the constructor.
> 
> Steve
> 
> > -----Original Message-----
> > From: Martin O'Shea [mailto:appy74@dsl.pipex.com]
> > Sent: Sunday, October 24, 2010 3:59 PM
> > To: java-user@lucene.apache.org
> > Subject: Use of hyphens in StandardAnalyzer
> > 
> > Hello
> > 
> > 
> > 
> > I have a StandardAnalyzer working which retrieves
> words and frequencies
> > from
> > a single document using a TermVectorMapper which is
> populating a HashMap.
> > 
> > 
> > 
> > But if I use the following text as a field in my
> document, i.e.
> > 
> > 
> > 
> > addDoc(w, "lucene Lawton-Browne Lucene");
> > 
> > 
> > 
> > The word frequencies returned in the HashMap are:
> > 
> > 
> > 
> > browne 1
> > 
> > lucene 2
> > 
> > lawton 1
> > 
> > 
> > 
> > The problem is the words 'lawton' and 'browne'. If
> this is an actual
> > 'double-barreled' name, can Lucene recognise it as
> 'Lawton-Browne' where
> > the
> > name is actually a single word?
> > 
> > 
> > 
> > I've tried combinations of:
> > 
> > 
> > 
> > addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> > 
> > 
> > 
> > And single quotes but without success.
> > 
> > 
> > 
> > Thanks
> > 
> > 
> > 
> > Martin O'Shea.
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message