lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: AW: AW: AW: AW: "fuzzy prefix" search
Date Wed, 04 May 2011 19:59:49 GMT
We do have EdgeNGramTokenizer if that is what you are after.
See how Solr uses it here:
http://search-lucene.com/c/Solr:/src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java||EdgeNGramTokenizer
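
To make the expected output concrete: feeding the whole field value through as a single token and then taking front-side edge n-grams should give prefixes of the entire phrase, spaces included. Below is a minimal plain-Java sketch of that behaviour. It does not use Lucene at all; the class name EdgeNGramSketch, the method frontEdgeNGrams and the gram bounds 3..8 are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {

    /**
     * Returns the front edge n-grams of the whole input, treating it as one
     * token (spaces included), for gram lengths minGram..maxGram.
     * Hypothetical helper, not part of any Lucene API.
     */
    public static List<String> frontEdgeNGrams(String phrase, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        // Never ask for a prefix longer than the phrase itself.
        int longest = Math.min(maxGram, phrase.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(phrase.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Brackets make the trailing space in "Merlot " visible.
        for (String gram : frontEdgeNGrams("Merlot del Ticino", 3, 8)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

With bounds 3..8 this prints the prefixes from [Mer] up to [Merlot d]. In a real analyzer the same effect would come from the tokenizer/filter chain rather than from substring calls.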


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Clemens Wyss <clemensdev@mysign.ch>
> To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> Sent: Wed, May 4, 2011 2:07:40 AM
> Subject: AW: AW: AW: AW: "fuzzy prefix" search
> 
> I know this is just an example.
> But even the WhitespaceAnalyzer takes the words apart, which I don't want.
> I would like the phrases as they are (maximum 3 words, e.g. "Merlot del
> Ticino", ...) to be n-grammed. Hence I want to have the n-grams:
> Mer
> Merl
> Merlo
> Merlot
> Merlot
> Merlot  d
> ...
> 
> Regards
> Clemens
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> > Sent: Tuesday, May 3, 2011 23:12
> > To: java-user@lucene.apache.org
> > Subject: Re: AW: AW: AW: "fuzzy prefix" search
> >
> > Clemens - that's just an example. Stick another tokenizer in there,
> > WhitespaceTokenizer for example.
> >
> > Otis
> >
> >
> >
> > ----- Original  Message ----
> > > From: Clemens Wyss <clemensdev@mysign.ch>
> > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > > Sent: Tue, May 3, 2011 4:31:14 PM
> > > Subject: AW: AW: AW: "fuzzy prefix" search
> > >
> > > But doesn't the KeywordTokenizer extract single words out of the
> > > stream? I would like to create n-grams on the stream (field content) as
> > > it is...
> > >
> > > > -----Original Message-----
> > > > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> > > > Sent: Tuesday, May 3, 2011 21:31
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: AW: AW: "fuzzy prefix" search
> > > >
> > > > Clemens,
> > > >
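> > > > Something a la:
> > > >
> > > > public TokenStream tokenStream(String fieldName, Reader r) {
> > > >   // KeywordTokenizer emits the whole field value as one token;
> > > >   // the filter then emits its front edge n-grams of length 1 to 4.
> > > >   return new EdgeNGramTokenFilter(new KeywordTokenizer(r),
> > > >       EdgeNGramTokenFilter.Side.FRONT, 1, 4);
> > > > }
> > > >
> > > > Check out page 265 of Lucene in Action 2.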
> > >  >
> > > >  Otis
> > > >
> > > >
> > >  >
> > > > ----- Original  Message ----
> > > > > From: Clemens Wyss <clemensdev@mysign.ch>
> > > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > > > > Sent: Tue, May 3, 2011 12:57:39 PM
> > > > > Subject: AW: AW: "fuzzy prefix" search
> > > > >
> > > > > What does a simple Analyzer look like that just "n-grams" the
> > > > > docs/fields?
> > > > >
> > > > > class SimpleNGramAnalyzer extends Analyzer {
> > > > >   @Override
> > > > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > > > >     EdgeNGramTokenFilter... ???
> > > > >   }
> > > > > }
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> > > > > > Sent: Tuesday, May 3, 2011 13:36
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: AW: "fuzzy prefix" search
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I didn't read this thread closely, but just in case:
> > > > > > * Is this something you can handle with synonyms?
> > > > > > * If this is for English and you are trying to handle typos, there
> > > > > >   is a list of common English misspellings out there that you could
> > > > > >   use for this perhaps.
> > > > > > * Have you considered n-gramming your tokens? Not sure if this would
> > > > > >   help, didn't read messages/examples closely enough, but you may
> > > > > >   want to look at this if you haven't done so yet.
> > > > > >
> > > > > > Otis
> > > > > >
> > > > > > -----  Original  Message  ----
> > > > > > > From: Clemens Wyss <clemensdev@mysign.ch>
> > > > > > > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > > > > > > Sent: Tue, May 3, 2011 5:25:30 AM
> > > > > > > Subject: AW: "fuzzy prefix" search
> > >  > > > >
> > > > > > > >PrefixQuery
> > > > > > > I'd like the combination of prefix and fuzzy ;-) because people
> > > > > > > could also type "menlo" or "märl" and in any of these cases I'd
> > > > > > > like to get a hit on Merlot (for suggesting Merlot)
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Ian Lea [mailto:ian.lea@gmail.com]
> > > > > > > > Sent: Tuesday, May 3, 2011 11:22
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: "fuzzy prefix" search
> > > > > > > >
> > > > > > > > I'd assumed that FuzzyQuery wouldn't ignore case but I could be
> > > > > > > > wrong. What would be the edit distance between "mer" and
> > > > > > > > "merlot"? Would it be less than 1.5, which I reckon would be the
> > > > > > > > value of length(term)*0.5 as detailed in the javadocs? Seems
> > > > > > > > unlikely, but I don't really know anything about the Levenshtein
> > > > > > > > (edit distance) algorithm as used by FuzzyQuery. Wouldn't a
> > > > > > > > PrefixQuery be more appropriate here?
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ian.
> > > > > > > >
> > > > > > > > On Tue, May 3, 2011 at 10:10 AM, Clemens Wyss
> > > > > > > > <clemensdev@mysign.ch> wrote:
> > > > > > > > > Unfortunately lowercasing doesn't help.
> > > > > > > > > Also, doesn't the FuzzyQuery ignore casing?
> > > > > > > > >
> > > > > > > > >> -----Original Message-----
> > > > > > > > >> From: Ian Lea [mailto:ian.lea@gmail.com]
> > > > > > > > >> Sent: Tuesday, May 3, 2011 11:06
> > > > > > > > >> To: java-user@lucene.apache.org
> > > > > > > > >> Subject: Re: "fuzzy prefix" search
> > > > > > > > >>
> > > > > > > > >> Mer != mer. The latter will be what is indexed because
> > > > > > > > >> StandardAnalyzer calls LowerCaseFilter.
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Ian.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss
> > > > > > > > >> <clemensdev@mysign.ch> wrote:
> > > > > > > > >> > Sorry for coming back to my issue. Can anybody explain why my
> > > > > > > > >> > "simple" unit test below fails? Any hint/help appreciated.
> > > > > > > > >> >
> > > > > > > > >> > Directory directory = new RAMDirectory();
> > > > > > > > >> > IndexWriter indexWriter = new IndexWriter( directory,
> > > > > > > > >> >     new StandardAnalyzer( Version.LUCENE_31 ),
> > > > > > > > >> >     IndexWriter.MaxFieldLength.UNLIMITED );
> > > > > > > > >> > Document document = new Document();
> > > > > > > > >> > document.add( new Field( "test", "Merlot", Field.Store.YES,
> > > > > > > > >> >     Field.Index.ANALYZED ) );
> > > > > > > > >> > indexWriter.addDocument( document );
> > > > > > > > >> > IndexReader indexReader = indexWriter.getReader();
> > > > > > > > >> > IndexSearcher searcher = new IndexSearcher( indexReader );
> > > > > > > > >> > Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f, 0, 10 );
> > > > > > > > >> > // or Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.5f );
> > > > > > > > >> > TopDocs result = searcher.search( q, 10 );
> > > > > > > > >> > Assert.assertEquals( 1, result.totalHits );
> > > > > > > > >> >
> > > > > > > > >> > - Clemens
> > > > > > > > >> >
> > > > > > > > >> >> -----Original Message-----
> > > > > > > > >> >> From: Clemens Wyss [mailto:clemensdev@mysign.ch]
> > > > > > > > >> >> Sent: Monday, May 2, 2011 23:01
> > > > > > > > >> >> To: java-user@lucene.apache.org
> > > > > > > > >> >> Subject: AW: "fuzzy prefix" search
> > > > > > > > >> >>
> > > > > > > > >> >> Is it the combination of FuzzyQuery and Term which makes the
> > > > > > > > >> >> search go for "word boundaries"?
> > > > > > > > >> >>
> > > > > > > > >> >> > -----Original Message-----
> > > > > > > > >> >> > From: Clemens Wyss [mailto:clemensdev@mysign.ch]
> > > > > > > > >> >> > Sent: Monday, May 2, 2011 14:13
> > > > > > > > >> >> > To: java-user@lucene.apache.org
> > > > > > > > >> >> > Subject: AW: "fuzzy prefix" search
> > > > > > > > >> >> >
> > > > > > > > >> >> > I tried this too, but unfortunately I only get hits when the
> > > > > > > > >> >> > search term is at least as long as the word to be looked up.
> > > > > > > > >> >> >
> > > > > > > > >> >> > E.g.:
> > > > > > > > >> >> > ...
> > > > > > > > >> >> > Directory directory = new RAMDirectory();
> > > > > > > > >> >> > IndexWriter indexWriter = new IndexWriter( directory,
> > > > > > > > >> >> >     IndexManager.getIndexingAnalyzer( LOCALE_DE ),
> > > > > > > > >> >> >     IndexWriter.MaxFieldLength.UNLIMITED );
> > > > > > > > >> >> >
> > > > > > > > >> >> > Document document = new Document();
> > > > > > > > >> >> > document.add( new Field( "test", "Merlot",
> > > > > > > > >> >> >     Field.Store.YES, Field.Index.ANALYZED ) );
> > > > > > > > >> >> > indexWriter.addDocument( document );
> > > > > > > > >> >> >
> > > > > > > > >> >> > IndexReader indexReader = indexWriter.getReader();
> > > > > > > > >> >> > IndexSearcher searcher = new IndexSearcher( indexReader );
> > > > > > > > >> >> >
> > > > > > > > >> >> > Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.6f, 1 );
> > > > > > > > >> >> > TopDocs result = searcher.search( q, 10 );
> > > > > > > > >> >> > Assert.assertEquals( 1, result.totalHits );
> > > > > > > > >> >> > ...
> > > > > > > > >> >> >
> > > > > > > > >> >> > > -----Original Message-----
> > > > > > > > >> >> > > From: Uwe Schindler [mailto:uwe@thetaphi.de]
> > > > > > > > >> >> > > Sent: Monday, May 2, 2011 13:50
> > > > > > > > >> >> > > To: java-user@lucene.apache.org
> > > > > > > > >> >> > > Subject: RE: "fuzzy prefix" search
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > Hi,
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > You can pass an integer to FuzzyQuery which defines the
> > > > > > > > >> >> > > number of characters that are seen as prefix. So all terms
> > > > > > > > >> >> > > must match this prefix and the rest of each term is matched
> > > > > > > > >> >> > > using fuzzy.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > Uwe
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > -----
> > > > > > > > >> >> > > Uwe Schindler
> > > > > > > > >> >> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > >> >> > > http://www.thetaphi.de
> > > > > > > > >> >> > > eMail: uwe@thetaphi.de
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > > -----Original Message-----
> > > > > > > > >> >> > > > From: Clemens Wyss [mailto:clemensdev@mysign.ch]
> > > > > > > > >> >> > > > Sent: Monday, May 02, 2011 1:47 PM
> > > > > > > > >> >> > > > To: java-user@lucene.apache.org
> > > > > > > > >> >> > > > Subject: "fuzzy prefix" search
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > I'd like to search fuzzily but not on a full term.
> > > > > > > > >> >> > > > E.g.
> > > > > > > > >> >> > > > I have a text "Merlot del Ticino"
> > > > > > > > >> >> > > > I'd like "mer", "merr", "melo", ... to match.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > If I use FuzzyQuery only "merlot", "merlott" hit. What
> > > > > > > > >> >> > > > query combination should I use?
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Thx
> > > > > > > > >> >> > > > Clemens
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > ---------------------------------------------------------------------
> > > > > > > > >> >> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > >> >> > > > For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

