lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: AW: AW: AW: "fuzzy prefix" search
Date Wed, 04 May 2011 11:48:06 GMT
Shingles won't to that either, so I suspect you'll have to write a custom
tokenizer.

Best
Erick

On Wed, May 4, 2011 at 2:07 AM, Clemens Wyss <clemensdev@mysign.ch> wrote:
> I know this is just an example.
> But even the WhitespaceAnalyzer takes the words apart, which I don't want. I would like
the phrases as they are (maximum 3 words, e.g. "Merlot del Ticino", ...) to be n-gram-ed.
I hence want to have the n-grams.
> Mer
> Merl
> Merlo
> Merlot
> Merlot
> Merlot d
> ...
>
> Regards
> Clemens
>> -----Ursprüngliche Nachricht-----
>> Von: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> Gesendet: Dienstag, 3. Mai 2011 23:12
>> An: java-user@lucene.apache.org
>> Betreff: Re: AW: AW: AW: "fuzzy prefix" search
>>
>> Clemens - that's just an example.  Stick another tokenizer in there, like
>> WhitespaceTokenizer in there, for example.
>>
>> Otis
>> ----
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
>> search :: http://search-lucene.com/
>>
>>
>>
>> ----- Original Message ----
>> > From: Clemens Wyss <clemensdev@mysign.ch>
>> > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
>> > Sent: Tue, May 3, 2011 4:31:14 PM
>> > Subject: AW: AW: AW: "fuzzy prefix" search
>> >
>> > But doesn't the KeyWordTokenizer extract single words out oft he
>> >stream? I would  like to create n-grams on the stream (field content) as it
>> is...
>> >
>> > >  -----Ursprüngliche Nachricht-----
>> > > Von: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> > >  Gesendet: Dienstag, 3. Mai 2011 21:31
>> > > An: java-user@lucene.apache.org
>> > >  Betreff: Re: AW: AW: "fuzzy prefix" search
>> > >
>> > > Clemens,
>> > >
>> > > Something a la:
>> > >
>> > > public TokenStream tokenStream (String  fieldName, Reader r) {
>> > >   return nw EdgeNGramTokenFilter(new  KeywordTokenizer(r),
>> > > EdgeNGramTokenFilter.Side.FRONT, 1, 4); }
>> > >
>> > >
>> > > Check out page 265 of Lucene in Action 2.
>> > >
>> > >  Otis
>> > > ----
>> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene
>> > > ecosystem search :: http://search-lucene.com/
>> > >
>> > >
>> > >
>> > > ----- Original  Message ----
>> > > > From: Clemens Wyss <clemensdev@mysign.ch>
>> > > > To:  "java-user@lucene.apache.org"  <java-user@lucene.apache.org>
>> > >  > Sent: Tue, May 3, 2011 12:57:39 PM
>> > > > Subject: AW: AW: "fuzzy  prefix" search
>> > > >
>> > > > How does an simple Analyzer look that  just "n-grams" the  docs/fields.
>> > > >
>> > > > class  SimpleNGramAnalyzer extends  Analyzer {  @Override
>> > > > public TokenStream tokenStream ( String fieldName,   Reader reader
)
>> > > > {
>> > > >     EdgeNGramTokenFilter...  ???
>> > > > }
>> > > > }
>> > >  >
>> > > > > -----Ursprüngliche Nachricht-----
>> > > > > Von:  Otis  Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
>> > >  > >  Gesendet: Dienstag, 3. Mai 2011 13:36
>> > > > > An: java-user@lucene.apache.org
>> > >  > >  Betreff: Re: AW: "fuzzy prefix" search
>> > > >  >
>> > > > > Hi,
>> > > > >
>> > > > > I  didn't  read this thread closely, but just in case:
>> > > > > * Is this  something  you can handle with synonyms?
>> > > > > * If this is for  English and you are  trying to handle typos,
>> > > > > there is a
>> >list
>> > >  >of
>> > > > > common English misspellings  out there that you  could use
for
>> > > > > this
>> > > perhaps.
>> > > > > * Have you  considered  n-gramming your tokens?  Not sure
if
>> > > > > this would
>> > >  help,
>> > > > > didn't read  messages/examples closely enough, but  you may
want
>> > > > > to
>> > > look at
>> > > > > this if  you haven't done  so yet.
>> > > > >
>> > > > > Otis
>> > > > > ----
>> > >  > > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
>> > > Lucene  ecosystem
>> > > > > search :: http://search-lucene.com/
>> > > > >
>> > > >  >
>> > > > >
>> > > > > ----- Original  Message  ----
>> > > > > > From: Clemens Wyss <clemensdev@mysign.ch>
>> > > >  > > To:  "java-user@lucene.apache.org"   <java-
>> user@lucene.apache.org>
>> > >  > >  > Sent: Tue, May 3, 2011 5:25:30 AM
>> > > > > >  Subject: AW: "fuzzy prefix"  search
>> > > > > >
>> > > >  > > >PrefixQuery
>> > > > > > I'd like the  combination  of prefix and fuzzy ;-) because
>> > > > > > people
>> >could
>> > > > >  >also  type "menlo" or "märl" and in any of these cases
I'd
>> > > > > like  to
>> >get
>> > > > >  >a hit on Merlot (for suggesting  Merlot)
>> > > > > >
>> > > > > > >   -----Ursprüngliche  Nachricht-----
>> > > > > > > Von: Ian  Lea  [mailto:ian.lea@gmail.com]
>> > > > > >  >  Gesendet:  Dienstag, 3. Mai 2011 11:22  >
An:
>> > > > > > java-user@lucene.apache.org
>> > >  > >  > >  Betreff: Re: "fuzzy prefix" search
>> > > >  > > >
>> > > > >  > > I'd assumed that  FuzzyQuery  wouldn't ignore case
but I
>> > > > > could be
>> > > wrong.
>> > > >  > > >  What would be the edit  distance between  "mer"
 and "merlot"?
>> > > Would
>> > > > > > > it be less that 1.5  which I   reckon would be the
value of
>> > > > > > >  length(term)*0.5 as detailed in  the  javadocs?
 Seems
>> > > > > > > unlikely,
>> >but
>> > > > > > > I don't really  know anything about   the Levenshtein
(edit
>> distance)
>> > > > > algorithm as  used by  FuzzyQuery.
>> > > > > > >  Wouldn't a PrefixQuery be  more  appropriate here?
>> > > > > > >
>> > > > > >  >
>> > > > > > >   --
>> > > > > > >  Ian.
>> > > > > > >
>> > > > > > > On Tue, May  3,  2011 at 10:10 AM, Clemens Wyss
>> > > > > > > <clemensdev@mysign.ch>
>> > > >  > >  >  wrote:
>> > > > > > > > Unfortunately  lowercasing doesn't  help.
>> > > > > > > > Also,   doesn't the FuzzyQuery ignore  casing?
>> > > > > > >  >
>> > > > > > > >>   -----Ursprüngliche  Nachricht-----
>> > > > > > > >> Von: Ian Lea   [mailto:ian.lea@gmail.com]
>> > > > > >  >  >>  Gesendet: Dienstag, 3. Mai 2011 11:06
>> > > >  > > > >>  An: java-user@lucene.apache.org
>> > >  > >  > >  >> Betreff: Re: "fuzzy prefix"  search
>> > > > > > >  >>
>> > > > > > >  >>  Mer != mer.  The latter will be  what
is indexed
>> > > > > > > because
>> > > > > > > >> StandardAnalyzer calls   LowerCaseFilter.
>> > > > > > > >>
>> > > > > >  > >>   --
>> > > > > > > >> Ian.
>> > > >  > > > >>
>> > > > > > >  >>
>> > > >  > > > >> On  Tue, May 3, 2011 at 9:56 AM,  Clemens
 Wyss
>> > > > > > > <clemensdev@mysign.ch>
>> > > >  > >  > >>  wrote:
>> > > > > > > >>  > Sorry for coming back  to my issue.
Can anybody
>> > > > > > > >> explain why
>> >my
>> > > > > > > "simple"
>> > > > >  > >  >> unit test below fails? Any  hint/help
 appreciated.
>> > >  > > > > >> >
>> > > > > > > >>  >  Directory  directory = new RAMDirectory();
>> > > > > > > >> IndexWriter
>> > >  > > > >  >> > indexWriter =  new IndexWriter(
 directory, new
>> > > > > >  > >> >  StandardAnalyzer(
>> > > > > > >   Version.LUCENE_31
>> > >  > > > > >> > ),   IndexWriter.MaxFieldLength.UNLIMITED
 ); Document
>> > > document
>> > >  > >  =
>> > > > > > > new
>> > > > > > >  >> > Document();   document.add( new Field(
"test",  "Merlot",
>> > > > > > > >> >  Field.Store.YES,  Field.Index.ANALYZED
) );
>> > > > > > > >> >   indexWriter.addDocument(
>> > > > > > >  >> >  document );  IndexReader indexReader
=
>> > > > > > > indexWriter.getReader();
>> > > > >  > > >> >  IndexSearcher searcher = new  IndexSearcher(
>> > > > > indexReader );
>> > >  > > > > >> > Query q = new FuzzyQuery(   new Term(
 "test", "Mer" ),
>> 0.5f,
>> >0,
>> > > > > > > >> > 10 ); //  or  Query q =  new FuzzyQuery(
new Term(
>> > > > > > > >> > "test",
>> "Mer"
>> > > >  > > >  >> > ), 0.5f); TopDocs  result =  searcher.search(
q, 10
>> > > > );
>> > > > >  > > >> >  Assert.assertEquals( 1,  result.totalHits
 );
>> > > > > >  > >> >
>> > > > > > > >> > -    Clemens
>> > > > > > > >> >
>> > > > > > >  >> >>  -----Ursprüngliche  Nachricht-----
>> > > >  > > > >> >> Von:  Clemens Wyss [mailto:clemensdev@mysign.ch]
>> > > > >  > >  >>  >> Gesendet: Montag, 2. Mai 2011
 23:01
>> > > > > > >  >> >> An: java-user@lucene.apache.org
>> > >  > >  > >  >> >> Betreff: AW: "fuzzy prefix"
 search
>> > > > >  > > >>  >>
>> > > >  > > > >> >> Is it the  combination of FuzzyQuery
and  Term  which
>> > > > makes
>> >the
>> > > > > > >  >> >>  search to go for "word  boundaries"?
>> > > > > > >   >> >>
>> > > > > > > >> >> >    -----Ursprüngliche Nachricht-----
>> > > > > > > >> >>  > Von:  Clemens  Wyss [mailto:clemensdev@mysign.ch]
>> > > > >  > >  >>  >> > Gesendet: Montag, 2. Mai
2011  14:13
>> > > > > >  > >> >> >  An: java-user@lucene.apache.org
>> > >  > >  > >  >> >> > Betreff: AW: "fuzzy  prefix"
 search
>> > > > > > > >>  >>  >
>> > > > > > > >>  >> > I tried this too,  but unfortunately
 I only get
>> > > > > > > >> hits  when
>> > > > > > >  >> >> > the search term is a least
  as long as the word to  be
>> >looked
>> > > up.
>> > > > > > > >> >>   >
>> > > > > > >  >> >> > E.g.:
>> > > >  > > >  >> >> > ...
>> > > > > > >  >>  >> >  Directory directory =
new RAMDirectory();  IndexWriter
>> > > > > > >   >> >> >  indexWriter = new IndexWriter(
directory,  >>  >>
>> > > > > > > > IndexManager.getIndexingAnalyzer(
>> > > >  > >  > >>  >> LOCALE_DE ),
>> > > > > >  > >> >>  >                IndexWriter.MaxFieldLength.UNLIMITED
);
>> > > > > > >  >> >>  >
>> > > > > > >  >> >>  > Document document = new
 Document();
>> document.add(
>> > > new
>> > >  > > > > Field(
>> > > > > > >  >> >>  > "test", "Merlot",
>> > > > > > > >>   >>  >             Field.Store.YES,
  Field.Index.ANALYZED ) );
>> > > > > > >  >>  >>  indexWriter.addDocument(
>> > > > > > > >>  >> >  document  );
>> > > > > > > >>  >> >
>> > > > > > > >> >>  >   IndexReader indexReader
= indexWriter.getReader();
>> > > > > >  >  >> >> > IndexSearcher
>> > > > > >  >  >> >>  > searcher = new IndexSearcher(
 indexReader );  >>
>> > > > > > >>  >
>> > > > > > >  >> >> > Query q = new FuzzyQuery(
  new Term( "test", "Mer"  ),
>> >0.6f,
>> > > > > > > >> >> > 1 );   TopDocs  result = searcher.search(
q, 10 );
>> > > > > > >  >>  >> > Assert.assertEquals(  >>
 >>  > 1,
>> > > > > > > >>  >> result.totalHits ); ...
>> > > > > >  >   >> >> >
>> > > > > > > >> >> >  >  -----Ursprüngliche
 Nachricht-----
>> > > > > > >  >> >> >  > Von: Uwe Schindler
[mailto:uwe@thetaphi.de]  >>
>> > > > > > > >>  > > Gesendet: Montag, 2. Mai 2011
 13:50  >> >> >  > An:
>> > > > > > > java-user@lucene.apache.org
>> > >  > >  > >  >> >> > > Betreff: RE: "fuzzy
 prefix"  search
>> > > > > > > >>  >> >  >
>> > > > > > >  >> >> > > Hi,
>> > >  > > > > >>  >> >  >
>> > > > >  > > >> >> > > You can pass an integer
  to  FuzzyQuery which defines
>> the
>> > > > > > > >> >> >  >  number of  characters
that are seen as prefix.
>> > > > > > > >> >> > So all
>> > >  > > > >  >> >> > > terms must match
>> > >  > > > > >>   >> > > this prefix and the
rest  of each term is matched using
>> > > >fuzzy.
>> > > > > > >  >> >> > >
>> > > > > > > >>  >>  > >  Uwe
>> > > > > > > >> >> >   >
>> > > > > > > >> >> > >  -----
>> > >  > > > >  >> >> > > Uwe Schindler
>> > > >  > > > >>   >> > > H.-H.-Meier-Allee
63, D-28213  Bremen
>> > > > > > >  >> http://www.thetaphi.de
>> > >  > > > >  >> >> > >  eMail: uwe@thetaphi.de
>> > > > > > >  >>  >> >  >
>> > > > > > > >>  >> > > >  -----Original
Message-----
>> > > > > >  >  >> >> > >  > From: Clemens
Wyss
>> > > > > > [mailto:clemensdev@mysign.ch]
>> > > > >  > >  >>  >> > > > Sent: Monday,
May 02,  2011 1:47 PM   >> > > > To:
>> > > > > > >  >> java-user@lucene.apache.org
>> > >  > >  > >  >> >> > > > Subject:  "fuzzy
prefix"  search  >> >> > > >
>> > > >  > > > >>  >> > > > I'd  like to
search  fuzzily but not on a full  term.
>> > > > > > > >>  >> >  > > E.g.
>> > > > > >  > >>  >> > > > I have a text
"Merlot  del  Ticino"
>> > >  > > > > >> >> > > > I'd like
>> > > > >  >  > >>  >> > > > "mer", "merr",
"melo",  ... to  match.
>> > > > > > > >>  >> >  > >
>> > > > > >  > >> >> > > > If  I use  FuzzyQuery
only  "merlot,  "merlott" hit.
>> >What
>> > >  > > > > >> >>  >  > >  Query-combination
should I use?
>> > > > > > > >>   >> > >  >
>> > > > > > > >> >> >  > >  Thx
>> > > > > > > >> >> >   > > Clemens
>> > > > >  > > >> >> > >  >
>> > > > > > > >>   >> > > >
>> > >  > > > > >> >> > >  >
>> > > > >  > > >> >> > > >
>> >--------------------------------------------------------
>> > > > > >  >  >> >> > > > ----
>> > > > > > >  >>  >>  > > > ---
>> > > > > > >  >> >> > > >  ---
>> > > > > > >   >> >> > > > --
>> > > > > >  > >>  >> > > > -  To unsubscribe,
e-mail:
>> > > > >   > > >> >> > > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  >> >> > > > For additional
 commands,  e-mail:
>> > > > > > >  >> >> >  > > java-user-help@lucene.apache.org
   >> >> > >
>> > > > > > > >> >>  >  >
>> > > > > > >  >> >> >  >
>> > > > > > >  >> >> > >
>> > > >  > > > >> >> > >
>> >----------------------------------------------------------
>> > > > >  > >  >> >> > > ----
>> > > > > > >  >>  >> >  > ---
>> > > > > > > >>  >> > > ---
>> > > > > > >  >>  >>  > > - To unsubscribe,
e-mail:
>> > > > > > >  >>  >> > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  >> >> > > For additional  commands,
 e-mail:
>> > > > > > > >>  >> >  > java-user-help@lucene.apache.org
>> > >  > >  > >  >> >> >
>> > > > > >  > >> >>  >
>> > > > > > > >>  >>  >
>> > > > > > > >>  >>
>> >--------------------------------------------------------------
>> > > >  >  > > >> >> --
>> > > > > > >   >> >> >  ---
>> > > > > > > >> >>  > -- To unsubscribe,   e-mail:
>> > > > > > > >>  >> > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  >> >> > For additional commands,
 e-mail:
>> > > > >  > > >>  >> > java-user-help@lucene.apache.org
>> > >  > >  > >  >> >>
>> > > > > > >  >> >>
>> > > > >  > > >> >>
>> > >  > > > > >> >>
>> >--------------------------------------------------------------
>> > > > >  > >  >> >> ----
>> > > > > > >   >> >> --- To  unsubscribe, e-mail:
>> > > > > > >  >> >> java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  >> >> For additional commands,  e-mail:
>> > > > > >  > java-user-help@lucene.apache.org    >> >
>> > > > > > > >> >
>> > > > > >  > >>  >
>> > > > > > > >> >
>> >---------------------------------------------------------------
>> > > >  > >  > >> > ----
>> > > > > > >   >> > -- To unsubscribe,  e-mail:
>> > > > > > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  >> > For additional commands,  e-mail:
>> > > > > >  > java-user-help@lucene.apache.org    >> >
>> > > > > > > >> >
>> > > > > >  >  >>
>> > > > > > > >>
>> > > > > >  > >>
>> >-----------------------------------------------------------------
>> > > >  > >  > >> ----
>> > > > > > >  >> To  unsubscribe, e-mail: java-user-
>> > > unsubscribe@lucene.apache.org
>> > >  > >  > >  >> For additional commands,  e-mail:
>> > > > > > > java-user-help@lucene.apache.org    >
>> > > > > > > >
>> > > > > > > >
>> > >  > > > > >
>> >------------------------------------------------------------------
>> > > >  > >  > > ---
>> > > > > > >  > To  unsubscribe, e-mail:
>> > > > > > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  > For additional commands, e-mail:  java-user-
>> > > help@lucene.apache.org
>> > > >  >  > > >
>> > > > > > > >
>> > > > >  > >
>> > > > > >  >
>> > > > > > >
>> >--------------------------------------------------------------------
>> > >  > >  > > - To  unsubscribe, e-mail:
>> > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > >  For additional commands, e-mail:
>> > > java-user-help@lucene.apache.org  > >  >
>> > > > > >
>> > > > > >
>> ---------------------------------------------------------------------
>> > >  > >  > To  unsubscribe, e-mail:
>> > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  > For  additional commands, e-mail:
>> > > java-user-help@lucene.apache.org  > >  >
>> > > > > >
>> > > > >
>> > > >  >   ---------------------------------------------------------------------
>> > >  > > To  unsubscribe, e-mail:
>> > > java-user-unsubscribe@lucene.apache.org
>> > >  > >  For additional commands, e-mail:
>> > > java-user-help@lucene.apache.org  >
>> > > >
>> > > >
>> > > > ------------------------------------------------------------------
>> > > > ---
>> > >  > To  unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > >  > For  additional commands, e-mail:
>> > > java-user-help@lucene.apache.org  >
>> > > >
>> > >
>> > >
>> > > --------------------------------------------------------------------
>> > > - To  unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > >  For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To  unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For  additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message