lucene-java-user mailing list archives

From Pierre Antoine DuBoDeNa <pad...@gmail.com>
Subject Re: fuzzy queries
Date Sat, 09 Feb 2013 18:24:38 GMT
Well, that's the issue: I do get that result back, but I don't get other,
simpler ones.

With the query "string~ matching~" I get:

Found 14 hits.

1. 1string 2matching

2. str_ing ma_tching

3. strang mutching

4. str4ing match2ing

5. string matching

6. string_m atching

7. string matching another token

8. strrring maatchinng

9. string matching123

10. string123 matching

11. strasding matc4hing ano23ther tok3en

12. str4ing maaatching_another 2t oken

13. strfffing_ m atcbbhing

14. string123 matching123
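
For reference, here is a minimal sketch of the edit-distance reasoning in Jack's reply quoted below: a document matches if EITHER query term is within the allowed number of edits of some token in it. This is plain Java, not Lucene's own fuzzy matching code. The class and method names are made up for illustration, and the token lists assume the analyzer keeps underscore-joined words such as string_matching as a single token, which nobody has confirmed in this thread.

```java
import java.util.Arrays;
import java.util.List;

public class EditDistanceSketch {

    // Classic Levenshtein distance: insert, delete and substitute each cost 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // OR semantics: true if any query term is within maxEdits of any document token.
    static boolean matchesEither(List<String> queryTerms, List<String> docTokens, int maxEdits) {
        for (String q : queryTerms) {
            for (String t : docTokens) {
                if (distance(q, t) <= maxEdits) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("string", "matching");
        // Assumed tokenizations (hypothetical, see the note above).
        List<String> hit = Arrays.asList("strang", "mutching");
        List<String> miss = Arrays.asList("string_matching");
        System.out.println("strang mutching matches: " + matchesEither(query, hit, 2));
        System.out.println("string_matching matches: " + matchesEither(query, miss, 2));
        System.out.println("distance(string, string_matching) = " + distance("string", "string_matching"));
    }
}
```

Under that tokenization assumption, a document indexed as one long token sits far from both query terms, which would be one possible reason for it to drop out of a fuzzy OR query.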

2013/2/9 Jack Krupansky <jack@basetechnology.com>

> You probably are not getting this document returned:
>
>
>    list.add("strfffing_ m atcbbhing");
>
> because... both terms have an edit distance greater than two.
>
> All the other documents have one or the other or both terms with an
> editing distance of 2 or less.
>
> Your query is essentially: Match a document if EITHER term matches. So, if
> NEITHER matches (within an editing distance of 2), the document is not a
> match.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Pierre Antoine DuBoDeNa
> Sent: Saturday, February 09, 2013 12:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: fuzzy queries
>
>
> With a query like string~ matching~ (without specifying a threshold) I get 14
> results back.
>
> Could it be a problem with the analyzers?
>
> Here is the code:
>
> private File indexDir = new File("/a-directory-here");
> private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
> private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
>
> public static void main(String[] args) throws Exception {
>     IndexProfiles Indexer = new IndexProfiles();
>     IndexWriter w = Indexer.CreateIndex();
>
>     ArrayList<String> list = new ArrayList<String>();
>     list.add("string matching");
>     list.add("string123 matching");
>     list.add("string matching123");
>     list.add("string123 matching123");
>     list.add("str4ing match2ing");
>     list.add("1string 2matching");
>     list.add("str_ing ma_tching");
>     list.add("string_matching");
>     list.add("strang mutching");
>     list.add("strrring maatchinng");
>     list.add("strfffing_ m atcbbhing");
>     list.add("str2ing__mat3ching");
>     list.add("string_m atching");
>     list.add("string matching another token");
>     list.add("strasding matc4hing ano23ther tok3en");
>     list.add("str4ing maaatching_another 2t oken");
>
>     for (String companyname : list) {
>         Indexer.addSingleField(w, companyname);
>     }
>
>     int numDocs = w.numDocs();
>     System.out.println("# of Docs in Index: " + numDocs);
>     w.close();
>
>     DoIndexQuery("string~ matching~");
> }
>
> public static void DoIndexQuery(String query) throws IOException, ParseException {
>     IndexProfiles Indexer = new IndexProfiles();
>     IndexReader reader = Indexer.LoadIndex();
>     Indexer.SearchIndex(reader, query, 50);
>     reader.close();
> }
>
> public IndexWriter CreateIndex() throws IOException {
>     Directory index = FSDirectory.open(indexDir);
>     IndexWriter w = new IndexWriter(index, config);
>     return w;
> }
>
> public HashMap SearchIndex(IndexReader w, String query, int topk) throws IOException, ParseException {
>     Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
>     IndexSearcher searcher = new IndexSearcher(w);
>     TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
>     searcher.search(q, collector);
>     ScoreDoc[] hits = collector.topDocs().scoreDocs;
>
>     System.out.println("Found " + hits.length + " hits.");
>     HashMap map = new HashMap();
>     for (int i = 0; i < hits.length; ++i) {
>         int docId = hits[i].doc;
>         Document d = searcher.doc(docId);
>         map.put(docId, d.get("Name"));
>         System.out.println((i + 1) + ". " + d.get("Name"));
>     }
>     searcher.close();
>     return map;
> }
>
> public void addSingleField(IndexWriter w, String str) throws IOException {
>     Document doc = new Document();
>     doc.add(new Field("Name", str, Field.Store.YES, Field.Index.ANALYZED));
>     w.addDocument(doc);
> }
>
>
> 2013/2/9 Michael McCandless <lucene@mikemccandless.com>
>
>> Can you reduce your test case to indexing one document/field and
>> running a single FuzzyQuery (you seem to be running two at once,
>> OR'ing the results)?
>>
>> And show the complete standalone source code (eg what is topk?) so we
>> can see how you are indexing / building the Query / searching.
>>
>> The default minSim is 0.5.
>>
>> Note that 0.01 is not useful in practice: it (should) match nearly all
>> terms.  But I agree it's odd one term is not matching.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
>> <padbdn@gmail.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I use Lucene 3.6 and I am trying to use fuzzy queries so that I can match
>> >> many more results.
>> >>
>> >> I am adding for example these strings:
>> >>
>> >>  list.add("string matching");
>> >>
>> >> list.add("string123 matching");
>> >>
>> >> list.add("string matching123");
>> >>
>> >> list.add("string123 matching123");
>> >>
>> >> list.add("str4ing match2ing");
>> >>
>> >> list.add("1string 2matching");
>> >>
>> >> list.add("str_ing ma_tching");
>> >>
>> >> list.add("string_matching");
>> >>
>> >> list.add("strang mutching");
>> >>
>> >> list.add("strrring maatchinng");
>> >>
>> >> list.add("strfffing_ m atcbbhing");
>> >>
>> >> list.add("str2ing__mat3ching");
>> >>
>> >> list.add("string_m atching");
>> >>
>> >> list.add("string matching another token");
>> >>
>> >> list.add("strasding matc4hing ano23ther tok3en");
>> >>
>> >> list.add("str4ing maaatching_another 2t oken");
>> >>
>> >>
>> >>
>> >> then i do a query:
>> >>
>> >>
>> >> "string~0.01 matching~0.01"
>> >>
>> >>
>> >> and I get back these results:
>> >>
>> >>
>> >> Found 15 hits.
>> >>
>> >> 1. 1string 2matching
>> >>
>> >> 2. str_ing ma_tching
>> >>
>> >> 3. string_m atching
>> >>
>> >> 4. strang mutching
>> >>
>> >> 5. str4ing match2ing
>> >>
>> >> 6. strrring maatchinng
>> >>
>> >> 7. string matching
>> >>
>> >> 8. strasding matc4hing ano23ther tok3en
>> >>
>> >> 9. string matching another token
>> >>
>> >> 10. string matching123
>> >>
>> >> 11. string123 matching
>> >>
>> >> 12. strfffing_ m atcbbhing
>> >>
>> >> 13. string123 matching123
>> >>
>> >> 14. str4ing maaatching_another 2t oken
>> >>
>> >> 15. string_matching
>> >>
>> >> So only 1 result is missing (with threshold 0.01): str2ing__mat3ching. Any
>> >> idea why? How can I extend the query to catch this one as well?
>> >>
>> >> Also, what's the default threshold for the ~ operator? Without specifying a
>> >> threshold I get 14 results, with string_matching and str2ing__mat3ching
>> >> missing this time.
>> >>
>> >> Here is the code for the queries
>> >>
>> >>
>> >>  Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
>> >>  IndexSearcher searcher = new IndexSearcher(w);
>> >>  TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
>> >>  searcher.search(q, collector);
>> >>  ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> >>
>> >>
>> >> Thanks for the help.
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>