lucene-java-user mailing list archives

From Pierre Antoine DuBoDeNa <pad...@gmail.com>
Subject Re: fuzzy queries
Date Sat, 09 Feb 2013 17:52:38 GMT
With a query like string~ matching~ (without specifying a threshold) I get 14
results back.

Could it be a problem with the analyzers?

Here is the code:

private File indexDir = new File("/a-directory-here");
private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);

public static void main(String[] args) throws Exception {
    IndexProfiles Indexer = new IndexProfiles();
    IndexWriter w = Indexer.CreateIndex();

    // Test strings: variations of "string matching" with digits, underscores and typos.
    ArrayList<String> list = new ArrayList<String>();
    list.add("string matching");
    list.add("string123 matching");
    list.add("string matching123");
    list.add("string123 matching123");
    list.add("str4ing match2ing");
    list.add("1string 2matching");
    list.add("str_ing ma_tching");
    list.add("string_matching");
    list.add("strang mutching");
    list.add("strrring maatchinng");
    list.add("strfffing_ m atcbbhing");
    list.add("str2ing__mat3ching");
    list.add("string_m atching");
    list.add("string matching another token");
    list.add("strasding matc4hing ano23ther tok3en");
    list.add("str4ing maaatching_another 2t oken");

    // Index each string as its own single-field document.
    for (String companyname : list) {
        Indexer.addSingleField(w, companyname);
    }

    int numDocs = w.numDocs();
    System.out.println("# of Docs in Index: " + numDocs);
    w.close();

    DoIndexQuery("string~ matching~");
}

public static void DoIndexQuery(String query) throws IOException, ParseException {
    IndexProfiles Indexer = new IndexProfiles();
    IndexReader reader = Indexer.LoadIndex();

    // Return at most 50 hits (this is the topk value).
    Indexer.SearchIndex(reader, query, 50);

    reader.close();
}


public IndexWriter CreateIndex() throws IOException {
    Directory index = FSDirectory.open(indexDir);
    IndexWriter w = new IndexWriter(index, config);
    return w;
}


// Parses the query against the "Name" field, then prints and returns the top-k hits.
public HashMap<Integer, String> SearchIndex(IndexReader w, String query, int topk)
        throws IOException, ParseException {
    Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);

    IndexSearcher searcher = new IndexSearcher(w);
    TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");

    // Map each matching doc id to its stored "Name" value.
    HashMap<Integer, String> map = new HashMap<Integer, String>();
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        map.put(docId, d.get("Name"));
        System.out.println((i + 1) + ". " + d.get("Name"));
    }

    searcher.close();
    return map;
}

public void addSingleField(IndexWriter w, String str) throws IOException {
    Document doc = new Document();
    // Stored and analyzed, so the analyzer decides which terms end up in the index.
    doc.add(new Field("Name", str, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}
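
On the analyzer question above, here is a rough, Lucene-free sketch that may help reason about which indexed terms a fuzzy query can reach. It rests on two assumptions that are not confirmed anywhere in this thread: (1) StandardAnalyzer keeps underscore-joined words such as string_matching as a single token, and (2) the 3.x fuzzy similarity is roughly 1 - editDistance / min(queryTermLength, indexedTermLength) with a zero prefix length. The class and method names (FuzzySimilaritySketch, editDistance, similarity) are made up for illustration and are not Lucene APIs.

```java
public class FuzzySimilaritySketch {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: approximates the Lucene 3.x fuzzy similarity (prefix length 0).
    // The result goes negative when the distance exceeds the shorter term's length.
    static float similarity(String queryTerm, String indexedTerm) {
        int shorter = Math.min(queryTerm.length(), indexedTerm.length());
        return 1.0f - (float) editDistance(queryTerm, indexedTerm) / shorter;
    }

    public static void main(String[] args) {
        String[] queryTerms = {"string", "matching"};
        // ASSUMPTION: these are the single tokens the analyzer would index.
        String[] indexedTerms = {"strang", "string_matching", "str2ing__mat3ching"};
        for (String indexed : indexedTerms) {
            for (String q : queryTerms) {
                System.out.printf("%-20s vs %-8s -> %.3f%n",
                        indexed, q, similarity(q, indexed));
            }
        }
    }
}
```

Under these assumptions, string_matching scores 0.125 against matching (above 0.01 but below the default 0.5), while str2ing__mat3ching scores below zero against both query terms. That would be consistent with the results reported in this thread, but it is worth confirming by inspecting the terms the analyzer actually produces.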





2013/2/9 Michael McCandless <lucene@mikemccandless.com>

> Can you reduce your test case to indexing one document/field and
> running a single FuzzyQuery (you seem to be running two at once,
> OR'ing the results)?
>
> And show the complete standalone source code (eg what is topk?) so we
> can see how you are indexing / building the Query / searching.
>
> The default minSim is 0.5.
>
> Note that 0.01 is not useful in practice: it (should) match nearly all
> terms.  But I agree it's odd one term is not matching.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
> <padbdn@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I use Lucene 3.6 and am trying to use fuzzy queries so that I can match
> >> many more results.
> >>
> >> I am adding for example these strings:
> >>
> >>  list.add("string matching");
> >>
> >> list.add("string123 matching");
> >>
> >> list.add("string matching123");
> >>
> >> list.add("string123 matching123");
> >>
> >> list.add("str4ing match2ing");
> >>
> >> list.add("1string 2matching");
> >>
> >> list.add("str_ing ma_tching");
> >>
> >> list.add("string_matching");
> >>
> >> list.add("strang mutching");
> >>
> >> list.add("strrring maatchinng");
> >>
> >> list.add("strfffing_ m atcbbhing");
> >>
> >> list.add("str2ing__mat3ching");
> >>
> >> list.add("string_m atching");
> >>
> >> list.add("string matching another token");
> >>
> >> list.add("strasding matc4hing ano23ther tok3en");
> >>
> >> list.add("str4ing maaatching_another 2t oken");
> >>
> >>
> >>
> >> then i do a query:
> >>
> >>
> >> "string~0.01 matching~0.01"
> >>
> >>
> >> and I get back these results:
> >>
> >>
> >> Found 15 hits.
> >>
> >> 1. 1string 2matching
> >>
> >> 2. str_ing ma_tching
> >>
> >> 3. string_m atching
> >>
> >> 4. strang mutching
> >>
> >> 5. str4ing match2ing
> >>
> >> 6. strrring maatchinng
> >>
> >> 7. string matching
> >>
> >> 8. strasding matc4hing ano23ther tok3en
> >>
> >> 9. string matching another token
> >>
> >> 10. string matching123
> >>
> >> 11. string123 matching
> >>
> >> 12. strfffing_ m atcbbhing
> >>
> >> 13. string123 matching123
> >>
> >> 14. str4ing maaatching_another 2t oken
> >>
> >> 15. string_matching
> >>
> >> So only 1 result is missing (with threshold 0.01): str2ing__mat3ching. Any
> >> idea why? How can I extend the query to catch this one as well?
> >>
> >> Also, what's the default threshold for the ~ operator? Without specifying
> >> a threshold I get 14 results; string_matching and str2ing__mat3ching are
> >> missing this time.
> >>
> >> Here is the code for the queries
> >>
> >>
> >>  Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer
> >> ).parse(query);
> >>
> >>
> >>
> >>  IndexSearcher searcher = new IndexSearcher(w);
> >>
> >>  TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
> >>
> >>  searcher.search(q, collector);
> >>
> >>  ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >>
> >>
> >> Thanks for the help.
> >>
> >>
>
