lucene-java-user mailing list archives

From Pierre Antoine DuBoDeNa <pad...@gmail.com>
Subject Re: fuzzy queries
Date Sun, 10 Feb 2013 15:13:57 GMT
Does anyone have an idea what's happening? I've tried 4-5 different queries
and many thresholds, but I can't get all the results back.

2013/2/9 Pierre Antoine DuBoDeNa <padbdn@gmail.com>

> Well, that's the issue: I do get this result back, but I don't get other,
> simpler ones.
>
> With the query "string~ matching~" I get:
>
> Found 14 hits.
>
> 1. 1string 2matching
>
> 2. str_ing ma_tching
>
> 3. strang mutching
>
> 4. str4ing match2ing
>
> 5. string matching
>
> 6. string_m atching
>
> 7. string matching another token
>
> 8. strrring maatchinng
>
> 9. string matching123
>
> 10. string123 matching
>
> 11. strasding matc4hing ano23ther tok3en
>
> 12. str4ing maaatching_another 2t oken
>
> 13. strfffing_ m atcbbhing
>
> 14. string123 matching123
>
> 2013/2/9 Jack Krupansky <jack@basetechnology.com>
>
>> You probably are not getting this document returned:
>>
>>
>>    list.add("strfffing_ m atcbbhing");
>>
>> because... both terms have an edit distance greater than two.
>>
>> All the other documents have one or the other or both terms with an
>> editing distance of 2 or less.
>>
>> Your query is essentially: Match a document if EITHER term matches. So,
>> if NEITHER matches (within an editing distance of 2), the document is not a
>> match.
>>
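To make the edit-distance argument above concrete, here is a minimal sketch of plain Levenshtein distance in Java. It is not Lucene's implementation; the class name EditDistanceSketch is made up for illustration, and the token "strfffing_" assumes the analyzer keeps the trailing underscore, which I have not verified.

```java
// Hypothetical helper, not part of Lucene: classic Levenshtein distance
// (insertions, deletions and substitutions each cost 1).
final class EditDistanceSketch {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Query terms against tokens assumed to come from "strfffing_ m atcbbhing".
        System.out.println(levenshtein("string", "strfffing_"));
        System.out.println(levenshtein("matching", "atcbbhing"));
        // A simpler indexed term for comparison.
        System.out.println(levenshtein("string", "strang"));
    }
}
```

Under these assumptions both terms of that document come out further than 2 edits from the query terms, while "strang" is a single edit from "string".
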
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Pierre Antoine DuBoDeNa
>> Sent: Saturday, February 09, 2013 12:52 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: fuzzy queries
>>
>>
>> With a query like string~ matching~ (without specifying a threshold) I get 14
>> results back.
>>
>> Could it be a problem with the analyzers?
>>
>> Here is the code:
>>
>> private File indexDir = new File("/a-directory-here");
>>
>> private StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
>>
>> private IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
>>
>> public static void main(String[] args) throws Exception {
>>
>> IndexProfiles Indexer = new IndexProfiles();
>>
>> IndexWriter w = Indexer.CreateIndex();
>>
>> ArrayList<String> list = new ArrayList<String>();
>>
>> list.add("string matching");
>>
>> list.add("string123 matching");
>>
>> list.add("string matching123");
>>
>> list.add("string123 matching123");
>>
>> list.add("str4ing match2ing");
>>
>> list.add("1string 2matching");
>>
>> list.add("str_ing ma_tching");
>>
>> list.add("string_matching");
>>
>> list.add("strang mutching");
>>
>> list.add("strrring maatchinng");
>>
>> list.add("strfffing_ m atcbbhing");
>>
>> list.add("str2ing__mat3ching");
>>
>> list.add("string_m atching");
>>
>> list.add("string matching another token");
>>
>> list.add("strasding matc4hing ano23ther tok3en");
>>
>> list.add("str4ing maaatching_another 2t oken");
>>
>>
>>  for (String companyname:list)
>>
>> {
>>
>> Indexer.addSingleField(w, companyname);
>>
>> }
>>
>>
>>     int numDocs = w.numDocs();
>>
>>    System.out.println("# of Docs in Index: " + numDocs);
>>
>>    w.close();
>>
>>
>>            DoIndexQuery("string~ matching~");
>>
>>  }
>>
>> public static void DoIndexQuery(String query) throws IOException,
>> ParseException {
>>
>> IndexProfiles Indexer = new IndexProfiles();
>>
>>    IndexReader reader = Indexer.LoadIndex();
>>
>>
>>     Indexer.SearchIndex(reader, query, 50);
>>
>>
>>
>>    reader.close();
>>
>> }
>>
>>
>> public IndexWriter CreateIndex() throws IOException {
>>
>>
>>
>> Directory index = FSDirectory.open(indexDir);
>>
>> IndexWriter w = new IndexWriter(index, config);
>>
>> return w;
>>
>>
>>
>> }
>>
>>
>> public HashMap SearchIndex(IndexReader w, String query, int topk)
>> throws IOException, ParseException {
>>
>>
>>
>>
>> Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
>>
>>
>>
>> IndexSearcher searcher = new IndexSearcher(w);
>>
>> TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
>>
>> searcher.search(q, collector);
>>
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>
>>
>>
>> System.out.println("Found " + hits.length + " hits.");
>>
>> HashMap map = new HashMap();
>>
>> for(int i=0;i<hits.length;++i) {
>>
>>      int docId = hits[i].doc;
>>
>>      Document d = searcher.doc(docId);
>>
>>      map.put(docId, d.get("Name"));
>>
>>      System.out.println((i + 1) + ". " + d.get("Name"));
>>
>> }
>>
>>
>>         searcher.close();
>>
>>         return map;
>>
>>
>>
>> }
>>
>> public void addSingleField(IndexWriter w, String str) throws IOException {
>>
>>
>> Document doc = new Document();
>>
>> doc.add(new Field("Name", str, Field.Store.YES, Field.Index.ANALYZED));
>>
>> w.addDocument(doc);
>>
>> }
>>
>>
>>
>>
>>
>> 2013/2/9 Michael McCandless <lucene@mikemccandless.com>
>>
>>> Can you reduce your test case to indexing one document/field and
>>> running a single FuzzyQuery (you seem to be running two at once,
>>> OR'ing the results)?
>>>
>>> And show the complete standalone source code (eg what is topk?) so we
>>> can see how you are indexing / building the Query / searching.
>>>
>>> The default minSim is 0.5.
>>>
>>> Note that 0.01 is not useful in practice: it (should) match nearly all
>>> terms.  But I agree it's odd one term is not matching.
>>>
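On what the minSim number means in 3.x: as far as I recall, FuzzyQuery there scores a candidate term as 1 - editDistance / min(queryLength, termLength) (with the default prefix length of 0) and keeps it only if the score is strictly greater than minSim. Treat that formula as an assumption to check against FuzzyTermEnum in your version. The class and method names below are hypothetical and this is only a sketch of the arithmetic, with the edit distances supplied by hand.

```java
// Hypothetical sketch of the assumed Lucene 3.x fuzzy score (prefix length 0).
// Not Lucene code; verify the formula against your version before relying on it.
final class FuzzySimilaritySketch {

    // Assumed score: 1 - editDistance / min(queryLength, termLength).
    // Both lengths are expected to be greater than zero.
    static float similarity(int editDistance, int queryLength, int termLength) {
        return 1.0f - (float) editDistance / Math.min(queryLength, termLength);
    }

    // Assumed acceptance rule: the score must be strictly greater than minSim.
    static boolean passes(int editDistance, int queryLength, int termLength, float minSim) {
        return similarity(editDistance, queryLength, termLength) > minSim;
    }

    public static void main(String[] args) {
        float defaultMinSim = 0.5f; // default mentioned in this thread

        // 3 edits against the 8-character query term "matching", 9-character term.
        System.out.println(passes(3, 8, 9, defaultMinSim));
        // 3 edits against the 6-character query term "string", 9-character term.
        System.out.println(passes(3, 6, 9, defaultMinSim));
        // 7 edits against "matching" with a 15-character term, at minSim 0.01.
        System.out.println(passes(7, 8, 15, 0.01f));
    }
}
```

If the assumed formula holds, the same number of edits is judged differently depending on the shorter of the two lengths: 3 edits still pass at 0.5 against the 8-character "matching" but not against the 6-character "string", and a long single token can score below zero, in which case no positive threshold will match it.
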
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sat, Feb 9, 2013 at 5:20 AM, Pierre Antoine DuBoDeNa
>>> <padbdn@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> I use Lucene 3.6 and I am trying to use fuzzy queries so that I can match
>>> >> many more results.
>>> >>
>>> >> I am adding for example these strings:
>>> >>
>>> >>  list.add("string matching");
>>> >>
>>> >> list.add("string123 matching");
>>> >>
>>> >> list.add("string matching123");
>>> >>
>>> >> list.add("string123 matching123");
>>> >>
>>> >> list.add("str4ing match2ing");
>>> >>
>>> >> list.add("1string 2matching");
>>> >>
>>> >> list.add("str_ing ma_tching");
>>> >>
>>> >> list.add("string_matching");
>>> >>
>>> >> list.add("strang mutching");
>>> >>
>>> >> list.add("strrring maatchinng");
>>> >>
>>> >> list.add("strfffing_ m atcbbhing");
>>> >>
>>> >> list.add("str2ing__mat3ching");
>>> >>
>>> >> list.add("string_m atching");
>>> >>
>>> >> list.add("string matching another token");
>>> >>
>>> >> list.add("strasding matc4hing ano23ther tok3en");
>>> >>
>>> >> list.add("str4ing maaatching_another 2t oken");
>>> >>
>>> >>
>>> >>
>>> >> then i do a query:
>>> >>
>>> >>
>>> >> "string~0.01 matching~0.01"
>>> >>
>>> >>
>>> >> and I get back these results:
>>> >>
>>> >>
>>> >> Found 15 hits.
>>> >>
>>> >> 1. 1string 2matching
>>> >>
>>> >> 2. str_ing ma_tching
>>> >>
>>> >> 3. string_m atching
>>> >>
>>> >> 4. strang mutching
>>> >>
>>> >> 5. str4ing match2ing
>>> >>
>>> >> 6. strrring maatchinng
>>> >>
>>> >> 7. string matching
>>> >>
>>> >> 8. strasding matc4hing ano23ther tok3en
>>> >>
>>> >> 9. string matching another token
>>> >>
>>> >> 10. string matching123
>>> >>
>>> >> 11. string123 matching
>>> >>
>>> >> 12. strfffing_ m atcbbhing
>>> >>
>>> >> 13. string123 matching123
>>> >>
>>> >> 14. str4ing maaatching_another 2t oken
>>> >>
>>> >> 15. string_matching
>>> >>
>>> >> So only 1 result is missing (with threshold 0.01): str2ing__mat3ching. Any
>>> >> idea why? How can I extend the query to catch this one as well?
>>> >>
>>> >> Also, what's the default threshold for the ~ operator? Without specifying a
>>> >> threshold I get 14 results; string_matching and str2ing__mat3ching are
>>> >> missing this time.
>>> >>
>>> >> Here is the code for the queries
>>> >>
>>> >>
>>> >>  Query q = new QueryParser(Version.LUCENE_35, "Name", analyzer).parse(query);
>>> >>
>>> >>
>>> >>
>>> >>  IndexSearcher searcher = new IndexSearcher(w);
>>> >>
>>> >>  TopScoreDocCollector collector = TopScoreDocCollector.create(topk, true);
>>> >>
>>> >>  searcher.search(q, collector);
>>> >>
>>> >>  ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>> >>
>>> >>
>>> >> Thanks for the help.
>>> >>
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>
