Message-ID: <498341CA.2020506@gmail.com>
Date: Fri, 30 Jan 2009 13:07:06 -0500
From: Mark Miller
To: solr-user@lucene.apache.org
Subject: Re: query with stemming, prefix and fuzzy?
References: <497F3AD7.5070300@netcologne.de> <4981E9E4.3020100@gmail.com> <498317BE.5000206@netcologne.de>
In-Reply-To: <498317BE.5000206@netcologne.de>

Gert Brinkmann wrote:
> Thanks, Mark, for your answer,
>
> Mark Miller wrote:
>
>> Truncation queries and stemming are difficult partners. You likely have
>> to accept compromise. You can try using multiple fields like you are,
>
> I already have multiple fields, one per language, to be able to use
> different stemmers. Wouldn't become this too much?

Possibly. Especially if you are using norms with all of those fields.
Depends on your index though.

>> you can try indexing the full term at the same position as the stemmed
>> term,
>
> what does this mean "at the same position" and how could I do this?

Write a custom filter. Normally, for every term, its position is
incremented by 1 as the terms are broken out in tokenization. You can
change this and index terms at the same position using your own filter.
There are ramifications, because you are adding more terms to your
index, but it allows you to index multiple forms of a term at the same
position (so that phrase queries still work as expected).

>> or you can accept the weirdness that comes from matching on a
>> stemmed form (potentially very confusing for a user).
>
> Currently I think about dropping the stemming and only use
> prefix-search. But as highlighting does not work with a prefix "house*"
> this is a problem for me. The hint to use "house?*" instead does not
> work here.

That's because wildcard queries are also not highlightable right now. I
actually have somewhat of a solution to this that I'll work on soon
(I've gotten the groundwork for it in, or ready to be in, Lucene). No
guarantee on when or if it will be accepted in Solr, though.

>> In any case though, a queryparser that support fuzzyquery should not be
>> analyzing it. What parser are you using? If it is analyzing the fuzzy
>> syntax, it doesnt likely support it.
>
> I am using the following definitions (testing it with and without stemming):
>
>> [...]
>>     ignoreCase="true"
>>     words="stopwords_de_de.txt"
>>     enablePositionIncrements="true"
>>     />
>> [...]
>
> and, well, the parser? Where is the parser specified? Do you mean the
> request handler "qt" (that will be "standard", as I do not set it yet)?

That's odd. I'll have to look at this closer to be of help.

>> The prefix length determines how many terms are enumerated - with the
>
> Can the prefix length be set in Solr? I could not find such an option.

I don't think there is an option in Solr. Patches welcome, of course. It
would be a nice one: the default prefix length of 0 scales *very* poorly.

>> The latest trunk build on Lucene will let us switch fuzzy query to use a
>> constant score mode - this will eliminate the booleanquery and should
>> perform much better on a large index.
>> Solr already uses a constant score mode for Prefix and Wildcard queries.
>
> much better performance is always good. When will this feature be
> available in Solr?

Soon, I hope. Since wildcard and prefix are already constant score, it
only makes sense to make fuzzy query that way as well.

>> How big is your index? If its not that big, it may be odd that your
>> seeing things that slow (number of unique terms in the index will play a
>> large role).
>
> Well, the index currently contains about 5000 documents. These are
> HTML-pages, some of them are concatenated with PDF/DOCs (Downloads
> linked from the HTML-page) converted to text. The index data is about
> 11MB (optimized). So think, this is just a smaller index.

Yeah, sounds small. It's odd that you would see such slow performance.
It depends, though: you may still have a *lot* of unique terms in there.
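
Going back to the "same position" question above, here is a rough sketch
of what such a filter does conceptually. It is plain Python, not the
Lucene TokenFilter API: Token, toy_stem, with_stems and
absolute_positions are made-up names for illustration, and the "stemmer"
is a toy that only strips a trailing "s". The point is just the position
increment: 1 moves to the next position, 0 stacks a term on the previous
one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Token:
    term: str
    position_increment: int  # 1 = next position, 0 = same position as the previous token


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return term[:-1] if len(term) > 3 and term.endswith("s") else term


def with_stems(terms: Iterable[str],
               stem: Callable[[str], str] = toy_stem) -> List[Token]:
    """Emit each original term, plus its stem at the same position when it differs."""
    out: List[Token] = []
    for term in terms:
        out.append(Token(term, 1))         # original term advances the position
        stemmed = stem(term)
        if stemmed != term:
            out.append(Token(stemmed, 0))  # stem is stacked on the same position
    return out


def absolute_positions(tokens: Iterable[Token]) -> List[Tuple[str, int]]:
    """Turn position increments into absolute positions, starting at 0."""
    position = -1
    result: List[Tuple[str, int]] = []
    for token in tokens:
        position += token.position_increment
        result.append((token.term, position))
    return result


print(absolute_positions(with_stems(["red", "houses", "burn"])))
# [('red', 0), ('houses', 1), ('house', 1), ('burn', 2)]
```

Because "houses" and "house" both sit at position 1, a phrase like
"red house" and a phrase like "red houses" would each find their terms
at adjacent positions. The cost is the extra term per stemmed token.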