Date: Fri, 14 Jul 2006 18:13:28 +0200
From: "Tomi NA"
To: java-user@lucene.apache.org, "Otis Gospodnetic"
Subject: Re: accented characters, wildcards and other problems

On 7/13/06, Otis Gospodnetic wrote:
> Bok Tomi,
>
> What do you mean by "terms are misrepresented"? What should they be, and what are you seeing?

I mean that 3 of the 5 accented characters appear in the index with their accents displayed correctly, but the remaining 2 appear as characters I can't pronounce or even name. Somewhere along the line, some encoding/decoding step is assuming the wrong character encoding for the data.

Update: I've managed to solve the problem locally (when I index a test directory with accented characters on my ext3 partition), but when I try indexing a directory I access via a Samba mount, I'm stuck with the old problem again. It could be the iocharset setting, although there are 2 other encoding-related settings which might cause the problem.

> > What I'm not clear on is how can I see the problematic *terms* in the list of terms, but not the documents they're stored in?
>
> Are you saying that the content got indexed, but the file names did not?

I'm saying that I expect to see a list of indexed documents in the "documents" list, and I don't see the documents containing the problematic accented characters. However, I do see the terms with the problematic accented characters, although they are misrepresented.

> Out of curiosity (note my last name), I'm curious about what analyzer/tokenizer you're using. Is there an equivalent of the Porter stemmer for Croatian? I could use that.
> :)

I'm very new to the technology, so I'm using whatever Nutch uses by default. As far as the stemmer is concerned, I'd say that wildcards go a long way in providing the necessary functionality, probably even better than automatic stemming. However, I appreciate the fact that most users' minds don't come with an inbuilt regexp constructor. :)

As far as Croatian is concerned, a stemming database was completed just recently (by "completed" I mean "a usable language coverage") at the department of Croatian language studies (for want of a better word)... The problem is, however, that it's not publicly available. You see, when I pay the taxes out of which their salaries are paid, it doesn't seem to obligate them to produce value for me as their indirect investor. But that's something I'd like to say to their faces, with just a tad more feeling. ;)

t.n.a.