From: AHMET ARSLAN
Date: Fri, 31 Jul 2009 00:26:53 -0700 (PDT)
Subject: Re: Is there a list of "special" characters for standard analyzer?
To: java-user@lucene.apache.org

> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".

Punctuation = ("_"|"-"|"/"|"."|",")

> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?

":" is a special character in QueryParser (if you are using it). If you want to search for it, you need to escape it first. At index time this character is ignored, like the punctuation characters: the string ahmet:arslan will produce two tokens, ahmet and arslan. The tokenizer also breaks words at the "=" character, at both query and index time.

If you want to understand the behavior of StandardTokenizer, you need to look at the file StandardTokenizerImpl.jflex. It recognizes each of the following as one token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file defines these token types in a notation similar to regular expressions.
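To illustrate the escaping step on the QueryParser side, here is a minimal, self-contained sketch. Lucene's QueryParser has a static escape(String) method for this job; the class below is only a stand-in so the idea can be tried without Lucene on the classpath. The class name QueryEscapeSketch and the exact list of special characters are my assumptions (taken from the query syntax documentation as I remember it), so check them against your Lucene version.

```java
public class QueryEscapeSketch {
    // Assumed set of characters with special meaning in query syntax:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    // Note that "=" is not in this set; it has no meaning to the parser.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns the input with a backslash placed before every special character.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("ahmet:arslan")); // prints ahmet\:arslan
        System.out.println(escape("a=b"));          // prints a=b (nothing to escape)
    }
}
```

Escaping only stops the parser from reading ":" as a field separator. The analyzer still splits the text at ":" afterwards, so the escaped query should end up as the two terms ahmet and arslan.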
You can change the behavior of StandardTokenizer by editing StandardTokenizerImpl.jflex and generating StandardTokenizerImpl.java from it. There is also another jflex file, named WikipediaTokenizerImpl.jflex. By looking at it you can see how new token types can be added.

Ahmet