From: Erik Hatcher
Subject: Re: words with more than 1 hyphen ?
Date: Wed, 7 Dec 2005 15:24:56 -0500
To: java-user@lucene.apache.org

> 1. I modified the StandardTokenizer.jj file.
>
> Essentially, I added the following to StandardTokenizer.jj
> | )+"-"()+("-")*>

Is that the only change you made to the .jj file? Where did you put that exactly? Don't you need a * after the second ?

> 4. I was able to index and retrieve words like
> merry-go-round (as opposed to merry go round). So, I
> was quite happy.
> Now I want to get "merry-go-round" from the token
> stream. And that doesn't seem to work.
> Note that retrieving words with 1 hyphen seems to work,
> but 2 hyphens seem to be a problem.
>
> In getting the tokens from the stream, I get
> "Merry-go-r" and "ound" instead of "Merry-go-round"
> "editor-in-c" and "hief" instead of "editor-in-chief".
> This behaviour is so strange, and I don't know how
> the indexer and query processing know about "merry-go-round",
> and yet the TokenStream doesn't.

I think the missing * above explains what you're seeing.

> "green-monster" would work. But not words with more than
> one hyphen.

I'm surprised this one worked - maybe some other token in JavaCC is catching that?

JavaCC is perhaps overkill for what you want. If you don't need any of the other fancy analysis tricks that StandardTokenizer has, you could just use WhitespaceAnalyzer plus LowerCaseFilter and, voila, your hyphenated tokens would come right out.

> (By the way, currently, I convert a hyphenated word into a phrase,
> but to me, that seems like special casing hyphenated words, and I
> just want to stay away from special casing. People have been asking
> for all sorts of punctuation, such as _ or / etc. I thought that
> if I learn how to modify the .jj files and produce the right
> tokens, I am better off.

Unless you need the other features of StandardTokenizer, you may be best staying away from JavaCC altogether. It is its own complex world that might be more than what you need.

Erik
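
For illustration, here is a rough plain-Java sketch of what the whitespace-plus-lowercase approach suggested above would do to hyphenated words. It deliberately does not use Lucene classes; the class name `HyphenTokenSketch` and the helper `tokenize` are hypothetical names made up for this example, and the sketch only mimics the behaviour (split on whitespace, then lowercase) rather than reproducing Lucene's analyzer API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this does not use or reproduce Lucene's API.
public class HyphenTokenSketch {

    // Hypothetical helper: split on runs of whitespace, then lowercase each piece.
    // Hyphens are never treated as separators, so "merry-go-round" stays one token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // blank input yields one empty piece; skip it
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [merry-go-round, and, the, editor-in-chief]
        System.out.println(tokenize("Merry-go-round and the Editor-in-chief"));
    }
}
```

The trade-off is that splitting on whitespace alone keeps every non-space character, so trailing punctuation (a comma after "round", say) stays attached to the token. That is the sort of thing StandardTokenizer handles and this simpler approach does not.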