Message-ID: <4C4212A3.6070508@webscio.net>
Date: Sat, 17 Jul 2010 21:29:23 +0100
From: Martin
To: java-user@lucene.apache.org
Subject: A full-text tokenizer for the NGramTokenFilter

Hi there,

I have recently been trying to build a Lucene index out of n-grams and seem to have stumbled onto a number of issues. I first tried the NGramTokenizer, but it apparently tokenizes only the first 1024 characters of the input. Searching the web, I found this issue discussed a couple of years ago, and the solution proposed there was to use the NGramTokenFilter instead.

That filter certainly works, but it needs an underlying tokenizer to feed it, and I'm wondering whether there is a tokenizer that would return the whole text as a single token. The reason I can't use something like the StandardTokenizer is that the n-grams should really include spaces, and pretty much every tokenizer gets rid of them.

Thank you very much in advance for any suggestions.

Regards,
Martin
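
To make the request concrete, here is a minimal sketch of the desired output in plain Java. It does not use Lucene; the class name NGramSketch and the method ngrams are hypothetical and only illustrate character n-grams taken over the entire text, with no length cap and with spaces preserved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: NOT Lucene's API. Names here are hypothetical.
final class NGramSketch {

    /**
     * Returns every character n-gram of length minGram..maxGram over the
     * whole input. Whitespace is treated like any other character.
     */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n across the full text.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Print each gram in quotes so the embedded spaces stay visible.
        for (String gram : ngrams("to be", 2, 3)) {
            System.out.println("\"" + gram + "\"");
        }
    }
}
```

For the input "to be" with gram sizes 2 to 3, this yields "to", "o ", " b", "be", "to ", "o b", " be". A whitespace-splitting tokenizer placed in front of an n-gram filter could never produce the grams that span the space.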