Message-ID: <4C4212A3.6070508@webscio.net>
Date: Sat, 17 Jul 2010 21:29:23 +0100
From: Martin <martin@webscio.net>
To: java-user@lucene.apache.org
Subject: A full-text tokenizer for the NGramTokenFilter

Hi there,

I have recently been trying to build a Lucene index out of ngrams and
seem to have stumbled onto a number of issues. I first tried the
NGramTokenizer, but it apparently only tokenizes the first 1024
characters of the input. Searching around the web, I found this issue
discussed a couple of years ago, and the solution proposed there was to
use the NGramTokenFilter instead. That filter certainly works, but it
needs an underlying tokenizer, and I'm wondering whether there is a
tokenizer that would return the whole text as a single token. The
reason I can't use something like the StandardTokenizer is that the
ngrams should really include spaces, and pretty much every tokenizer
gets rid of them.
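
To make clear what output I am after, here is a minimal sketch in plain
Java. It does not use the Lucene API at all; the class CharNGrams and
the method ngrams are names made up for this illustration. It only
shows the behaviour I would like from the analysis chain: character
ngrams over the whole text, with spaces kept and no length cap.

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {
    // Hypothetical helper, not part of Lucene: returns every character
    // ngram of length n, treating spaces like any other character.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        // Slide a window of n characters across the entire input.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the six trigrams of the sample text, spaces included.
        System.out.println(ngrams("to be or", 3));
    }
}
```

For the input "to be or" with n = 3, the trigrams are "to ", "o b",
" be", "be ", "e o" and " or", so the word boundaries survive inside
the ngrams instead of being stripped.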

Thank you very much in advance for any suggestions.

Regards,
Martin
