Date: Fri, 12 Mar 2004 02:54:22 +0100
From: karl wettin
To: "Lucene Developers List"
Subject: Re: N-gram layer

On Sun, 1 Feb 2004 13:12:32 -0800 (PST) Otis Gospodnetic wrote:

> Looking forward to the contribution.

Sorry for the delay. I've had a heavy workload lately, and then I moved apartments.
I'm back and ready to spend some time on this.

I gave up on detecting the language of a query. It is certainly possible, and I got great results with Weka, but it takes too much time: 5-50 seconds on my Pentium M.

However, I'm still working on the "autoanalytic stemmer", at least in my head. I've started to feed my index with documents tagged with their language in a field. The idea is to analyze (still with the n-gram approach) all the words of a specific language in order to find stemming rules for each and every language. The output can be used for per-language stemming, but hopefully I'll also be able to use this data to create my generic stemmer.

The language models and the inflectional-form extraction should be based on the index content, but I can't find out how to access the terms of a specific set of documents. Of course, I could just query my index and start working on the data, building my own trie pattern, but I'm sure I don't have to.

I've been browsing the list archives and the API for several days without finding out how to iterate over the (distinct/unique) terms of the index, or of a specific set of documents. How do I do that?

--
karl
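
To make the question concrete, the per-language term collection described above could look roughly like this outside of Lucene. This is only a sketch in plain Java: the class LanguageTerms, its two methods, and the (language, text) pair layout are made up for illustration and are not part of the Lucene API, and the tokenization is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration; none of these names come from Lucene. */
final class LanguageTerms {

    /**
     * Groups the distinct, lower-cased terms of each document by its language tag.
     * Each document is assumed to be a pair: doc[0] = language tag, doc[1] = text.
     */
    static Map<String, SortedSet<String>> distinctTermsByLanguage(List<String[]> docs) {
        Map<String, SortedSet<String>> byLanguage = new TreeMap<>();
        for (String[] doc : docs) {
            String language = doc[0];
            String text = doc[1];
            SortedSet<String> terms = byLanguage.get(language);
            if (terms == null) {
                terms = new TreeSet<>();
                byLanguage.put(language, terms);
            }
            // Naive tokenization: split on anything that is not a letter.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                }
            }
        }
        return byLanguage;
    }

    /** Returns every character n-gram of the term, in order; empty if the term is shorter than n. */
    static List<String> charNGrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }
}
```

Running charNGrams over each language's term set would give the raw n-grams from which suffix patterns could be mined. What is still missing is a way to get those term sets straight from the index instead of re-tokenizing the text.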