Date: Tue, 8 Feb 2011 17:50:56 +0100
Subject: [ANNOUNCEMENT] NLP-based Analyzer library for Lucene
From: Lars Buitinck
To: Apache Lucene users

Dear all,

For anyone wanting to add some NLP abilities to Lucene, I've released a
small library at https://github.com/larsmans/lucene-stanford-lemmatizer .

This library performs part-of-speech tagging (determining word categories
such as noun or verb), filtering based on part of speech, and lemmatizing
(reducing words to their base form). In other words: this is an NLP-based
replacement for a stemmer and a stop list, implemented as a Lucene analyzer.
It requires the Stanford POS Tagger.

lucene-stanford-lemmatizer can be used to index or query lemmas as well as
the terms as they appear in text, and/or to filter out terms before
indexing/querying based on their part of speech. By default, it filters out
pronouns, determiners (the, a) and several other non-informative word
categories.

I've seen this code improve search quality, even on very noisy data. The
software is designed for English, but does a pretty good job at detecting
non-English words and leaving those alone (in contrast to the
Porter/Snowball stemmer).

Regards,
Lars Buitinck
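
To make the idea of part-of-speech filtering plus lemmatizing concrete, here is a minimal, self-contained sketch. It does not use the library's API, Lucene, or the Stanford POS Tagger: the class and method names (PosFilterSketch, TaggedToken, indexTerms) are hypothetical, the tags and lemmas are supplied by hand using Penn Treebank tag names, and the drop list is only an assumption loosely modelled on the default described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosFilterSketch {

    /** A token with a hand-supplied Penn Treebank tag and lemma (no real tagger involved). */
    static final class TaggedToken {
        final String term;
        final String tag;
        final String lemma;

        TaggedToken(String term, String tag, String lemma) {
            this.term = term;
            this.tag = tag;
            this.lemma = lemma;
        }
    }

    // Assumed drop list: determiners (DT), personal (PRP) and possessive (PRP$) pronouns.
    static final Set<String> DROPPED_TAGS =
            new HashSet<>(Arrays.asList("DT", "PRP", "PRP$"));

    /** Returns the lemmas of all tokens whose tag is not on the drop list. */
    static List<String> indexTerms(List<TaggedToken> tokens) {
        List<String> out = new ArrayList<>();
        for (TaggedToken t : tokens) {
            if (DROPPED_TAGS.contains(t.tag)) {
                continue; // plays the role of a stop list, but keyed on word category
            }
            out.add(t.lemma); // plays the role of a stemmer, but yields a real base form
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> sentence = Arrays.asList(
                new TaggedToken("The", "DT", "the"),
                new TaggedToken("dogs", "NNS", "dog"),
                new TaggedToken("chased", "VBD", "chase"),
                new TaggedToken("it", "PRP", "it"));
        System.out.println(indexTerms(sentence)); // prints [dog, chase]
    }
}
```

In a real analyzer the tags and lemmas would come from the tagger at analysis time rather than being written out by hand; the sketch only shows the filter-then-lemmatize step.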