Date: Tue, 6 Oct 2009 19:42:23 +0800
Message-ID: <8680aede0910060442o5a0b6058rc30139af13b90265@mail.gmail.com>
Subject: Phrase Extraction, mainly for English
From: Andrew Zhang
To: java-user@lucene.apache.org

Hi guys,

The requirement is simple. Take this sentence: 'The NBA formally announced its new *social media* guidelines Wednesday'. I want to treat '*social media*' as a single phrase term.

The default English analyzers that come with Lucene all work on single words, so if you want the most frequent terms, *social* and *media* are counted separately, and neither of them conveys the meaning of *social media* on its own, right?

I know there is an approach based on a phrase dictionary, which matches text against phrases already in the dictionary, much like what is done for Chinese. But is there an open source solution for English? I don't want to build a phrase dictionary myself, and I would like a smart approach that can "discover" phrases automatically.

I have 2 million docs analyzed the normal way, all as single terms, which I can use as a base source. It should be possible to find that *social* and *media* frequently occur together, but I really don't know how to go in that reverse direction, from single terms back to phrases.

I tried to find some phrase analyzers, but had no luck. Any advice?

Regards,
Andrew

-- 
Simple is best
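
The observation that *social* and *media* frequently occur together can be turned into a rough phrase-discovery pass by counting adjacent word pairs and scoring them with pointwise mutual information (PMI). The sketch below is only an illustration under several assumptions: it uses plain Java collections rather than any Lucene API, the class name `PhraseDiscovery` and the method `scoreBigrams` are hypothetical, tokenization is a naive lowercase split on non-letters, and the `minCount` threshold is an arbitrary knob that would need tuning on a real corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: scores adjacent word pairs by PMI. */
class PhraseDiscovery {

    /** Returns "w1 w2" -> PMI for every adjacent pair seen at least minCount times. */
    static Map<String, Double> scoreBigrams(List<String> docs, int minCount) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        long totalTokens = 0;
        long totalBigrams = 0;

        for (String doc : docs) {
            // Naive tokenization (an assumption): lowercase, split on non-letters.
            List<String> tokens = new ArrayList<>();
            for (String t : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            for (int i = 0; i < tokens.size(); i++) {
                unigrams.merge(tokens.get(i), 1, Integer::sum);
                totalTokens++;
                if (i + 1 < tokens.size()) {
                    String pair = tokens.get(i) + " " + tokens.get(i + 1);
                    bigrams.merge(pair, 1, Integer::sum);
                    totalBigrams++;
                }
            }
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            if (e.getValue() < minCount) {
                continue; // rare pairs give unreliable PMI estimates
            }
            String[] w = e.getKey().split(" ");
            double pPair = e.getValue() / (double) totalBigrams;
            double p1 = unigrams.get(w[0]) / (double) totalTokens;
            double p2 = unigrams.get(w[1]) / (double) totalTokens;
            // PMI = log( P(w1 w2) / (P(w1) * P(w2)) )
            scores.put(e.getKey(), Math.log(pPair / (p1 * p2)));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The NBA announced social media guidelines",
            "Social media is changing the NBA",
            "The media reported the social event");
        // Print every pair that occurs at least twice, with its PMI score.
        scoreBigrams(docs, 2).forEach((pair, pmi) -> System.out.println(pair + "\t" + pmi));
    }
}
```

Pairs with a high score and a reasonable count are candidate phrases. On a real corpus, very frequent function-word pairs can still rank highly, so a stop-word filter or a different association measure may be needed on top of this.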