From: Vasudevan Comandur
To: java-user@lucene.apache.org
Subject: Re: Phase Extraction, mainly for English
Date: Tue, 6 Oct 2009 17:40:54 +0530
In-Reply-To: <8680aede0910060442o5a0b6058rc30139af13b90265@mail.gmail.com>

Hi,

Take the NLP route and use modules such as a POS tagger and an NP chunker.
OpenNLP has a stack of these for English. Try using them.

Regards
Vasu

On Tue, Oct 6, 2009 at 5:12 PM, Andrew Zhang wrote:

> Hi guys,
>
> The requirement is very simple here. For example, in the sentence 'The NBA
> formally announced its new *social media* guidelines Wednesday', I want to
> treat '*social media*' as a single phrase term. The default English analyzers
> that come with Lucene all deal with single words, so if you want to get the
> most frequent terms, *social* and *media* are separated, and neither of them
> carries the meaning that *social media* does, right?
>
> I know there's an approach built on a phrase dictionary that tries to match
> phrases already in it, much like what is done for Chinese, but is there an
> open source solution for English? I mean, I don't want to build a phrase
> dictionary myself, and I also want a smart way that can "discover"
> the phrases automatically.
> I have 2 million docs analyzed the normal way, all
> single terms, which I can use as a base source, and it's possible to find
> that *social media* appears together frequently, but I really don't know
> how to go the reverse way.
>
> I tried to find some phrase analyzers, but no luck. Any advice?
>
> Regards,
> Andrew
> --
> Simple is best
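
On the "reverse way" asked about in the quoted message (going from single-term statistics back to phrases): one common statistical option, separate from the POS tagger / NP chunker route, is to count adjacent word pairs and rank them by pointwise mutual information (PMI). Below is a minimal sketch in plain Java under some assumptions: the class name PhraseCandidates, the tokenize helper and the minCount threshold are illustrative and are not part of Lucene or OpenNLP, and the tokenizer is only a stand-in for whatever analyzer was used on the real documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseCandidates {

    // Hypothetical stand-in for a real analyzer: lowercase, split on non-letters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the PMI score of every adjacent word pair seen at least minCount times.
    // Pairs are counted within a document only, never across document boundaries.
    public static Map<String, Double> scoreBigrams(List<String> docs, int minCount) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        long totalTokens = 0;
        long totalBigrams = 0;
        for (String doc : docs) {
            List<String> toks = tokenize(doc);
            for (int i = 0; i < toks.size(); i++) {
                unigrams.merge(toks.get(i), 1, Integer::sum);
                totalTokens++;
                if (i + 1 < toks.size()) {
                    bigrams.merge(toks.get(i) + " " + toks.get(i + 1), 1, Integer::sum);
                    totalBigrams++;
                }
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            if (e.getValue() < minCount) {
                continue; // rare pairs give unreliable PMI estimates
            }
            String[] w = e.getKey().split(" ");
            double pPair = (double) e.getValue() / totalBigrams;
            double pFirst = (double) unigrams.get(w[0]) / totalTokens;
            double pSecond = (double) unigrams.get(w[1]) / totalTokens;
            // PMI = log( p(a,b) / (p(a) * p(b)) ); higher means the words co-occur
            // more often than their individual frequencies would suggest.
            scores.put(e.getKey(), Math.log(pPair / (pFirst * pSecond)));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The NBA formally announced its new social media guidelines Wednesday",
            "Teams must follow the social media guidelines",
            "The media covered the new guidelines");
        // Print candidate phrases, best score first.
        scoreBigrams(docs, 2).entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> System.out.printf("%-20s %.3f%n", e.getKey(), e.getValue()));
    }
}
```

On a large collection you would probably also drop pairs containing stop words and raise minCount, since PMI tends to overrate rare pairs. The surviving high-scoring pairs can then serve as the phrase dictionary you did not want to build by hand.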