Subject: Re: Tf and Df in lucene
From: Erick Erickson
To: java-user
Date: Mon, 15 Jun 2015 08:10:04 -0700

In a word, no. Terms are, by definition, whatever a "token" is. Tokens
are delimited by, say, the WhitespaceTokenizer, so a priori a single term
can't span several words the way you want.

Unless... you do "something special". In this case, "something special"
would be to use shingles (see ShingleFilter in Lucene or
ShingleFilterFactory in Solr). That will make your index bigger, but it
will put things like free_speech_zones in your index as a single token,
which would then allow you to get what you're asking for.

Best,
Erick

On Mon, Jun 15, 2015 at 7:49 AM, Shay Hummel wrote:
> Hi Ahmet
>
> Thank you for the reply.
> Can the term reflect a multi-word expression?
> For example:
> I want to find the term frequency \ document frequency of "united states"
> (two terms) or "free speech zones" (three terms).
>
> Shay
>
> On Mon, Jun 15, 2015 at 4:55 PM Ahmet Arslan wrote:
>
>> Hi Hummel,
>>
>> regarding df,
>>
>> Term term = new Term(field, word);
>> TermStatistics termStatistics = searcher.termStatistics(term,
>>     TermContext.build(reader.getContext(), term));
>> System.out.println(term + "\t totalTermFreq \t " +
>>     termStatistics.totalTermFreq());
>> System.out.println(term + "\t docFreq \t " + termStatistics.docFreq());
>>
>> regarding tf,
>>
>> Term term = new Term(field, word);
>> Bits bits = MultiFields.getLiveDocs(reader);
>> PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, bits,
>>     field, term.bytes());
>>
>> if (postingsEnum == null) return;
>>
>> int max = 0;
>> while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
>>     final int freq = postingsEnum.freq();  // tf(t, d) for the current document
>>     int docID = postingsEnum.docID();      // the document that freq belongs to
>> }
>>
>>
>> Ahmet
>>
>>
>> On Monday, June 15, 2015 9:12 AM, Shay Hummel wrote:
>> Hi
>>
>> I was wondering, what is the easiest way to get the term frequency of a
>> term t in document d, namely tf(t,d)?
>> In the same spirit, what is the easiest way to get the document
>> frequency of a term in the collection, i.e. how many documents contain
>> the term t, namely df(t)?
>>
>> Regards,
>> Shay
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
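
As a rough illustration of the shingle idea described at the top of this
message, here is a minimal plain-Java sketch. It does not use Lucene and
is not how ShingleFilter is implemented. The class ShingleSketch, its
methods, and the sample documents are all made up for this example, and
joining words with '_' is only an assumption that mirrors the
free_speech_zones spelling above. It splits on whitespace, emits every
run of minSize to maxSize adjacent words as one token, and then counts
tf and df over those tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene's ShingleFilter.
final class ShingleSketch {

    // Split on whitespace and emit every run of minSize..maxSize adjacent
    // words, joined by '_' (the separator is an assumption of this sketch).
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int n = minSize; n <= maxSize && start + n <= words.length; n++) {
                out.add(String.join("_", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return out;
    }

    // tf(t, d): how many times the shingle occurs in one document.
    static int tf(String shingle, String doc, int minSize, int maxSize) {
        int count = 0;
        for (String s : shingles(doc, minSize, maxSize)) {
            if (s.equals(shingle)) {
                count++;
            }
        }
        return count;
    }

    // df(t): how many documents contain the shingle at least once.
    static int df(String shingle, List<String> docs, int minSize, int maxSize) {
        int count = 0;
        for (String doc : docs) {
            if (tf(shingle, doc, minSize, maxSize) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample documents.
        List<String> docs = Arrays.asList(
            "the united states and the united states senate",
            "free speech zones in the united states",
            "no match here");
        System.out.println(tf("united_states", docs.get(0), 2, 3));  // prints 2
        System.out.println(df("united_states", docs, 2, 3));         // prints 2
        System.out.println(df("free_speech_zones", docs, 2, 3));     // prints 1
    }
}
```

Matching in this sketch is exact and case-sensitive. Against a real
index, the term statistics calls quoted above should give the same kind
of counts once the field has been analyzed with shingles, since each
shingle is then an ordinary term.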