Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAGDa55ivVt89aUViof3=9OzN6EuBMFBFfq11QOJFMr=cOBQX-g@mail.gmail.com>
References: 
 <CAGDa55gmrkBNc1bB7Mcu=EU++nB25-v3dkSb5MZVg9EH3yjcCg@mail.gmail.com>
	<1272846838.3077987.1434376472830.JavaMail.yahoo@mail.yahoo.com>
	<CAGDa55ivVt89aUViof3=9OzN6EuBMFBFfq11QOJFMr=cOBQX-g@mail.gmail.com>
Date: Mon, 15 Jun 2015 08:10:04 -0700
Message-ID: 
 <CAN4YXvcx+hM1W9PqN3VC67bCmb6zXZS18xG274pwJn=F3UFUqA@mail.gmail.com>
Subject: Re: Tf and Df in lucene
From: Erick Erickson <erickerickson@gmail.com>
To: java-user <java-user@lucene.apache.org>
Content-Type: text/plain; charset=UTF-8

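In a word, no. A term is, by definition, a single token, and tokens are
whatever the tokenizer emits. WhitespaceTokenizer, for example, splits on
whitespace, so a multi-word phrase never exists as one term in the index
and a priori you can't get term statistics for it.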

Unless... you do "something special". In this case, "something special"
would be indexing shingles (see ShingleFilter in Lucene or
ShingleFilterFactory in Solr). That will make your index bigger,
but it will put things like free_speech_zones in your index as a
single token, which would then let you get what you're asking
for.
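
To make the idea concrete, here is a rough plain-Java sketch of what
shingling buys you. It does not use Lucene at all: the class ShingleStats,
its method names, the lowercasing and the "_" separator are all made up for
illustration, and ShingleFilter's real output depends on how you configure
it. The point is only that once adjacent words are glued into one token,
tf and df for a phrase are counted like any other term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java illustration of word shingles; this is NOT Lucene's ShingleFilter. */
final class ShingleStats {

    /** Splits on whitespace and emits every run of minSize..maxSize adjacent words as one token. */
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int n = minSize; n <= maxSize && start + n <= words.length; n++) {
                // "_" as the joiner is an assumption that mirrors free_speech_zones above
                out.add(String.join("_", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return out;
    }

    /** tf(t,d): how often the shingle occurs in a single document. */
    static int termFreq(String shingle, String doc, int maxSize) {
        int count = 0;
        for (String token : shingles(doc, 1, maxSize)) {
            if (token.equals(shingle)) {
                count++;
            }
        }
        return count;
    }

    /** df(t): how many documents contain the shingle at least once. */
    static int docFreq(String shingle, List<String> docs, int maxSize) {
        int count = 0;
        for (String doc : docs) {
            if (termFreq(shingle, doc, maxSize) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "free speech zones are free speech zones",
                "the united states",
                "speech zones");
        System.out.println("tf(free_speech_zones, doc 0) = "
                + termFreq("free_speech_zones", docs.get(0), 3));
        System.out.println("df(speech_zones) = " + docFreq("speech_zones", docs, 3));
    }
}
```

In a real index you would not recount anything like this. With shingles
indexed, the phrase is just another term, so the per-term statistics calls
quoted below should apply to it unchanged.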

Best,
Erick

On Mon, Jun 15, 2015 at 7:49 AM, Shay Hummel <shay.hummel@gmail.com> wrote:
> Hi Ahmet
>
> Thank you for the reply.
> Can a term reflect a multi-word expression?
> For example:
> I want to find the term frequency / document frequency of "united states"
> (two terms) or "free speech zones" (three terms).
>
> Shay
>
> On Mon, Jun 15, 2015 at 4:55 PM Ahmet Arslan <iorixxx@yahoo.com.invalid>
> wrote:
>
>> Hi Hummel,
>>
>> regarding df,
>>
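>> // collection-level statistics for one term in one field
>> Term term = new Term(field, word);
>> TermStatistics termStatistics = searcher.termStatistics(term,
>>     TermContext.build(reader.getContext(), term));
>> // totalTermFreq: occurrences of the term across all documents
>> System.out.println(word + "\t totalTermFreq \t " +
>>     termStatistics.totalTermFreq());
>> // docFreq: number of documents containing the term, i.e. df(t)
>> System.out.println(word + "\t docFreq \t " + termStatistics.docFreq());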
>>
>> regarding tf,
>>
>> Term term = new Term(field, word);
>> Bits bits = MultiFields.getLiveDocs(reader);
>> PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, bits,
>>     field, term.bytes());
>>
>> // null means the field or the term does not exist in the index
>> if (postingsEnum == null) return;
>>
>> while (postingsEnum.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
>>   int docID = postingsEnum.docID();
>>   int freq = postingsEnum.freq();  // tf(t,d) for the current document
>>   System.out.println(docID + "\t tf \t " + freq);
>> }
>>
>>
>> Ahmet
>>
>>
>>
>>
>> On Monday, June 15, 2015 9:12 AM, Shay Hummel <shay.hummel@gmail.com>
>> wrote:
>> Hi
>>
>> I was wondering, what is the easiest way to get the term frequency of a
>> term t in document d, namely tf(t,d)?
>> In the same spirit, what is the easiest way to get the document
>> frequency of a term in the collection, i.e. how many documents contain
>> the term t, namely df(t)?
>>
>> Regards,
>> Shay
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
