lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Cunningham <>
Subject Re: Word co-occurrences counts
Date Fri, 24 Dec 2004 05:40:05 GMT
Thanks Doug and all,

I'm intending to use Lucene to grab a lot of word co-occurance 
statistics out of a large corpus
to perform word disambiguation. Lucene's looking like a great option, 
but I appear to have hit
a snag. Here's my understanding:

1) Create a Similarity implementation, where:
        tf() returns freq
    sloppyFreq, idf, coord, return 1 (cause we only need to freq to score)
2) Perform the query
3) and then:
        word in document count = 
4) A query call such as
        "computer dog"~50
    will return a count of 2 (I assume because the match occurs 
backwards and forwards).

My problem occurs when I have the following in a text file:
    computer ...(some words)... dog ...(some words)... computer
and I duplicate the text file several times over. Performing a the above 
query will return different
phrase counts per document?

Note: I'm just working with some modified demo code at the moment.

Thanks again,

Doug Cutting wrote:

> Andrew Cunningham wrote:
>> "computer dog"~50 looks like what I'm after - now is there someway I 
>> can call this and pull
>> out the number of total occurances, not just the number of documents 
>> hits? (say if computer
>> and dog occur near each other several times in the same document).
> You could use a custom Similarity implementation for this query, where 
> tf() is the identity function, idf() returns 1.0, etc., so that the 
> final score is the occurance count.  You'll need to divide by 
> Similarity.decodeNorm(indexReader.norms("field")[doc]) at the end to 
> get rid of the lengthNorm() and field boost (if any).
> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message