lucene-dev mailing list archives

From "Tom Burton-West (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
Date Wed, 12 May 2010 22:17:42 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated LUCENE-2393:
------------------------------------

    Attachment: LUCENE-2393.patch

Rewrote argument processing so that the default behavior matches that of HighFreqTerms.  The
field and the number of terms are now both optional, defaulting to all fields and 100 terms
(the same default as the current HighFreqTerms).  If the -t flag is used, the totalTermFreq
stats are read, calculated, and displayed.
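
A minimal sketch of argument handling with those defaults, for illustration only. The class
name, the positional order (index directory first), and the rule that an all-digit argument
is the term count are assumptions made here; they are not taken from LUCENE-2393.patch.

```java
// Hypothetical sketch: names and argument order are illustrative assumptions,
// not the code in LUCENE-2393.patch.
final class TermStatsOptions {
    static final int DEFAULT_NUM_TERMS = 100; // same default as HighFreqTerms

    final String indexDir;
    final String field;          // null means "all fields"
    final int numTerms;
    final boolean totalTermFreq; // true when -t was given

    private TermStatsOptions(String indexDir, String field, int numTerms, boolean totalTermFreq) {
        this.indexDir = indexDir;
        this.field = field;
        this.numTerms = numTerms;
        this.totalTermFreq = totalTermFreq;
    }

    static TermStatsOptions parse(String[] args) {
        String indexDir = null;
        String field = null;
        int numTerms = DEFAULT_NUM_TERMS;
        boolean totalTermFreq = false;

        for (String arg : args) {
            if (arg.equals("-t")) {
                totalTermFreq = true;                 // flag may appear anywhere
            } else if (indexDir == null) {
                indexDir = arg;                       // first positional argument
            } else if (arg.matches("\\d+")) {
                numTerms = Integer.parseInt(arg);     // all-digit argument: term count
            } else if (field == null) {
                field = arg;                          // anything else: field name
            } else {
                throw new IllegalArgumentException("Unexpected argument: " + arg);
            }
        }
        if (indexDir == null) {
            throw new IllegalArgumentException("usage: <indexDir> [-t] [numTerms] [field]");
        }
        return new TermStatsOptions(indexDir, field, numTerms, totalTermFreq);
    }
}
```

One limitation of this hand-rolled approach is that a field whose name consists only of
digits would be read as the term count, which is the kind of ambiguity a Getopt-style
library resolves with named options.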

The bug surfaced when no field was specified.  Added test data with multiple fields, plus
tests that check that correct results are returned both with and without a field specified.
The bug is fixed and the new tests pass.

With the increasing number of options, I started thinking about more robust command-line
argument processing.  I'm used to languages that have a commonly used Getopt(s) library.
There appear to be several for Java, with different features, different levels of active
development, and different licenses.  Is it worth the overhead of using one, and if so, which
one would be the best to use?

Tom


> Utility to output total term frequency and df from a lucene index
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2393
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Tom Burton-West
>            Priority: Trivial
>         Attachments: LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch,
>                      LUCENE-2393.patch, LUCENE-2393.patch, LUCENE-2393.patch
>
>
> This is a pair of command-line utilities that provide information on the total number
> of occurrences of a term in a Lucene index.  The first takes a field name, term, and index
> directory and outputs the document frequency for the term and the total number of occurrences
> of the term in the index (i.e. the sum of the tf of the term for each document).  The second
> reads the index to determine the top N most frequent terms (by document frequency) and then
> outputs a list of those terms along with the document frequency and the total number of
> occurrences of each term.  Both utilities are useful for estimating the size of the term's
> entry in the *prx files and the consequent disk I/O demands.
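
The two statistics in the description can be illustrated with a small sketch. It deliberately
uses no Lucene API: documents are plain in-memory token lists, and the class and method names
are hypothetical. It only shows how document frequency differs from total term frequency
(the sum of tf over all documents) and how the top N terms by document frequency are
selected. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration over in-memory token lists; not the patch's code.
final class TermStatsSketch {
    static final class TermStats {
        final String term;
        int docFreq;        // number of documents containing the term
        long totalTermFreq; // sum of tf over all documents

        TermStats(String term) {
            this.term = term;
        }
    }

    // Each inner list holds the tokens of one document.
    static Map<String, TermStats> collect(List<List<String>> docs) {
        Map<String, TermStats> stats = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> seenInDoc = new HashSet<>();
            for (String token : doc) {
                TermStats s = stats.computeIfAbsent(token, TermStats::new);
                s.totalTermFreq++;          // every occurrence counts
                if (seenInDoc.add(token)) {
                    s.docFreq++;            // only the first occurrence per document counts
                }
            }
        }
        return stats;
    }

    // Top n terms by document frequency, ties broken alphabetically.
    static List<TermStats> topByDocFreq(Map<String, TermStats> stats, int n) {
        List<TermStats> sorted = new ArrayList<>(stats.values());
        sorted.sort(Comparator.comparingInt((TermStats s) -> s.docFreq).reversed()
                .thenComparing(s -> s.term));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

A term with a modest document frequency can still have a large total term frequency if it
repeats many times within each document, which is why the description reports both numbers.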

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

