lucene-java-user mailing list archives

From mark harwood <>
Subject Re: SIPs and CAPs
Date Thu, 14 Jul 2005 13:27:46 GMT
> Do you just do this with terms or do you also
> extract phrases?   

The scheme involves these phases:
1) Identify top terms (using algo described)
2) Identify all term "runs" in original text.
3) Identify sensible phrases from the large list of
term runs
4) Provide shortlist of top scoring terms AND phrases

Step 1 is done as described in my earlier post.
Step 2 I currently do by re-running an Analyzer on the
original text. It is possible that this could be done
using the RAMDirectory from Step 1 and SpanQueries
or some such, but I have found it is important to go
back to the original text to get sensible phrases.
If your indexed content used stemming and stop-word
removal and you *didn't* look at the original text, you
would identify phrases like "united state america"
instead of "United States of America".
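A minimal sketch of the run-finding idea, without a real
Analyzer (the simple split/lowercase/de-pluralize below
stands in for whatever stemming the index used, and the
class and method names are mine, not Mark's code):

```java
import java.util.*;

public class RunFinder {
    // Stop words may sit *inside* a run ("United States of America")
    // as long as top terms anchor both ends of the run.
    static final Set<String> STOP = Set.of("of", "the", "a", "an", "in");

    // Find maximal runs of top terms in the original, unstemmed text.
    static List<String> findRuns(String originalText, Set<String> topTerms) {
        String[] tokens = originalText.split("\\s+");
        List<String> runs = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!isTop(tokens[i], topTerms)) { i++; continue; }
            int start = i, end = i;
            int j = i + 1;
            while (j < tokens.length) {
                if (isTop(tokens[j], topTerms)) { end = j; j++; }
                else if (STOP.contains(tokens[j].toLowerCase())) { j++; } // bridge stop words
                else break;
            }
            if (end > start) { // keep multi-term runs; lone terms are scored separately
                runs.add(String.join(" ",
                        Arrays.copyOfRange(tokens, start, end + 1)));
            }
            i = end + 1;
        }
        return runs;
    }

    // Crude stand-in for the Analyzer's stemming: lowercase, strip a trailing 's'.
    static boolean isTop(String token, Set<String> topTerms) {
        String t = token.toLowerCase().replaceAll("s$", "");
        return topTerms.contains(t);
    }

    public static void main(String[] args) {
        Set<String> top = Set.of("united", "state", "america");
        // prints [United States of America]
        System.out.println(findRuns("The United States of America invades", top));
    }
}
```

The point of the example: the stemmed top terms drive the
matching, but the surface forms from the original text are
what get stored in the run.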
Step 3 is needed to consolidate all of the learning
about term usage. For example, the code may choose to
collapse the run "United States Of America invades"
into the shorter "United States" run, because the
longer run occurs much less often and all of the
shorter run's terms are contained in the longer one.
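One way the collapsing rule could be sketched (the
dominance threshold and all names here are assumptions
of mine, not the actual implementation):

```java
import java.util.*;

public class RunCollapser {
    // Assumed threshold: the shorter run must be at least this many
    // times more frequent before it absorbs a longer run.
    static final int DOMINANCE = 5;

    // Fold a rare longer run into a much more frequent shorter run
    // whose terms it contains, summing their occurrence counts.
    static Map<String, Integer> collapse(Map<String, Integer> runCounts) {
        Map<String, Integer> result = new HashMap<>(runCounts);
        for (String longer : runCounts.keySet()) {
            for (String shorter : runCounts.keySet()) {
                if (longer.equals(shorter)) continue;
                Set<String> longerTerms = terms(longer);
                Set<String> shorterTerms = terms(shorter);
                if (result.containsKey(longer)
                        && longerTerms.containsAll(shorterTerms)
                        && longerTerms.size() > shorterTerms.size()
                        && runCounts.get(shorter) >= DOMINANCE * runCounts.get(longer)) {
                    // credit the longer run's count to the shorter run
                    result.merge(shorter, result.remove(longer), Integer::sum);
                    break;
                }
            }
        }
        return result;
    }

    static Set<String> terms(String run) {
        return new HashSet<>(Arrays.asList(run.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("United States", 50);
        counts.put("United States Of America invades", 2);
        // prints {United States=52}
        System.out.println(collapse(counts));
    }
}
```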
Step 4 ranks the phrases and terms to produce a
shortlist containing both. Some terms are always
used in phrases (so will not be selected as single
terms). Some terms *never* appear in a phrase, so
they are considered for shortlisting on their own.
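The shortlisting logic above could be sketched like this
(scores and signatures are illustrative assumptions; the
real scoring comes from Step 1):

```java
import java.util.*;
import java.util.stream.*;

public class Shortlister {
    // Merge phrases and standalone terms into one ranked shortlist.
    // A term that appears inside any selected phrase is dropped (the
    // phrase covers it); a term that never joins a phrase competes
    // for a shortlist slot in its own right.
    static List<String> shortlist(Map<String, Double> termScores,
                                  Map<String, Double> phraseScores,
                                  int size) {
        Map<String, Double> candidates = new HashMap<>(phraseScores);
        for (Map.Entry<String, Double> e : termScores.entrySet()) {
            boolean insidePhrase = phraseScores.keySet().stream()
                    .anyMatch(p -> terms(p).contains(e.getKey().toLowerCase()));
            if (!insidePhrase) {
                candidates.put(e.getKey(), e.getValue());
            }
        }
        return candidates.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(size)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    static Set<String> terms(String phrase) {
        return new HashSet<>(Arrays.asList(phrase.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        Map<String, Double> terms = Map.of("lucene", 0.9, "united", 0.7);
        Map<String, Double> phrases = Map.of("United States", 0.8);
        // "united" is swallowed by its phrase; prints [lucene, United States]
        System.out.println(shortlist(terms, phrases, 3));
    }
}
```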

There are probably a number of ways in which these
different phases could be implemented, but I've found
them all to be necessary if you want to present the
findings in a readable form to end-users.
