lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: easy way to figure out most common tokens?
Date Wed, 15 Aug 2012 18:29:34 GMT
I don't see how you could without indexing everything first,
since you can't know what the most frequent terms are until
you've processed all your documents.

If you know these terms in advance, it seems like you could
just call them stopwords and use the common stopword
processing.

If you have to examine your corpus in the first place,
it seems like you could do something with term
frequencies to extract the most common terms from
your index, then re-index all your data with those terms
as stopwords.
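
A rough sketch of that two-pass idea, in plain Java with no Lucene
calls, so every name here (CommonTokens, commonTokens, removeStopwords,
the 0.9 cutoff) is made up for illustration. It assumes each document
has already been reduced to a list of tokens. If the index already
exists, I believe the HighFreqTerms class in Lucene's misc contrib can
read the most frequent terms from it directly, which is probably the
shorter route.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonTokens {

    // Pass 1a: count how many documents each token appears in.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so a token counts once per document.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Pass 1b: tokens present in more than maxDocRatio of all documents.
    // The cutoff is an assumption; tune it for your corpus.
    static Set<String> commonTokens(List<List<String>> docs, double maxDocRatio) {
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency(docs).entrySet()) {
            if (e.getValue() > maxDocRatio * docs.size()) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    // Pass 2: drop the common tokens before (re-)indexing a document.
    static List<String> removeStopwords(List<String> doc, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String token : doc) {
            if (!stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String footer = "java-user-unsubscribe@lucene.apache.org";
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("alice@example.com", footer),
            Arrays.asList("bob@example.com", footer),
            Arrays.asList("carol@example.com", footer));

        Set<String> stop = commonTokens(docs, 0.9);
        System.out.println(stop);
        System.out.println(removeStopwords(docs.get(0), stop));
    }
}
```

The second pass is where the re-indexing cost shows up: every document
has to be run through the filter again, which is why knowing the
tokens up front and treating them as stopwords is cheaper.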

Best
Erick

On Wed, Aug 15, 2012 at 11:46 AM, Shaya Potter <spotter@gmail.com> wrote:
> Is there an easy way to figure out the most common tokens and then remove
> those tokens from the documents?
>
> use case: imagine one is indexing a mailing list (such as this java-user)
> and is extracting all e-mail addresses in the messages and adding them to a
> doc.
>
> What that means is that there will be a lot of
>
> java-user-unsubscribe@lucene.apache.org
> java-user-help@lucene.apache.org
>
> due to that being in the signature of each email.
>
> While the best approach might be to not put it in the index in the first
> place, I'm wondering if there's a good way to process the index after the
> fact to remove these types of entries.
>
> thanks.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
