lucene-java-user mailing list archives

From Halácsy Péter <halacsy.pe...@axelero.com>
Subject summarizing & highlighting
Date Mon, 15 Apr 2002 23:28:00 GMT
Hello,
I implemented a summarizing & highlighting component that condenses longer texts for display
on a result page. It is not well commented or documented, but it may be useful to others.

algorithm:
1. extract terms of query (needs Lucene modification: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00037.html)
2. tokenize the text and collect the set T of tokens that could be highlighted (tokens whose
term is a query term)
3. make fragments: a fragment is a pair of tokens from T (more formally, an element of TxT);
it is the substring of the text from leftToken to rightToken (leftToken can be equal to rightToken)
4. sort the fragments by their weight (the length of the fragment and how many tokens are in
the fragment)
5. get the first N fragments where N is less than a limit (maxFragments; default 3) and the
length of the fragments is less than a limit (maxLen; default 300)
6. build the output string and highlight the tokens that are in T
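
Steps 2 to 6 can be sketched roughly as follows. This is only an illustration, not the
component described in this message: the class name FragmentSummarizer, the regex tokenizer,
the <b> markup and the " ... " separator are all assumptions. The weight is assumed to mean
"more highlightable tokens first, shorter fragment on ties", maxLen is assumed to apply to
each fragment separately, and overlapping fragments are skipped, which the steps above do not
specify. Step 1 is left out; the query terms are passed in as a lowercase set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of steps 2-6; not the original implementation. */
final class FragmentSummarizer {

    /** A fragment spans the highlightable tokens first..last (indices into the hit list). */
    private static final class Fragment {
        final int first, last, start, end;
        Fragment(int first, int last, int start, int end) {
            this.first = first; this.last = last; this.start = start; this.end = end;
        }
        int hits() { return last - first + 1; }
        int length() { return end - start; }
    }

    /** queryTerms must already be lowercase (assumption of this sketch). */
    static String summarize(String text, Set<String> queryTerms, int maxFragments, int maxLen) {
        // Step 2: tokenize and collect T as [start, end) character offsets.
        List<int[]> hits = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                hits.add(new int[] {m.start(), m.end()});
            }
        }

        // Step 3: every pair (leftToken, rightToken) of T with left <= right is a fragment.
        // Fragments of maxLen characters or more are dropped right away (part of step 5).
        List<Fragment> candidates = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            for (int j = i; j < hits.size(); j++) {
                int start = hits.get(i)[0];
                int end = hits.get(j)[1];
                if (end - start >= maxLen) break; // longer pairs only get longer
                candidates.add(new Fragment(i, j, start, end));
            }
        }

        // Step 4 (assumed weight): most highlightable tokens first, shorter fragment on ties.
        candidates.sort((a, b) -> a.hits() != b.hits()
                ? Integer.compare(b.hits(), a.hits())
                : Integer.compare(a.length(), b.length()));

        // Step 5: take up to maxFragments fragments, skipping overlaps (assumption).
        List<Fragment> chosen = new ArrayList<>();
        for (Fragment f : candidates) {
            if (chosen.size() == maxFragments) break;
            boolean overlaps = false;
            for (Fragment c : chosen) {
                if (f.first <= c.last && c.first <= f.last) { overlaps = true; break; }
            }
            if (!overlaps) chosen.add(f);
        }
        chosen.sort((a, b) -> Integer.compare(a.start, b.start)); // back to text order

        // Step 6: build the output and wrap every token of T in <b>...</b>.
        StringBuilder out = new StringBuilder();
        for (Fragment f : chosen) {
            if (out.length() > 0) out.append(" ... ");
            int pos = f.start;
            for (int k = f.first; k <= f.last; k++) {
                int[] hit = hits.get(k);
                out.append(text, pos, hit[0]);
                out.append("<b>").append(text, hit[0], hit[1]).append("</b>");
                pos = hit[1];
            }
        }
        return out.toString();
    }
}
```

Calling FragmentSummarizer.summarize(text, terms, 3, 300) uses the defaults given in step 5.
When no token of the text matches a query term, this sketch returns an empty string.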

I have tested it only on relatively short texts. I know this is not a very good algorithm,
and I am planning to improve it.

peter

