lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ulrich Mayring <>
Subject Re: New Lucene-powered Website
Date Fri, 28 Nov 2003 09:21:11 GMT
Akmal Sarhan wrote:
> nice and fast ;-)
> would be interesting though to know how you implemented the "summarizer".

Basically, the algorithm is statistics-based. It selects a configurable 
number of sentences from the text (in our case three) and, in case the 
sentences are too long, it cuts off after a configurable number of 
characters (in our case 350).

So the sentences you read are real sentences from the text, therefore I 
don't have to worry about syntax and can concentrate on semantics. The 
trick is to find sentences that are especially relevant to the query. 
There are a number of heuristics that can be employed, like the number 
of times the search terms appear or their position within the sentences.

Apart from that I've found that quality increases significantly, when 
the text is "cleaned up" before summarising. That involves identifying 
"real" sentences, so your summary doesn't start with "Home News Contact 
Sitemap" or somesuch. There's a lot of text on a web page, which is not 
content and therefore detrimental to the summariser (and I'm not even 
speaking about tags).

Also, highly structured documents are a problem (for example, search for 
"dissolution" on our website). Headlines are usually very important 
sentences that should score higher than regular sentences, but they are 
not easy to identify as sentences, because they don't end with a full-stop.

This "clean-up work" is actually trickier than the summarising itself 
and it is usually very domain-specific. That's the reason why I haven't 
proposed to contribute the summariser to Lucene, because the clean-up 
code is not generic. The summariser itself is just one class with 300 
lines, but without prior clean-up the quality of its summaries is 



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message