lucene-java-user mailing list archives

From Bob Carpenter <c...@alias-i.com>
Subject Re: Content Summarization
Date Tue, 19 Jun 2007 20:30:43 GMT
>> Does anyone know of a content summarization library?

> Take a look at LingPipe
> (http://alias-i.com/lingpipe/). 

I'm afraid LingPipe doesn't do content summarization.

Basically, it's an AI-hard problem: you need
fairly deep understanding of the text in order
not to produce word salad.  Here are two relatively
recent tutorials, one from ACL and one from
SIGIR; I can vouch for both authors knowing their
stuff:

      http://www.isi.edu/~marcu/acl-tutorial.ppt

      http://www.summarization.com/sigirtutorial2001.ppt

Drago Radev, who wrote the second tutorial,
is distributing some relevant software these days:

      http://tangra.si.umich.edu/clair/clairlib/

and

      http://www.summarization.com/mead/

I haven't played with either.


The simplest summarization method, and a hard baseline
to beat, is to take the first n sentences of a news article.
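
For concreteness, here is a rough sketch of that
first-n-sentences baseline.  It assumes plain English
text and uses the JDK's java.text.BreakIterator to find
sentence boundaries; the class and method names
(LeadBaseline, firstSentences, summarize) are made up
for illustration and do not come from any of the
libraries mentioned in this message.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: a lead-sentence ("first n") summarization baseline.
class LeadBaseline {

    // Returns up to n leading sentences of the text, trimmed of
    // surrounding whitespace. Sentence boundaries come from the JDK's
    // rule-based BreakIterator, which can mis-split on abbreviations.
    static List<String> firstSentences(String text, int n) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next();
             end != BreakIterator.DONE && sentences.size() < n;
             start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Joins the leading sentences into a single summary string.
    static String summarize(String text, int n) {
        return String.join(" ", firstSentences(text, n));
    }
}
```

Calling LeadBaseline.summarize(articleText, 3) would
return the first three sentences of the article joined
by single spaces.  A real system would likely want a
trained sentence detector rather than the JDK's rules.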

The problem with beating that baseline is that newswire
is written to be read this way, with a summary
up front.  Going further and summarizing individual
sentences means extracting the relevant parts and
formulating a new grammatical sentence from them.

The state of the art is only so-so at best, and not
what you'd want to give to an audience that cared about
precision or grammar.  Our experience with intelligence
analysts was that they preferred sentence-level snippets,
because those could at least be trusted not to garble
the underlying article.

If you just want the gist of a large batch of news,
automatic summarization can still be useful.

The best-known research in this area comes
out of Kathy McKeown's group at Columbia,
not to mention the horde of students she's graduated
over the last ten years, such as Drago Radev, the
author of the second tutorial and the software above.

- Bob Carpenter
   Alias-i
