lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <>
Subject [jira] [Commented] (SOLR-3975) Document Summarization toolkit, using LSA techniques
Date Wed, 24 Oct 2012 05:04:21 GMT


Otis Gospodnetic commented on SOLR-3975:

Nice, 170KB patch there Lance! :)
I see lots of classes don't have the ASL header, btw.
> Document Summarization toolkit, using LSA techniques
> ----------------------------------------------------
>                 Key: SOLR-3975
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: 4.1.summary.patch,
> This package analyzes sentences and words as used across sentences to rank the most important sentences and words. The general topic is called "document summarization" and is a popular research topic in textual analysis.
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example instance.
> 2) Download the first Reuters article corpus from:
> 3) Unpack this into a directory.
> 4) Run the attached '' script:
> sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look at the large gray box marked 'Document Summary'. This has a table of statistics about the analysis, the three most important sentences, and several of the most important words in the documents. The sentences have the important words in italics.
> The code is packaged as a search component and as an analysis handler. The /browse demo uses the search component, and you can also post raw text to http://localhost:8983/solr/collection1/analysis/summary. Here is a sample command:
> {code}
> curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" --data-binary @$FILE -H 'Content-type:application/xml'
> {code}
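For anyone who would rather script this than use curl, here is a rough Python equivalent of the sample command. It is only a sketch: the URL, query parameters and Content-type header are copied from the curl line above, while the function names (build_summary_request, fetch_summary), the UTF-8 encoding and the timeout are assumptions made for illustration and are not part of the patch.

```python
import urllib.parse
import urllib.request


def build_summary_request(text, file_label, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST to the analysis/summary handler.

    Hypothetical helper: it mirrors the curl command's parameters only.
    """
    params = urllib.parse.urlencode({
        "indent": "true",
        "echoParams": "all",
        "file": file_label,
        "wt": "xml",
    })
    url = base_url + "/analysis/summary?" + params
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),  # assumption: the posted text is UTF-8
        headers={"Content-type": "application/xml"},
        method="POST",
    )


def fetch_summary(request, timeout=60):
    """Send the request; needs a running Solr instance with the patch applied."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes it possible to inspect the URL and headers without a Solr instance running; call fetch_summary(build_summary_request(text, "article.xml")) once the example server is up.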
> This is an implementation of LSA-based document summarization. A short explanation and a long evaluation are described in my blog, [Uncle Lance's Ultra Whiz Bang|], starting here: []
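
To give a feel for what "LSA-based" ranking means, here is a minimal pure-Python sketch. It is not the code in the patch. It assumes one common simplification: build a term-by-sentence count matrix, take only the leading singular vector pair (found here by power iteration rather than a full SVD), and rank sentences and words by the size of their components. The sentence splitter, tokenizer and all function names are illustrative assumptions.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased runs of ASCII letters; no stemming or stop-word removal.
    return re.findall(r"[a-z]+", sentence.lower())


def term_sentence_matrix(sentences):
    # Rows are terms, columns are sentences, cells are raw counts.
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(sentences) for _ in vocab]
    for j, sentence in enumerate(sentences):
        for w in tokenize(sentence):
            matrix[index[w]][j] += 1.0
    return vocab, matrix


def leading_singular_vectors(matrix, iterations=100):
    # Power iteration on A^T A gives the leading right singular vector v
    # (one weight per sentence); u = A v / |A v| gives one weight per term.
    n_terms, n_sents = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_sents)] * n_sents
    for _ in range(iterations):
        av = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]
        atav = [sum(matrix[i][j] * av[i] for i in range(n_terms))
                for j in range(n_sents)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            break
        v = [x / norm for x in atav]
    u = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return u, v


def summarize(text, top_sentences=3, top_words=5):
    # Returns (ranked sentences, ranked words), most important first.
    sentences = split_sentences(text)
    if not sentences:
        return [], []
    vocab, matrix = term_sentence_matrix(sentences)
    if not vocab:
        return [], []
    u, v = leading_singular_vectors(matrix)
    by_sentence = sorted(range(len(sentences)), key=lambda j: -abs(v[j]))
    by_word = sorted(range(len(vocab)), key=lambda i: -abs(u[i]))
    return ([sentences[j] for j in by_sentence[:top_sentences]],
            [vocab[i] for i in by_word[:top_words]])
```

Calling summarize(article_text) returns the top sentences and top words, which is roughly the shape of what the 'Document Summary' box shows. A fuller LSA summarizer would normally weight the counts and use more than one singular vector.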
