lucene-dev mailing list archives

From "Lance Norskog (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-3975) Document Summarization toolkit, using LSA techniques
Date Mon, 22 Oct 2012 10:34:12 GMT

     [ https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated SOLR-3975:
--------------------------------

    Description: 
This package analyzes sentences and words as used across sentences to rank the most important
sentences and words. The general topic is called "document summarization" and is a popular
research topic in textual analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.
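
Steps 2 through 4 can be sketched as a small shell script. This is only a sketch under stated assumptions, not part of the patch: the directory name `reuters-corpus` and the `run` dry-run helper are made up for illustration, and it assumes `curl` and `tar` are installed and that the attached `reuters.sh` is in the current directory. By default it only prints the commands it would run; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch of steps 2-4. CORPUS_DIR and the run() helper are illustrative
# assumptions, not part of the patch.
CORPUS_URL="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz"
CORPUS_DIR="reuters-corpus"
SOLR_URL="http://localhost:8983/solr/collection1"
DRY_RUN="${DRY_RUN:-1}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 2: download the corpus archive into the current directory.
run curl -O "$CORPUS_URL" || exit 1
# Step 3: unpack it into its own directory.
run mkdir -p "$CORPUS_DIR" || exit 1
run tar -xzf reuters21578.tar.gz -C "$CORPUS_DIR" || exit 1
# Step 4: index the unpacked articles with the attached script.
run sh reuters.sh "$CORPUS_DIR" "$SOLR_URL" || exit 1
```

Indexing still takes several minutes after the last command returns control, as step 5 notes.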

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look at the large
gray box marked 'Document Summary'. This has a table of statistics about the analysis, the
three most important sentences, and several of the most important words in the documents.
The sentences have the important words in italics.

The code is packaged as a search component and as an analysis handler. The /browse demo uses
the search component, and you can also post raw text to http://localhost:8983/solr/collection1/analysis/summary.
Here is a sample command:
{code}
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary "@$FILE" -H 'Content-type:application/xml'
{code}
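
For repeated use, the sample command can be wrapped in a shell function. This is a hedged sketch, not part of the patch: the names `summary_url`, `summarize`, and `SOLR_BASE` are assumptions introduced here, and the request parameters are copied unchanged from the sample command. It assumes file names without spaces or other characters that would need URL-encoding.

```shell
#!/bin/sh
# Sketch: wrap the sample command in a function. The function names and the
# SOLR_BASE variable are illustrative assumptions, not part of the patch.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the request URL used by the sample command for a given file name.
summary_url() {
    echo "$SOLR_BASE/analysis/summary?indent=true&echoParams=all&file=$1&wt=xml"
}

# Post one file to the handler; fail early if the file is missing.
summarize() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    # --fail makes curl return a non-zero status on an HTTP error response.
    curl -s --fail "$(summary_url "$1")" \
        --data-binary "@$1" -H 'Content-type:application/xml'
}

# Usage: summarize article.xml
```

Checking for the file first avoids sending an empty request when `$1` is mistyped.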

This is an implementation of LSA-based document summarization. A short explanation and a long
evaluation are described in my blog, [Uncle Lance's Ultra Whiz Bang|http://ultrawhizbang.blogspot.com],
starting here: [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]



  was:
This package analyzes sentences and words as used across sentences to rank the most important
sentences and words. The general topic is called "document summarization" and is a popular
research topic in textual analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look at the large
gray box marked 'Document Summary'. This has a table of statistics about the analysis, the
three most important sentences, and several of the most important words in the documents.
The sentences have the important tags in italics.

The code is packaged as a search component and as an analysis handler. The /browse demo uses
the search component, and you can also post raw text to  http://localhost:8983/solr/collection1/analysis/summary.
Here is a sample command:
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml"
--data-binary @$FILE -H 'Content-type:application/xml'

This is an implementation of LSA-based document summarization. A short explanation and a long
evaluation are described in my blog, [Uncle Lance's Ultra Whiz Bang|http://ultrawhizbang.blogspot.com],
starting here: [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]



    
> Document Summarization toolkit, using LSA techniques
> ----------------------------------------------------
>
>                 Key: SOLR-3975
>                 URL: https://issues.apache.org/jira/browse/SOLR-3975
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: 4.1.summary.patch, reuters.sh

