lucene-java-user mailing list archives

From Bill Janssen <jans...@parc.com>
Subject Re: Keyphrase Extraction (via Lingo)
Date Wed, 09 May 2007 18:13:10 GMT
> Dawid Weiss wrote:
> > You could also try splitting the document into paragraphs and use Carrot2's 
> > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. 
> > Labelling routine in Lingo should extract 'key' phrases; this analysis is 
> > heavily frequency-based, but... you know, you may want to try it.
> 
> Just to make sure I'm following...
> 
> So you're suggesting splitting the document into paragraphs, then
> treating each paragraph as if it were a Carrot2 search result,
> performing the clustering, then looking at the label Lingo chooses for
> each cluster, and treating that label as the "key phrase"?

I tried it.  The results weren't great, but perhaps I'm doing it wrong.
Here's my code.  Each (long) line of the input file holds an ID number
followed by one ordinary prose paragraph.  I'm running it on a corpus of
technical papers.

Bill

-----------------------------------------------------------------
import org.carrot2.filter.lingo.common.*;
import org.carrot2.filter.lingo.lsicluster.*;

import java.io.*;

public class test {

    public static void main (String[] argv) {
        DefaultClusteringContext context = new DefaultClusteringContext();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader r = new BufferedReader(new FileReader(argv[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Split off the leading ID; the limit of 2 leaves the rest
                // of the line intact as the paragraph body.
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length > 1) {
                    String id = parts[0];
                    String body = parts[1];
                    context.addSnippet(new Snippet(id, "", body));
                }
            }
        } catch (Exception x) {
            x.printStackTrace(System.err);
        }
        
        context.setQuery("");
        Cluster[] clusters = context.cluster();
        for (int i = 0;  i < clusters.length;  i++) {
            System.out.println("Cluster --");
            String[] labels = clusters[i].getLabels();
            for (int j = 0;  j < labels.length;  j++) {
                System.out.println("    Label:  " + labels[j]);
            }
            Snippet[] snippets = clusters[i].getSnippets();
            System.out.println("    " + snippets.length + " snippets:");
            for (int j = 0;  j < snippets.length;  j++) {
                System.out.println("         " + snippets[j].getSnippetId() +
                                   " -- " + snippets[j].getText());
            }
        }
    }
}
----------------------------------------------------------------
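
The input format described above (an ID number followed by one paragraph
per line) could be produced from a plain-text document with something like
the sketch below.  It is only an illustration: the class name
ParagraphLines, the helper toInputLines, the blank-line paragraph rule and
the running-number IDs are all assumptions, since the message does not say
how the input file was prepared.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphLines {

    // Splits raw text on blank lines and returns one "ID paragraph" line per
    // paragraph. IDs are running numbers starting at 1 (an assumption; the
    // original message does not say how its IDs were assigned).
    static List<String> toInputLines(String text) {
        List<String> lines = new ArrayList<String>();
        int id = 1;
        // A paragraph break is a newline, optional whitespace, another newline.
        for (String para : text.split("\\r?\\n\\s*\\r?\\n")) {
            // Collapse internal line breaks and runs of whitespace to one space.
            String body = para.trim().replaceAll("\\s+", " ");
            if (!body.isEmpty()) {
                lines.add(id + " " + body);
                id++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String doc = "First paragraph,\nwrapped over two lines.\n\n"
                   + "Second   paragraph.\n\n\n"
                   + "Third paragraph.";
        for (String line : toInputLines(doc)) {
            System.out.println(line);
        }
    }
}
```

Each output line starts with the ID and a single space, so a reader that
splits a line at its first run of whitespace recovers the ID and the
paragraph body.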
