lucene-java-user mailing list archives

From syedfa <fayyazud...@gmail.com>
Subject Creating an index from an XML file using Lucene in Java
Date Sun, 27 Jul 2008 17:58:50 GMT

Dear fellow Java/Lucene developers:

I have a question about creating an index from an XML document so that it
can be searched using the Lucene API in Java.

I am searching Shakespeare's "Hamlet", which I have as an XML document.  I
want to include commentary on each scene and would like to make this section
searchable for the user as well.  At present, I search through a set of
<SPEECH> tags, each of which represents a particular character's dialogue.
With my new arrangement, each scene, which is composed of several characters'
respective dialogues, will be enclosed in a pair of <SCENE></SCENE> tags
and will have a set of <SCENE-COMMENTARY></SCENE-COMMENTARY> tags at the top,
which will provide the commentary for the scene that follows.  How would I
modify my indexing code (which follows the XML document) to create an index
that lets the user search the <SCENE-COMMENTARY> section just as easily as
the text contained in the <SPEECH> tags?  Once I have accomplished this, I
would like to search that text and display the results to the user just as
easily as if they were searching through the <SPEECH> tags.
I have also listed the code for searching the current index.

Thanks in advance to everyone who replies.

Sincerely,
Fayyaz


Here is the XML snippet for the play:

<PLAY>
<TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
<SCENE>
<SCENE-COMMENTARY>Here is where I will include commentary on the scene that
follows, which I would also like to make searchable to the
user.</SCENE-COMMENTARY>
<SPEECH>
<REFERENCE>ACT 1, SCENE 1</REFERENCE>
<SPEAKER>LORD POLONIUS</SPEAKER>
<LINES>Yet here, Laertes! aboard, aboard, for shame!
The wind sits in the shoulder of your sail,
And you are stay'd for. There; my blessing with thee!
And these few precepts in thy memory
See thou character. Give thy thoughts no tongue,
Nor any unproportioned thought his act.
Be thou familiar, but by no means vulgar.
Those friends thou hast, and their adoption tried,
Grapple them to thy soul with hoops of steel;
But do not dull thy palm with entertainment
Of each new-hatch'd, unfledged comrade. Beware
Of entrance to a quarrel, but being in,
Bear't that the opposed may beware of thee.
Give every man thy ear, but few thy voice;
Take each man's censure, but reserve thy judgment.
Costly thy habit as thy purse can buy,
But not express'd in fancy; rich, not gaudy;
For the apparel oft proclaims the man,
And they in France of the best rank and station
Are of a most select and generous chief in that.
Neither a borrower nor a lender be;
For loan oft loses both itself and friend,
And borrowing dulls the edge of husbandry.
This above all: to thine ownself be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.
Farewell: my blessing season this in thee!</LINES>
</SPEECH>
<SPEECH>
<SPEAKER>HAMLET</SPEAKER>
<REFERENCE>ACT 1, SCENE 2</REFERENCE>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<REFERENCE>ACT 1, SCENE 3</REFERENCE>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
<SPEECH>
<REFERENCE>ACT 1, SCENE 4</REFERENCE>
<SPEAKER>HAMLET</SPEAKER>
<LINES>To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despised love, the law's delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.--Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember'd.</LINES>
</SPEECH>
</SCENE>
</PLAY>
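
To make the question concrete, here is a minimal sketch of how the structure above might be walked with SAX so that the commentary ends up as its own searchable record. It deliberately leaves Lucene out and collects plain maps instead; the class name SceneCommentarySketch, the parse helper and the records list are hypothetical names used only for illustration, and each map stands in for what would become one Lucene Document. The one thing it tries to show is that </SCENE-COMMENTARY> closes before any <SPEECH> opens, so the commentary cannot simply be added to the current speech document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects one record per SCENE-COMMENTARY and one per SPEECH.
class SceneCommentarySketch extends DefaultHandler {

    // Each map stands in for one document that would be handed to the index writer.
    private final List<Map<String, String>> records = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private Map<String, String> current; // the SPEECH being built, or null

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting text for the new element
        if (qName.equals("SPEECH")) {
            current = new LinkedHashMap<>();
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        buffer.append(text, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = buffer.toString().trim();
        if (qName.equals("SCENE-COMMENTARY")) {
            // The commentary ends before the first SPEECH starts,
            // so it becomes a record of its own.
            Map<String, String> commentary = new LinkedHashMap<>();
            commentary.put("SCENE-COMMENTARY", value);
            records.add(commentary);
        } else if (qName.equals("SPEECH")) {
            records.add(current); // the speech is complete at its closing tag
            current = null;
        } else if (current != null
                && (qName.equals("REFERENCE") || qName.equals("SPEAKER") || qName.equals("LINES"))) {
            current.put(qName, value);
        }
    }

    // Parses an XML string and returns the collected records in document order.
    static List<Map<String, String>> parse(String xml) throws Exception {
        SceneCommentarySketch handler = new SceneCommentarySketch();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PLAY><SCENE>"
                + "<SCENE-COMMENTARY>Commentary on the scene.</SCENE-COMMENTARY>"
                + "<SPEECH><REFERENCE>ACT 1, SCENE 1</REFERENCE>"
                + "<SPEAKER>HAMLET</SPEAKER><LINES>To be, or not to be</LINES></SPEECH>"
                + "</SCENE></PLAY>";
        for (Map<String, String> record : parse(xml)) {
            System.out.println(record);
        }
    }
}
```

In the real handler, the two places that call records.add would presumably be where a Document is built and passed to indexWriter.addDocument. At search time the commentary could then be queried by constructing the QueryParser with "SCENE-COMMENTARY" as its field name instead of "LINES", though that part is untested here.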


Here is my indexing code:

package hamlet;
 
import java.io.InputStream;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.HashMap;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
 
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
 
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
 
public class HamletHandler extends DefaultHandler implements DocumentHandler
{
 
    //the directory that stores xml files
    private final String dataDir  = "c:\\dataD";
    //the directory that is used to store lucene index
    private final String indexDir = "c:\\indexD";
    
	private StringBuffer elementBuffer=new StringBuffer();
	private HashMap attributeMap;
	private Document doc;
	static IndexWriter indexWriter;
	
	
	public Document getDocument(InputStream is) throws DocumentHandlerException
{
		// TODO Auto-generated method stub
		SAXParserFactory spf=SAXParserFactory.newInstance();
		
		try{
			SAXParser parser=spf.newSAXParser();
			parser.parse(is, this);
		}
		catch(IOException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		catch(ParserConfigurationException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		catch(SAXException e){
			throw new DocumentHandlerException("Cannot parse XML document", e);
		}
		
		return doc;
	}
 
	public void startDocument(){
		//doc=new Document();
	}
	
	public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException{
	
		if(qName.equals("SPEECH")){
			doc=new Document();
		}
		elementBuffer.setLength(0);
		//attributeMap.clear();
		if(atts.getLength()>0){
			attributeMap=new HashMap();
			for(int i=0; i<atts.getLength(); i++){
				attributeMap.put(atts.getQName(i), atts.getValue(i));
			}
		}
	}
	public void characters(char[] text, int start, int length){
		elementBuffer.append(text, start, length);
		
	}
	
	public void endElement(String uri, String localName, String qName) throws SAXException{
		
		try {
			
			if(qName.equals("REFERENCE")){
				//stored for display only; not searchable
				Field reference = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.NO, Field.TermVector.NO);
				doc.add(reference);
			}
			
			else if(qName.equals("SPEAKER")){
				//stored and tokenized, boosted above the spoken lines
				Field speaker = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
				speaker.setBoost(2.0f);
				doc.add(speaker);
			}
			else if(qName.equals("LINES")){
				Field lines = new Field(qName, elementBuffer.toString(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
				lines.setBoost(1.0f);
				doc.add(lines);
				//LINES is the last child of SPEECH in the sample, so the document is written here
				indexWriter.addDocument(doc);
			}
			else{
				return;
			}
				
		} catch (CorruptIndexException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
 
		
		
	}
	/**
	 * @param args
	 */
	public static void main(String[] args) throws Exception{
		File index=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
		Directory fsDirectory = FSDirectory.getDirectory(index);
    	Analyzer  analyzer    = new StandardAnalyzer();
    	indexWriter = new IndexWriter(fsDirectory, analyzer, true);
		HamletHandler handler=new HamletHandler();
		Document doc=handler.getDocument(new FileInputStream(new File(args[0])));
		int numIndexed=indexWriter.docCount();
		System.out.println(numIndexed);
		indexWriter.optimize();
    	indexWriter.close();
 
	}
 
}


and here is my searcher code:

package search;
 
/*
 * Searcher.java
 *
 * Created on August 6, 2007, 8:46 PM
 *
 * To change this template, choose Tools | Template Manager
 * and open the template in the editor.
 */
 
 
 
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.io.IOException;
 
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer ;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.Query ;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
/**
 *
 * 
 */
public class Searcher {
    
    /** Creates a new instance of Searcher */
    
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception{ 
        
    	Searcher searchDoc=new Searcher();
        File indexDir=new File("c:\\Documents and Settings\\Fayyazuddin A Syed\\My Documents\\indexD");
        String q="SLINGS AND ARROWS";
        String s="think~";
        if (s.contains("?") || s.contains("*")){
        	System.out.println("this is a wildcard search");
        }
        else if (s.contains("~")){
        	System.out.println("this is a fuzzy search");
        }
        else {
        	System.out.println("this is a normal search");
        }
        
        if(!indexDir.exists() || !indexDir.isDirectory()){
            throw new Exception(indexDir + " does not exist or is not a directory.");
        }
        //searchDoc.wildSearch(indexDir);
        searchDoc.search(indexDir, q);
        //searchDoc.fuzzySearch(indexDir);
        
        
    }
    
    public List search(File indexDir, String q) throws Exception {
         
    	List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is=new IndexSearcher(fsDir);
        
        Analyzer analyser = new StandardAnalyzer();
        Query parser=new QueryParser("LINES", analyser).parse(q);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
    	Highlighter highlighter = new Highlighter(formatter, scorer);
    	Highlighter high = new Highlighter(formatter, scorer);
    	Fragmenter fragmenter = new NullFragmenter();
    	Fragmenter fragment = new SimpleFragmenter(250);
    	highlighter.setTextFragmenter(fragmenter);
    	high.setTextFragmenter(fragment);
    	
        for(int i=0; i<hits.length(); i++){
        	Document doc=hits.doc(i);
        	String lns = doc.get("LINES");
         TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
         CachingTokenFilter filter = new CachingTokenFilter(lines);
         String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
         String highlight = high.getBestFragment(filter, lns);
        	SearchResult resultBean = new SearchResult();
        	resultBean.setReference(hits.doc(i).get("REFERENCE"));
        	resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
        	resultBean.setHitResult(highlight);
        	resultBean.setQuote(highlightedLines);
        	searchResult.add(resultBean);
        	System.out.println(resultBean.getReference());
        	System.out.println(resultBean.getNarrator());
         	System.out.println(resultBean.getHitResult());
         	System.out.println("");
        	System.out.println(resultBean.getQuote());
        	System.out.println("");
        }
        
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + q + "':"); 
        
        return searchResult;        
    }
    
    public List wildSearch(File indexDir) throws Exception {
    	
        List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir); 
        IndexReader ir = IndexReader.open(fsDir);               
        Analyzer analyser = new StandardAnalyzer();
        Query parser=new WildcardQuery(new Term("LINES", "the*"));
        parser=parser.rewrite(ir);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
    	Highlighter highlighter = new Highlighter(formatter, scorer);
        	Highlighter high = new Highlighter(formatter, scorer);
        	Fragmenter fragmenter = new NullFragmenter();
    	Fragmenter fragment = new SimpleFragmenter(250);
    	highlighter.setTextFragmenter(fragmenter);
    	high.setTextFragmenter(fragment);
    	
        for(int i=0; i<hits.length(); i++){
        	Document doc=hits.doc(i);
        	String lns = doc.get("LINES");
            TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
            CachingTokenFilter filter = new CachingTokenFilter(lines);
            String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
            String highlight = high.getBestFragment(filter, lns);
            SearchResult resultBean = new SearchResult();
        	   resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
        	   resultBean.setHitResult(highlight);
        	   resultBean.setQuote(highlightedLines);
        	   searchResult.add(resultBean);
        	   System.out.println(resultBean.getNarrator());
         	   System.out.println(resultBean.getHitResult());
         	   System.out.println("");
        	   System.out.println(resultBean.getQuote());
        	   System.out.println("");
        }
        
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + "':"); 
        
        return searchResult;
    }
    
public List fuzzySearch(File indexDir) throws Exception {
    	
        List searchResult = new ArrayList();
        Directory fsDir=FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir); 
        IndexReader ir = IndexReader.open(fsDir);               
        Analyzer analyser = new StandardAnalyzer();
        //FuzzyQuery takes the raw term; the trailing ~ is QueryParser syntax only
        Query parser=new FuzzyQuery(new Term("LINES", "the"));
        parser=parser.rewrite(ir);
        long start=new Date().getTime();
        Hits hits=is.search(parser);
        long end=new Date().getTime();
        QueryScorer scorer = new QueryScorer(parser);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", "");
    	Highlighter highlighter = new Highlighter(formatter, scorer);
        	Highlighter high = new Highlighter(formatter, scorer);
        	Fragmenter fragmenter = new NullFragmenter();
    	Fragmenter fragment = new SimpleFragmenter(250);
    	highlighter.setTextFragmenter(fragmenter);
    	high.setTextFragmenter(fragment);
    	
        for(int i=0; i<hits.length(); i++){
        	Document doc=hits.doc(i);
        	String lns = doc.get("LINES");
            TokenStream lines = analyser.tokenStream("LINES", new StringReader(lns));
            CachingTokenFilter filter = new CachingTokenFilter(lines);
            String highlightedLines = highlighter.getBestFragment(filter, lns);
            filter.reset();
            String highlight = high.getBestFragment(filter, lns);
            SearchResult resultBean = new SearchResult();
        	resultBean.setNarrator(hits.doc(i).get("SPEAKER"));
        	resultBean.setHitResult(highlight);
        	resultBean.setQuote(highlightedLines);
        	searchResult.add(resultBean);
        	System.out.println(resultBean.getNarrator());
         	System.out.println(resultBean.getHitResult());
         	System.out.println("");
        	System.out.println(resultBean.getQuote());
        	System.out.println("");
        }
        
        System.err.println("Found " + hits.length() + " document(s)(in " + (end-start) + " milliseconds) that matched query '" + "':"); 
        
        return searchResult;
    }
} 



-- 
View this message in context: http://www.nabble.com/Creating-an-index-from-an-XML-file-using-Lucene-in-Java-tp18678779p18678779.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

