lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: pdf and highlighting
Date Thu, 08 Dec 2005 13:02:10 GMT
Wow, those were some great details.  But, as I hope you've seen with  
some other recent issues, things become so much clearer when you can  
isolate the issues.  This is one reason that test-driven development  
with unit tests is so amazingly helpful.  If you could isolate a  
single PDF going through the processes, individually, such as  
parsing, tokenization, indexing, and then highlighting, you will  
likely find the problem.  It is difficult, at best, to follow along  
the various layers you have created without really digging with a lot  
of time.

If you have a unit test that still fails your expectations, then a  
JUnit test and the offending PDF file would likely yield immediate  
solutions from this list.

	Erik

On Dec 8, 2005, at 6:04 AM, Sonja Löhr wrote:

>
> Hi, Eric and the other experts!
>
> I'll try to collect some code fragments.
> Many things are configurable and I wrote a Crawler for indexing,  
> but the
> rest is very close to the examples in "Lucene in Action". I hope I  
> chose the
> appropriate snippets.
>
> The analyzer I use is created once and stored in a Config object made
> available to almost every class, along with other configurable data.
>
> INDEXING:
>
> in JTidyHtmlHandler (extends CrawlDocumentHandler):
> 	// getBody() extracts the textual content under <body>
>     String body = getBody(rawDoc);
>     if(body == null) {
>     	return null;
>     }
>     setMainField(doc, body);
>
> =============================================================
> in PdfBoxPDFHandler (extends CrawlDocumentHandler):
>
>       PDFTextStripper stripper = new PDFTextStripper();
>       pddoc = new PDDocument(cosDoc);
>       docText = stripper.getText(pddoc);
> 	[...]
>       if (docText != null) {
>         	setMainField(doc, docText);
>        }
>
> =================================================================
> in CrawlDocumentHandler implements DocumentHandler (as found in Eric's
> book):
> 	public void setMainField(Document doc, String txt) {
> 		if (txt == null || txt.equals("")) return;
> 		if(conf.storeMainField()) {
> 			doc.add(Field.Text(conf.mainFieldName, txt));
> 		}
> 		else doc.add(Field.UnStored(conf.mainFieldName, txt));
>
> 	}
> ===================================================================
>
> In CrawlIndexer:
> while(crawler.hasNext()) {
>    CrawlDocumentHandler handler = getHandler(assoc, suffix, mime);
>    ...
>    doc = handler.getDocument(onlineDoc.getIn());
>     if (doc != null) {
> 	    	doc.add(Field.Keyword("url", onlineDoc.getUrl()));
> 	    	Iterator writers =
> config.getWritersForUrl(onlineDoc.getUrl()).iterator();
> 	    	while(writers.hasNext()) {
> 	    		((IndexWriter)writers.next()).addDocument(doc);
> 	    	}
> 		
> 	}
> }
> ====================================================================
>
> (I have a Set of Index Objects each storing its writer which is  
> initialised
> like this, analyzer again comes from Config:
>
> this.writer = new IndexWriter(dir, analyzer, true);
>
> =====================================================================
>
> Ok, now the index is made up with stored body text of the  
> documents, each
> analyzed with my Extension of GermanAnalyzer:
>
>
> GermanHtmlAnalyzer extends Analyzer:
> 	public TokenStream tokenStream(String fieldName, Reader reader)  {
> 		try {
> 			return new GermanAnalyzer().tokenStream(fieldName,
> resolveEntities(reader));
> 		}
> 		catch(IOException ioe) {
> 			return null;
> 		}		
> 	}
>
> ( resovleEntities returns a StringReader in which for example  
> &#252; or
> &uuml; are replaced by 'ü')
> ====================================================================== 
> ==
>
>
> SEARCH:
>
> //Here some snippets of the code that provides the JavaBeans to be  
> passed to
> some JSP page:
>
> // By now the only implementation is HtmlFragmentDisplay
> FragmentDisplay fragDisp =
> (FragmentDisplay)Class.forName(displayClassName).newInstance();
> IndexSearcher searcher = new IndexSearcher(dir);		
> Query q = MultiFieldQueryParser.parse(query, new String[]{"body",  
> "title"},
> conf.getAnalyzer());
> Hits hits = searcher.search(q);
> for( [hits to be shown to the user] ) {
> 	...
> 	if(conf.storeMainField()) {
> 		result.setFragment(fragDisp.getDisplayText(doc.get("body"),
> q));
> 	}
> 	else result.setFragment(fragDisp.getDisplayText(new
> URL(doc.get("url")), q));
> 	...
> 	results.add(result);
> }
>
> ====================================================================== 
> =====
>
> In HtmlFragmentDisplay:
>
> public String getDisplayText(String bodyText, Query query) {
>
> 	QueryScorer scorer = new QueryScorer(query);
> 	SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span
> class=\"highlighted\">","</span>");
> 	Highlighter highlighter = new Highlighter(formatter, scorer);
> 	Fragmenter fragmenter = new SimpleFragmenter(60);		
> 	highlighter.setTextFragmenter(fragmenter);
> 	Analyzer analyzer = conf.getAnalyzer();
> 	TokenStream tStream = analyzer.tokenStream("body", new
> StringReader(bodyText));
> 	return = highlighter.getBestFragments(tStream, bodyText, 4, " .....
> ");
> }
>
> (getDisplayText(URL url, Query query) fetches the document by its  
> URL, again
> uses the DocumentHandlers and finally calls the above method. I  
> switched
> from not storing the body text to storing it, but that didn't  
> affect the
> highlighting problem.
>
> ====================================================================== 
> =====
>
> So...... Result.getFragment() is what the users sees on the JSP page.
> If it happens to be taken from a JTidy-indexed Lucene document,  
> everything
> is well, if it comes from PdfBox, the wrong text is highlighted.
> I also tried with QueryParser.parse() instead of  
> MultiFieldQueryParser, but
> the output didn't change.
>
> Many many thanks if you read until here!
>
> And even more if you hava an idea where the error is likely to be  
> found.
>
> sonja
>
>
>
>
>> -----Original Message-----
>> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
>> Sent: Donnerstag, 8. Dezember 2005 10:59
>> To: java-user@lucene.apache.org
>> Subject: Re: pdf and highlighting
>>
>> Sonja,
>>
>> Do you have an example, or at least some relevant code, that
>> would help the community in helping resolve this?
>>
>> 	Erik
>>
>> On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:
>>
>>>
>>> Hi, all!
>>>
>>> I have a question concerning analysis and highlighting. I'm
>> indexing
>>> multiple document formats (up to now, only html and pdf
>> occured, and
>>> use the highlighter from the Lucene sandbox.
>>> The documents text is extracted via JTidy and PDFBox, respectively,
>>> then in both indexing and search analysed with a custom subclass of
>>> GermanAnalyzer, which replaces character references and
>> html entities.
>>>
>>> Now the funny thing is that, even if I store the body text,
>>> highlighter uses wrong positions with lucene Docs stemming from pdf
>>> documents, whereas html is hightlighted correctly.  I really don't
>>> have an explanation for this behaviour - for
>> doc.get("body") is simply
>>> text, in either case, and stop words were also removed in
>> ALL kinds of
>>> documents (and through an instance of the same analyzer passed to
>>> QueryParser.
>>>
>>> Are there any hints to where I could find my error - or did anybody
>>> else encounter the same problem?
>>>
>>> Thanks in advance!
>>>
>>> sonja
>>>
>>>
>>>
>>>
>>>
>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message