lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: Word / PDF document snippet rendering in search
Date Fri, 02 Mar 2018 10:05:58 GMT
On 02/03/2018 00:15, T Wild wrote:
> I'm interested in building a software system which will connect to various
> document sources, extract the content from the documents contained within
> each source, and make the extracted content available to a search engine
> such as Solr. This search engine will serve as the back-end for a web-based
> search application.
This is basically an 'enterprise search' system. You use 'connectors' to 
get text out of the source documents. In Solr applications we often use 
Apache Tika to extract text from common formats like Office or PDF. 
Apache ManifoldCF is another useful project for connecting to repositories.
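
As a rough illustration of the extraction step, here is a minimal sketch that sends a file to a Tika server and asks for plain text back. It assumes a Tika server is already running locally on its default port (9998); the function names are made up for this example, and a real connector would add error handling and then send the text on to Solr.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for plain extracted text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str) -> str:
    """Send a file to Tika and return the extracted text (needs a live server)."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.docx")` (a hypothetical file) would return the document body as a string, which you could then index in a Solr text field.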

> 
> I'm interested in rendering snippets of these documents in the search
> results for well-known types, such as Microsoft Word and PDF. How would one
> go about implementing document snippet rendering in search?

If you just want the snippets as text, you can use Solr highlighters, 
which can provide contextual snippets (i.e. chunks of text around the 
query matches).
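
To make that a bit more concrete, here is a minimal sketch of building a highlighting query and reading the snippets out of the JSON response. The core URL, the field name `content`, and the function names are assumptions for the example; `hl`, `hl.fl`, `hl.snippets` and `hl.fragsize` are the standard Solr highlighting parameters.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, query, field="content", snippets=2, fragsize=150):
    """Build a Solr /select URL that asks for highlighted snippets."""
    params = {
        "q": query,
        "hl": "true",            # turn highlighting on
        "hl.fl": field,          # field to take snippets from (assumed name)
        "hl.snippets": snippets, # max snippets per document
        "hl.fragsize": fragsize, # approximate snippet length in characters
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def snippets_by_doc(response, field="content"):
    """Map each document id to its list of snippets from a parsed JSON response."""
    highlighting = response.get("highlighting", {})
    return {doc_id: fields.get(field, []) for doc_id, fields in highlighting.items()}
```

The field you highlight on needs to be stored in Solr, and the snippets come back keyed by document id in a separate `highlighting` section of the response, so the front end has to join them with the main result list.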
> 
> I'd be happy with serving up these snippets in any format, including as
> images. I just want to be able to give my users some kind of formatted
> preview of their results for well-known types.

If, however, you want to show parts of the original documents, that's more 
difficult. You'll need to store a reference to the original document in 
Solr and use an external system to display it. Different document types 
need different systems: PDFs, for example, can be shown in various browser 
plugins. Another approach is illustrated in this open source 
code we wrote a while ago, which uses OpenOffice in 'headless' mode to 
provide images of the source document:
https://github.com/flaxsearch/flaxcode/tree/master/flax_basic/libs/previewgen
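
For a rough idea of the headless approach, here is a minimal sketch that shells out to the office suite's command-line converter. This is not how the linked previewgen code works internally; it assumes a LibreOffice or OpenOffice `soffice` binary on the PATH that supports `--convert-to`, and the function names are made up for the example. Converting to `png` typically renders only the first page, which is often enough for a preview.

```python
import subprocess
from pathlib import Path


def convert_command(source, outdir, fmt="png", binary="soffice"):
    """Build the headless conversion command (binary name is an assumption)."""
    return [
        binary,
        "--headless",
        "--convert-to", fmt,
        "--outdir", str(outdir),
        str(source),
    ]


def expected_output(source, outdir, fmt="png"):
    """Path where the converter is expected to write the result."""
    return Path(outdir) / (Path(source).stem + "." + fmt)


def render_preview(source, outdir, fmt="png"):
    """Run the conversion and return the preview path (needs soffice installed)."""
    subprocess.run(convert_command(source, outdir, fmt), check=True, timeout=120)
    return expected_output(source, outdir, fmt)
```

You would run something like this at index time, store the path of the generated image alongside the document in Solr, and have the search UI show that image next to each result.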

Hope this helps!

Cheers

Charlie
> 
> Thank you!
> 


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
