lucene-general mailing list archives

From Alex Ott <>
Subject Re: Feasability
Date Sun, 04 Dec 2016 12:30:33 GMT
I would recommend using Apache Tika if you need to extract text from files
of different types.  Going to PDFBox or POI is only required if you need to
dig into the internals of these file formats; if you only need the text,
Tika will be the easier choice.
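
Once the text is extracted, the "find these words in that document" step
that Chris describes is small in itself. Below is a rough sketch in plain
Java, assuming the extracted text is already available as a String (for
example from Tika). The class name WordFinder and the method findTerms are
made up for illustration; for real document collections Solr or
Elasticsearch would do this job, not an in-memory loop like this one.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika, Lucene, or Solr.
// Naive in-memory check: which of the given terms occur in the extracted text?
class WordFinder {

    // Splits the text into lowercase words (runs of letters and digits) and
    // returns the terms found among them, in the order the terms were given.
    static Set<String> findTerms(Collection<String> terms, String text) {
        Set<String> words = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> found = new LinkedHashSet<>();
        for (String term : terms) {
            String t = term.toLowerCase(Locale.ROOT);
            if (words.contains(t)) {
                found.add(t);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: this string stands in for text pulled out of a PDF or
        // spreadsheet by an extraction library.
        String extracted = "Invoice 2016-12: payment received from ACME.";
        // prints [invoice, acme]
        System.out.println(findTerms(List.of("invoice", "refund", "acme"), extracted));
    }
}
```

This only matches whole single words, case-insensitively; phrases, stemming
and ranking are exactly what a search engine adds on top.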

On Sun, Dec 4, 2016 at 4:01 AM, Ted Dunning <> wrote:

> On Thu, Dec 1, 2016 at 12:33 PM, Chris Manu <>
> wrote:
> > Thank you for responding. So, theoretically, I would need to hire someone
> > with Apache programming experience to do this correctly (given that I know
> > nothing about programming)? What type of experience should I look for?
> >
> Chris,
> In addition to the Solr recommendation that you are hearing (which is a
> fine one), you should expand your search to include Elasticsearch.
> Elasticsearch is based on Apache software, but is not itself an Apache
> project for the overall system.
> What you describe (pulling words from one place, finding them in another)
> is very doable with Apache software.
> In addition to the search function, you should look at the PDFBox project
> for extracting data from PDF files. The Apache POI project has software
> that will help you get data from Excel files.

With best wishes,                    Alex Ott
Twitter: alexott_en (English), alexott (Russian)
