lucene-solr-user mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: Reading data using Tika to Solr
Date Thu, 25 Oct 2018 19:56:44 GMT
If you’re processing actual msg files (not eml), you’ll also need poi and
poi-scratchpad and their dependencies, but then those msgs could have
attachments, at which point you may as well just add tika-app. :D
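
A quick way to see which of these jars are actually visible at run time is
to try loading one representative class from each. The sketch below is only
an illustration: ClasspathCheck and its methods are hypothetical names, and
the three class names in main are chosen here as representatives of
tika-parsers, poi and poi-scratchpad. It uses only the JDK, so it runs with
or without Tika present. A successful load does not prove that every
transitive dependency is there, since classes referenced only inside method
bodies are resolved later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tika or Solr): reports whether classes
// can be loaded from the classpath the JVM was actually started with.
public class ClasspathCheck {

    /** Returns true if the named class can be loaded at run time. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: load the class without running static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, e.g. a missing superclass
            return false;
        }
    }

    /** Checks several class names, keeping the order they were given in. */
    public static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // The run-time classpath, which may differ from the IDE build path
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));

        // Assumed representatives: tika-parsers, poi, poi-scratchpad
        Map<String, Boolean> report = check(
                "org.apache.tika.parser.microsoft.OfficeParser",
                "org.apache.poi.poifs.filesystem.POIFSFileSystem",
                "org.apache.poi.hsmf.MAPIMessage");
        for (Map.Entry<String, Boolean> entry : report.entrySet()) {
            System.out.println((entry.getValue() ? "found    " : "MISSING  ") + entry.getKey());
        }
    }
}
```

Run it with the same classpath you use to launch the indexing program. If
a class shows as found in the IDE but missing from the command line, the
problem is the run-time classpath rather than the jars themselves.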

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) <MHQ@kmd.dk>
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the way
> through the server. It looked much more readable at my end 😉 The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven yet
> but will probably have to give that a chance. I just find it a bit odd that
> I can see the dependencies are included in the jar files I added to the
> project, but I must be missing something?
>
> My buildpath looks as follows:
>
> tika-parsers-1.4.jar
> tika-core-1.4.jar
> commons-io-2.5.jar
> httpclient-4.5.3.jar
> httpcore-4.4.6.jar
> httpmime-4.5.3.jar
> slf4j-api-1.7.24.jar
> jcl-over-slf4j-1.7.24.jar
> solr-cell-7.5.0.jar
> solr-core-7.5.0.jar
> solr-solrj-7.5.0.jar
> noggit-0.8.jar
>
>
>
> -----Original Message-----
> From: Tim Allison <tallison@apache.org>
> Sent: 25 October 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up on Erick’s point, there are a bunch of transitive dependencies
> from tika-parsers. If you aren’t using Maven or a similar build system to
> grab the dependencies, it can be tricky to get it right. If you aren’t
> using Maven, and you can afford the risks of jar hell, consider using
> tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments; your
> > png didn't come through. You might also get a more informed answer on
> > the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ) <MHQ@kmd.dk>
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read the content of msg files using Tika and index them
> > > in Solr, but I am having some problems with the OfficeParser(). I keep
> > > getting the error java.lang.NoClassDefFoundError for the OfficeParser,
> > > even though both tika-core and tika-parsers are included in the build
> > > path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws IOException, SAXException, TikaException {
> > >     // pathtofile: placeholder for the path to the .msg file
> > >     processDocument(pathtofile);
> > > }
> > >
> > > private static void processDocument(String pathfilename) {
> > >     try {
> > >         File file = new File(pathfilename);
> > >         Metadata meta = new Metadata();
> > >         InputStream input = TikaInputStream.get(file);
> > >         BodyContentHandler handler = new BodyContentHandler();
> > >         Parser parser = new OfficeParser();
> > >         ParseContext context = new ParseContext();
> > >
> > >         // Parse the file, collecting body text in handler and metadata in meta
> > >         parser.parse(input, handler, meta, context);
> > >         String doccontent = handler.toString();
> > >
> > >         System.out.println(doccontent);
> > >         System.out.println(meta);
> > >     } catch (Exception e) {
> > >         // The original message shows no catch block; one is needed to compile
> > >         e.printStackTrace();
> > >     }
> > > }
> > >
> > > In the buildpath I have the following dependencies:
> > >
> > >
> > >
> > >
> > >
> > > Any help is appreciated.
> > >
> > >
> > >
> > > Thanks in advance.
> > >
> > >
> > >
> > > Best regards,
> > >
> > >
> > >
> > > Martin Hansen
> >
>
