lucene-solr-user mailing list archives

From <Markus.Rietz...@rzf.fin-nrw.de>
Subject Re: solr cell/tika: pdf import with xml metatags
Date Tue, 27 Oct 2009 10:49:23 GMT
Thanks,
I know that page and have read it. Sending additional meta tags with the curl
call is no problem. I only thought there might be a way to use the XML approach
with PDF files as well. I'll go the "curl" way for those files.
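
For reference, a minimal sketch of the "curl" way: the ExtractingRequestHandler wiki page describes passing extra field values as literal.<fieldname> request parameters next to the uploaded file. The helper below only builds such a URL. The host, port, handler path (/update/extract), field names, and the function name build_extract_url are assumptions for illustration, not taken from our actual setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, metadata, commit=True):
    """Build an extract-handler URL that carries metadata as literal.* parameters.

    base_url and the /update/extract path are assumptions; adjust them to
    wherever the ExtractingRequestHandler is registered in solrconfig.xml.
    """
    params = [("literal.id", doc_id)]
    # Every meta tag becomes literal.<fieldname>=<value>; sorted for a stable URL.
    params += [("literal." + name, value) for name, value in sorted(metadata.items())]
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical document id and title, mirroring the fields in the XML example.
url = build_extract_url(
    "http://localhost:8983/solr",
    "4",
    {"title": "this is a pdf title"},
)
print(url)
```

The resulting URL would then be used in a multipart upload, for example curl "<url>" -F "myfile=@document.pdf", so the PDF body is extracted by Tika while the meta tags come from the CMS.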

--
Best regards

Markus Rietzler - <rietzler_software/>
Rechenzentrum der Finanzverwaltung NRW
0211/4572-2130
 

> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org] 
> Sent: Tuesday, 27 October 2009 11:43
> To: solr-user@lucene.apache.org
> Subject: Re: solr cell/tika: pdf import with xml metatags
> 
> 
> On Oct 27, 2009, at 6:36 AM, <Markus.Rietzler@rzf.fin-nrw.de> wrote:
> 
> > hi,
> >
> > we want to use Solr as our intranet search engine.
> > I downloaded the nightly build of Solr 1.4. PDF extraction is done via
> > Solr Cell/Tika. I can send the PDF to Solr via curl.
> >
> > We have a large set of meta tags for all our intranet documents,
> > including PDF, PPT, etc. To import HTML files from our CMS, I have
> > access to all of these meta tags and create an XML document, which I
> > send to Solr,
> >
> > eg.
> >
> > <?xml version='1.0' encoding='UTF-8'?>
> > <add>
> > <doc>
> > <field name="id">1</field>
> > <field name="title">this is the title</field>
> > </doc>
> > <doc>
> > <field name="id">2</field>
> > <field name="title">this is another title</field>
> > </doc>
> > <doc>
> > <field name="id">3</field>
> > <field name="title">this is the third title</field>
> > </doc>
> > </add>
> >
> > This works fine with HTML files, where I can grab all the meta tags,
> > including "body".
> >
> > So my question is: can I also use this XML document to send a PDF file?
> 
> I'm not sure what you mean here, can you clarify?  PDF and other  
> "rich" documents can't be sent by XML.
> 
> > OK, one way would be to use the extract handler with extract only
> > and put the data in the "body" field.
> 
> I guess all I can point you at right now is the wiki:  
> http://wiki.apache.org/solr/ExtractingRequestHandler
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
