lucene-solr-user mailing list archives

From Joey Hanzel <phan...@nearinfinity.com>
Subject Re: Solr ExtractingRequestHandler with Compressed files
Date Tue, 26 Oct 2010 13:52:15 GMT
Hi Jayendra,

Thanks for the suggestion. I updated to Solr 1.4.1 and Solr Cell 1.4.1 and
tried sending a zip file that contained several HTML documents.
Unfortunately, that did not solve the problem.

Here's the curl command I used:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
  -F "file=@data.zip"

When I query for id:doc1, the attr_content lists each filename within the
zip archive. It also indexed the stream_size, stream_source and
content_type.  It does not appear to be opening up the individual files
within the zip.
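
One possible client-side workaround, which nobody in this thread has
confirmed, is to unpack the archive before indexing and post each member to
the extract handler as its own document. The Python sketch below assumes the
endpoint and the field mappings that the curl command above appears to
intend. The function names and the id scheme (prefix, slash, entry name) are
hypothetical and exist only for illustration.

```python
import urllib.parse
import urllib.request
import zipfile

# Assumed endpoint, taken from the curl command in this thread.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def iter_zip_entries(zip_path):
    """Yield (entry_name, raw_bytes) for every regular file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            yield info.filename, archive.read(info)


def build_extract_url(doc_id, commit=False):
    """Build the extract URL with the parameters used in this thread."""
    params = [
        ("literal.id", doc_id),
        ("uprefix", "attr_"),
        ("fmap.content", "attr_content"),
    ]
    if commit:
        params.append(("commit", "true"))
    return EXTRACT_URL + "?" + urllib.parse.urlencode(params)


def post_entry(doc_id, data, commit=False):
    """POST one file body to Solr as a raw content stream."""
    request = urllib.request.Request(
        build_extract_url(doc_id, commit),
        data=data,  # supplying data makes this a POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_zip(zip_path, id_prefix):
    """Send each archive member as its own document, committing on the last."""
    entries = list(iter_zip_entries(zip_path))
    for position, (name, data) in enumerate(entries):
        is_last = position == len(entries) - 1
        post_entry(f"{id_prefix}/{name}", data, commit=is_last)
```

Calling index_zip("data.zip", "doc1") would index an entry named a.html
under the id doc1/a.html. This trades one archive-level document for one
document per file, which may or may not suit a given schema.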

Did you have to make any other configuration changes to your solrconfig.xml
or schema.xml to read the contents of the individual files? Would it help
to pass the specific MIME type on the curl command line?

On Mon, Oct 25, 2010 at 3:27 PM, Jayendra Patil <jayendra.patil.001@gmail.com> wrote:

> There was an issue with the previous version of Solr in which only the
> file names from the zip were indexed.
> We faced the same issue and ended up using the Solr trunk, which has the
> Tika version upgraded and works fine.
>
> Solr version 1.4.1 should also include the fix. Try using it.
>
> Regards,
> Jayendra
>
> On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel <phanzel@nearinfinity.com> wrote:
>
> > Hi,
> >
> > Has anyone had success using ExtractingRequestHandler and Tika with any of
> > the compressed file formats (zip, tar, gz, etc.)?
> >
> > I am sending Solr the archived.tar file using curl:
> >
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
> >   -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"
> >
> > When I query the document, the filenames inside the archive are indexed
> > as "body_texts", but the content of those files is not extracted or
> > included. This is not the behavior I expected. Ref:
> > http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example
> >
> > When I send one of the actual documents inside the archive using the same
> > curl command, the extracted content is stored in the "body_texts" field.
> > Am I missing a step for the compressed files?
> >
> > I have added all the extraction dependencies as indicated by mat in
> > http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
> > am able to successfully extract data from MS Word, PDF, and HTML documents.
> >
> > I'm using the following library versions:
> > Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4
> >
> > Given everything I have read, this version of Tika should support extracting
> > data from all files within a compressed file. Any help or suggestions would
> > be appreciated.
> >
>
