lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Managing ZIP files inside ZIP files
Date Wed, 04 Nov 2015 15:40:16 GMT
How are you ingesting them now?

I'd probably use Java 8 with SolrJ and the new virtual file system approach
to read entries right out of the zip. That provider does not cover gzip,
but java.util.zip.GZIPInputStream reads gzip as a plain stream.
http://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystems.html#newFileSystem-java.nio.file.Path-java.lang.ClassLoader-
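
To make the zip part concrete, here is a minimal sketch that only uses the JDK. The class name ZipFsReader and the method readEntries are made up for illustration, and it assumes the entries are small UTF-8 text files; in practice you would hand each entry to your parser and then to SolrJ instead of building strings.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class ZipFsReader {

    /** Returns one "path=content" string per regular file found inside the zip. */
    static List<String> readEntries(Path zipFile) throws IOException {
        List<String> out = new ArrayList<>();
        // The cast selects the (Path, ClassLoader) overload; newer JDKs add other overloads.
        try (FileSystem zipFs = FileSystems.newFileSystem(zipFile, (ClassLoader) null);
             Stream<Path> walk = Files.walk(zipFs.getPath("/"))) {
            List<Path> files = walk.filter(Files::isRegularFile)
                                   .sorted()
                                   .collect(Collectors.toList());
            for (Path p : files) {
                // Assumption: entries are small UTF-8 text, so reading them whole is fine.
                String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                out.add(p + "=" + text);
            }
        }
        return out;
    }
}
```

Nothing is unpacked to disk here: the zip is mounted as a file system and each entry is read through the ordinary Files API.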

Tar is a bit harder: Apache Commons Compress can read it, but probably
not in the Java 8 way. You may have to extract each entry into a memory
buffer and construct a file from that.

But basically both tar and gzip are streaming formats, so you should be
able to do a single pass through them with in-memory decompression.
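
A rough sketch of that single pass, using only the JDK, could look like the code below. The class NestedTarGzWalker and its methods are hypothetical, and the tar reader is a hand-rolled minimal one that I am assuming is good enough for plain entries: it reads the name and octal size from each 512-byte header, does not verify checksums, and ignores long-name (PAX/GNU) extensions and base-256 sizes. A real implementation would more likely use a tar library. The maxDepth parameter is my own addition to show how the '1 level deep' limit could be enforced.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; a simplified tar reader, not a full one.
final class NestedTarGzWalker {
    private static final int BLOCK = 512; // tar headers and padding use 512-byte blocks

    /** Lists entry names of a .tar.gz stream, opening nested .tar.gz entries up to maxDepth levels. */
    static List<String> list(InputStream gz, int maxDepth) throws IOException {
        List<String> out = new ArrayList<>();
        walk(new DataInputStream(new GZIPInputStream(gz)), "", maxDepth, out);
        return out;
    }

    private static void walk(DataInputStream tar, String prefix, int depth, List<String> out)
            throws IOException {
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                tar.readFully(header);
            } catch (EOFException e) {
                return; // stream ended without the usual zero-block trailer
            }
            if (header[0] == 0) {
                return; // an all-zero header block marks the end of the archive
            }
            String name = field(header, 0, 100);
            String octal = field(header, 124, 12).trim();
            int size = octal.isEmpty() ? 0 : (int) Long.parseLong(octal, 8);
            boolean regular = header[156] == '0' || header[156] == 0;

            byte[] body = new byte[size]; // entry held in memory, never written to disk
            tar.readFully(body);
            tar.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip block padding

            if (!regular) {
                continue; // directories, links, extension headers
            }
            boolean nested = name.endsWith(".tar.gz") || name.endsWith(".tgz");
            if (nested && depth > 0) {
                InputStream inner = new GZIPInputStream(new ByteArrayInputStream(body));
                walk(new DataInputStream(inner), prefix + name + "!/", depth - 1, out);
            } else {
                out.add(prefix + name); // this is where a document would be parsed and indexed
            }
        }
    }

    /** Reads a NUL-terminated ASCII header field. */
    private static String field(byte[] h, int off, int len) {
        int end = off;
        while (end < off + len && h[end] != 0) {
            end++;
        }
        return new String(h, off, end - off, StandardCharsets.US_ASCII);
    }
}
```

Calling list(stream, 1) descends into the inner archives once, while list(stream, 0) reports them as ordinary entries. Each inner archive is buffered whole, so this assumes the individual inner files fit comfortably in memory.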

Still, without knowing what you do, it is hard to tell where "slow" is
coming from.

Regards,
   Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 4 November 2015 at 09:41, Frédéric Olier <FOlier@wooxo.fr> wrote:

> Hi,
>
> I have a ZIP (tar.gz) that contains many (> 100) other tar.gz files inside.
>
> Solr takes ages to ingest the document.
>
> I’d like to know whether other users have experience with such a
> configuration and what solution they found.
>
> Is there a way to tell Solr to go ‘1 level deep’ while analysing the
> archive contents?
>
> Is that the right approach?
>
> Thanks for your response.
>
> F. OLIER.
