lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Re: Solr Cell Questions
Date Tue, 25 Sep 2012 13:47:34 GMT
bq: how many documents per minute, second, what ever can i put into solr

Too many variables to say. I've seen several thousand truly simple
docs/sec. But since you're doing the Tika processing that's probably
going to be your limiting factor. And it'll be many fewer...
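
For a rough baseline, the figures from the original question (about 35,000 docs in about 6 hours) work out to roughly 1.6 docs/sec. A minimal Java sketch of that arithmetic follows; the class and method names are made up for illustration and are not part of Solr, SolrJ, or Tika.

```java
import java.time.Duration;

// Hypothetical helper, not a Solr/SolrJ/Tika API.
final class Throughput {
    /** Average documents per second over the elapsed wall-clock time. */
    static double docsPerSecond(long docs, Duration elapsed) {
        return docs / (elapsed.toMillis() / 1000.0);
    }

    public static void main(String[] args) {
        // Figures quoted in the thread: about 35,000 docs in about 6 hours.
        double rate = docsPerSecond(35_000, Duration.ofHours(6));
        System.out.printf("%.2f docs/sec%n", rate);
    }
}
```

Timing a small, representative batch on your own hardware the same way is probably the only reliable way to get a number for your setup.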

I don't understand your OOM issue when running Tika on the client. Or,
rather, why you think using SolrCell makes this different. SolrCell also
uses Tika. So my suspicion is that your client-side process simply isn't
allocating much memory to the JVM. Did you try bumping the memory
on your client?
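
One way to check how much heap the client JVM actually has is to ask the runtime. The sketch below uses only the standard `Runtime.maxMemory()` call; the class name, the `fits` helper, and the safety factor of 4 are assumptions for illustration, not a rule from Tika or Solr.

```java
// Hypothetical helper, not a Solr/SolrJ/Tika API.
final class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /**
     * Crude heuristic (an assumption): does the largest file, multiplied
     * by a safety factor for parsing overhead, fit in the given heap?
     */
    static boolean fits(long largestFileBytes, int safetyFactor, long maxHeapBytes) {
        return largestFileBytes * safetyFactor <= maxHeapBytes;
    }

    public static void main(String[] args) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");

        // 100 MB is the upper file size mentioned in the thread.
        long largestFile = 100L * 1024 * 1024;
        System.out.println("100 MB file likely fits: " + fits(largestFile, 4, maxHeapBytes));
    }
}
```

If the reported heap is small, the standard `-Xmx` launcher option raises it, for example `java -Xmx2g -jar client.jar`, where `client.jar` is a placeholder name for your indexing client.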

Best
Erick

On Tue, Sep 25, 2012 at 5:23 AM,  <Johannes.Schwendinger@blum.com> wrote:
> Thank you, Erick, for your response.
>
> I've already tried what you suggested and got some out-of-memory
> exceptions. Because of this I like the solution with Solr Cell, where I can
> send the file directly to Solr as a stream and don't have to collect the
> files in my memory.
>
> And another question that came to my mind: how many documents per minute,
> second, whatever can I put into Solr? Say XML format and from 100 KB to
> 100 MB.
> Is there a number, or is it too dependent on hardware and settings?
>
>
> Best
> Johannes
>
> Erick Erickson <erickerickson@gmail.com> wrote on 25.09.2012 00:22:26:
>
>> From: Erick Erickson <erickerickson@gmail.com>
>> To: solr-user@lucene.apache.org
>> Date: 25.09.2012 00:23
>> Subject: Re: Solr Cell Questions
>>
>> If you're concerned about throughput, consider moving all the
>> SolrCell (Tika) processing off the server. SolrCell is way cool
>> for showing what can be done, but its downside is you're
>> moving all the processing of the structured documents to the
>> same machine doing the indexing. Pretty soon, especially
>> with significant size files, you're spending all your CPU cycles
>> parsing the files...
>>
>> Happens there's a blog about this:
>> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>>
>> By moving the indexing to N clients, you can increase
>> throughput until you make Solr work hard to do the indexing....
>>
>> Best
>> Erick
>>
>> On Mon, Sep 24, 2012 at 10:04 AM,  <Johannes.Schwendinger@blum.com> wrote:
>> > Hi,
>> >
>> > I'm currently experimenting with Solr Cell to index files into Solr.
>> > During this, some questions came up.
>> >
>> > 1. Is it possible (and wise) to connect to Solr Cell with multiple
>> > threads at the same time to index several documents concurrently?
>> > This question came up because my program takes about 6 hours to index
>> > around 35,000 docs. (No production environment, only the example Solr
>> > and a little desktop machine, but I think it's very slow, and I know
>> > Solr isn't the bottleneck (yet).)
>> >
>> > 2. If 1 is possible, how many threads should do this, and how much
>> > memory does Solr need? I've tried it, but I ran into an out-of-memory
>> > exception.
>> >
>> > Thanks in advance
>> >
>> > Best Regards
>> > Johannes
