lucene-solr-user mailing list archives

From "Tarala, Magesh" <MTar...@bh.com>
Subject RE: Solr cloud error during document ingestion
Date Sun, 12 Jul 2015 20:56:11 GMT
I'm using SolrJ to ingest the documents, but I'm using only one client right now.

Yes Erick, it is weird. You are right: the string is already UTF-8, so even if I convert it
explicitly I get back the same string, and the same issue... I'm still stumped :(
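To illustrate why the explicit conversion is a no-op: a Java String has no encoding of its own, and UTF-8 only enters the picture when the string is turned into bytes, so an encode/decode round trip returns an equal string. A quick sketch (the sample text stands in for the string Tika extracts):

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String extracted = "da\u00f1os"; // "daños", as extracted by Tika

        // "Converting to UTF-8" can only mean encode-then-decode;
        // for any well-formed String this round trip is a no-op.
        byte[] utf8 = extracted.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(extracted.equals(roundTripped)); // prints true
    }
}
```

So whatever is rejecting the ñ, it isn't the client-side string itself; the bytes on the wire (or the receiving node's handling of them) are the place to look.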

The code is straightforward. I am using Tika and SolrJ:
I create a Tika AutoDetectParser, get a BodyContentHandler, wrap it in a SafeContentHandler,
and get the extracted content as a string.
Then I pass the string to a SolrInputDocument.
The error occurs when the SolrInputDocument is sent to Solr.

One other piece of info: we are creating nested documents.
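For reference, a minimal sketch of the pipeline described above, assuming the Solr 4.x SolrJ API (CloudSolrServer); the zkHost, collection name, field names, and ids are placeholders, not our actual setup:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.SafeContentHandler;

public class IngestSketch {
    public static void main(String[] args) throws Exception {
        // Extract the body text with Tika
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler body = new BodyContentHandler(-1);   // -1 = no write limit
        SafeContentHandler safe = new SafeContentHandler(body); // filters invalid XML chars
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(args[0])) {
            parser.parse(in, safe, metadata);
        }
        String content = body.toString();

        // Build the parent document and attach a nested child
        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "doc-1");
        parent.addField("content", content);

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "doc-1-att-1");
        parent.addChildDocument(child);

        // Send to the SolrCloud collection
        CloudSolrServer server = new CloudSolrServer("zkhost:2181");
        server.setDefaultCollection("serviceorder");
        server.add(parent);
        server.commit();
        server.shutdown();
    }
}
```

The failure we see happens at the server.add() step, when the document reaches Solr.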


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Sunday, July 12, 2015 1:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr cloud error during document ingestion

How are you ingesting documents? ExtractingRequestHandler? That puts all the work on the
Solr node(s). You might want to consider using SolrJ instead, as that gives you much more
control as well as the ability to farm out the work to N clients.

Another blog:
https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

P.S. Glad you found the problem, but it's a little weird. Solr already talks UTF-8 so this
should "just work", but then I'm not familiar with all the details of your setup.



On Sun, Jul 12, 2015 at 10:11 AM, Tarala, Magesh <MTarala@bh.com> wrote:
> I narrowed down the cause. And it is a character issue!
>
> The .msg file content I'm extracting using the Tika parser has this text
> (daños). If I remove the character n with the tilde, it works.
>
> Should I explicitly convert to UTF-8 before sending it to Solr?
>
> Erick - I'm in the QA phase. I'll be ingesting around 800K documents total (Word, PDF,
> Excel, .msg, txt, etc.). For now I'm considering daily updates when we first go to prod at
> the end of the month, i.e., capture all the new and modified documents on a daily basis and
> update Solr. Once we get a grasp of things, we want to go near real time. Thanks for the
> link to your post. It is very helpful.
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Sunday, July 12, 2015 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr cloud error during document ingestion
>
> Probably not related to your problem, but if you're sending lots of docs at Solr,
> committing every 100 is very aggressive. I'm assuming you're committing from the client,
> which, while OK, doesn't scale very well if you ever decide to have more than one client
> sending docs.
>
> I'd recommend setting your hard commit interval to a minute or so and just leaving it at
> that if possible, with soft commits to make the docs visible.
>
> Here's more than you ever wanted to know about soft commits, hard commits and such:
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Best,
> Erick
>
> On Sun, Jul 12, 2015 at 8:40 AM, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:
>> I suggest checking the
>> http://10.222.238.35:8983/solr/serviceorder_shard1_replica2
>> <http://10.222.238.35:8983/solr/serviceorder_shard1_replica2/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F10.222.238.36%3A8983%2Fsolr%2Fserviceorder_shard2_replica1%2F&wt=javabin&version=2>
>> logs to find the root cause.
>>
>> On Sun, Jul 12, 2015 at 6:33 AM, Tarala, Magesh <MTarala@bh.com> wrote:
>>
>>> I'm using Solr 4.10.2 in a 3-node SolrCloud setup. I have a collection
>>> with 3 shards and 2 replicas each.
>>> I'm ingesting documents via SolrJ.
>>>
>>> While ingesting the documents, I get the following error:
>>>
>>> 264147944 [updateExecutor-1-thread-268] ERROR 
>>> org.apache.solr.update.StreamingSolrServers  ? error
>>> org.apache.solr.common.SolrException: Bad Request
>>>
>>> request:
>>> http://10.222.238.35:8983/solr/serviceorder_shard1_replica2/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F10.222.238.36%3A8983%2Fsolr%2Fserviceorder_shard2_replica1%2F&wt=javabin&version=2
>>>         at
>>> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>         at java.lang.Thread.run(Thread.java:745)
>>>
>>> I commit after every 100 documents in solrj.
>>> And I also have the following solrconfig.xml setting:
>>>      <autoCommit>
>>>        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>>>        <openSearcher>false</openSearcher>
>>>      </autoCommit>
>>>
>>>
>>> IMO, tlogs (for serviceorder_shard1_replica2) are not too big
>>> -rw-r--r-- 1 solr users  8338 Jul 11 21:40 tlog.0000000000000000364
>>> -rw-r--r-- 1 solr users  6385 Jul 11 21:40 tlog.0000000000000000365
>>> -rw-r--r-- 1 solr users 10221 Jul 11 21:41 tlog.0000000000000000366
>>> -rw-r--r-- 1 solr users  5981 Jul 11 21:41 tlog.0000000000000000367
>>> -rw-r--r-- 1 solr users  2682 Jul 11 21:41 tlog.0000000000000000368
>>> -rw-r--r-- 1 solr users  8515 Jul 11 21:42 tlog.0000000000000000369
>>> -rw-r--r-- 1 solr users  7373 Jul 11 21:42 tlog.0000000000000000370
>>> -rw-r--r-- 1 solr users  6907 Jul 11 21:42 tlog.0000000000000000371
>>> -rw-r--r-- 1 solr users  5524 Jul 11 21:42 tlog.0000000000000000372
>>> -rw-r--r-- 1 solr users  5600 Jul 11 21:43 tlog.0000000000000000373
>>>
>>>
>>> So far I've not been able to resolve this issue. Any ideas / 
>>> pointers would be greatly appreciated!
>>>
>>> Thanks,
>>> Magesh
>>>
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>> <mkhludnev@griddynamics.com>