manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: SOLR
Date Tue, 15 Mar 2011 08:23:47 GMT
I've done what I think is the code for CONNECTORS-168 and attached a
patch to that ticket.  Perhaps you could try it on your setup to see
if the reporting of 500 errors improves.

Karl
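
For illustration only (this is not the CONNECTORS-168 patch): a minimal Java sketch of the behaviour described in the quoted message below, where a response that fails to parse as XML is reported together with its HTTP status and raw body instead of only a stack trace. The class name `ResponseCheck` and the method `describeFailure` are hypothetical and do not exist in ManifoldCF; the sketch assumes the response body has already been read into a String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the Solr connector.
class ResponseCheck {

  /**
   * Returns null when the body is well-formed XML; otherwise returns a
   * message carrying the HTTP status, the parser error and the raw body,
   * suitable for a log warning or a history record.
   */
  static String describeFailure(int httpStatus, String body) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // Rethrow parse errors so the default handler does not print
      // "[Fatal Error] ..." to the console.
      builder.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) { }
        public void error(SAXParseException e) throws SAXParseException { throw e; }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
      });
      builder.parse(new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
      return null;
    } catch (Exception e) {
      return "HTTP " + httpStatus + ": response is not well-formed XML ("
          + e.getMessage() + "); raw response: " + body;
    }
  }

  public static void main(String[] args) {
    // An HTML error page with an unclosed <HR>, as in the thread.
    System.out.println(describeFailure(500, "<html><body><HR></body></html>"));
    // A well-formed Solr-style XML response yields null.
    System.out.println(describeFailure(200,
        "<response><lst name=\"responseHeader\"><int name=\"status\">0</int></lst></response>"));
  }
}
```

The returned string is what one would hand to the logger and to the history record; a real connector would also have to decide whether the failure should be retried.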

On Tue, Mar 15, 2011 at 3:42 AM, Karl Wright <daddywri@gmail.com> wrote:
> It is hard to tell what you are seeing here because you need to also
> mention where you are seeing it.  But it is unlikely to be a result of
> the way the POST is being done within the Solr Connector; that
> connector does not perform any XML encoding, so that is not what is
> failing.  As I think you have discovered, it sounds like the problem
> is that somewhere deep in Solr something is going wrong and a 500
> error is being returned with non-XML contents.  The Solr Connector
> attempts to parse the response as XML and fails.  I've looked at the
> code; when this happens, a stack trace is dumped to stdout (which is
> not very helpful but is better than nothing).  Ideally, the connector
> should dump the response into the log (as part of a warning), and also
> write the raw response into the history (as part of the results of the
> indexing attempt).  You would then be able to see the actual error in
> the crawler UI by getting a simple history.  I've opened a new ticket
> (CONNECTORS-168) to capture this work.
>
> Other than that, I would hazard that there is nothing
> actually wrong with the Solr connector at this time.  There is an
> outstanding Jira ticket to port it to SolrJ (CONNECTORS-19), but based
> on how unreliable Solr has been of late maybe that's not such a great
> idea at the moment.  It's certainly in wide use at this time and
> people have not found an actual problem with it.
>
>
> Thanks,
> Karl
>
>
>
> On Mon, Mar 14, 2011 at 10:49 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>
>> I just noticed:
>> Currently, default for ManifoldCF is /update/extract, which corresponds to
>> SOLR Cell request handler.
>>
>> So...
>> It is EXTREMELY generic...
>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>
>> What happens is: we submit a "field" which is an HTML snippet (inside RSS), and
>> if that snippet is malformed... SOLR responds with an error message such as
>> this:
>> <u>Unexpected character '-' (code 45) in external DTD subset; expected
>> closing '&gt;' after ENTITY declaration  at [row,col,system-id]:
>> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>>  from [row,col {unknown-source}]: [1,1]</u></p><p><b>description</b> <u>The
>> request sent by the client was syntactically incorrect (Unexpected
>> character '-' (code 45) in external DTD subset; expected closing '&gt;' after
>> ENTITY declaration  at [row,col,system-id]:
>> [81,5,&quot;http://www.w3.org/TR/html4/strict.dtd&quot;]
>>
>> And, SOLR response is malformed too, so that we have
>> [Fatal Error] :7:112: The element type "HR" must be terminated by the
>> matching end-tag "</HR>".
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>> error: The element type "HR" must be terminated by the matching end-tag
>> "</HR>"
>>
>>
>> two exceptions:
>> 1. at SOLR because of malformed HTML such as
>> <my_rss_field>&gt;bold&lt;BOLD&gt/body&lt;</my_rss_field>
>> 2. at ManifoldCF, because SOLR response is malformed
>>
>>
>> Using SOLR Cell for RSS feeds... we probably need a few types of SOLR
>> Connectors, or a single (but configurable) type; and it's much easier with
>> the SOLRJ client... including troubleshooting... otherwise we should have unit
>> tests for void writeField(OutputStream out, String fieldName, String
>> fieldValue), etc.
>>
>>
>> I want to write new "connector" for my task, based on SOLRJ...
>>
>>
>> -Fuad
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>> Sent: March-14-11 10:34 PM
>> To: connectors-user@incubator.apache.org
>> Subject: RE: SOLR
>>
>>
>> It's not the trunk version; I use (different) trunk versions at a few production
>> sites... In SOLR, the path "/update" is defined in solrconfig.xml (usually the
>> user copies it from the "example" schema and maybe modifies it):
>>
>>  <requestHandler name="/update"
>>                  class="solr.XmlUpdateRequestHandler">
>>
>>
>> And what does ManifoldCF expect: which kind of "update" handler?!!
>>
>> That's why I suggest using the SOLRJ API instead... I noticed a lot of
>> low-level coding...
>>
>>
>>
>> What kind of SOLR protocol is expected? It is definitely not POST of XML
>> content:
>>
>>
>>  /** Write a field */
>>  protected static void writeField(OutputStream out, String fieldName,
>>    String fieldValue)
>>    throws IOException
>>  {
>>    writePreamble(out);
>>    writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
>>
>>    byte[] tmp = fieldValue.getBytes("UTF-8");
>>    out.write(tmp, 0, tmp.length);
>>    writePostamble(out);
>>  }
>>
>>
>>
>> Do you expect "binary" handler on SOLR?
>>  <!-- Binary Update Request Handler
>>       http://wiki.apache.org/solr/javabin
>>    -->
>>  <requestHandler name="/update/javabin"
>>                  class="solr.BinaryUpdateRequestHandler" />
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: March-14-11 7:58 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: SOLR
>>
>> The trunk version of Solr may have changed around how the extracting update
>> request handler works.  It changes daily, so there is no way I can keep up
>> with it.  Maybe it would be better to go back and use a known quantity.
>>
>> Thanks,
>> Karl
>>
>>
>> On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>>
>>> Default settings for ManifoldCF: /update/extract
>>> http://localhost:8080/solr/update/extract?commit=true
>>>
>>> And using a browser, I see SOLR responds with malformed HTML containing
>>> an unclosed <HR>...
>>>
>>> Fix:
>>> Update handler:  /update
>>>
>>>
>>> -Fuad
>>>
>>>
>>> -----Original Message-----
>>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>>> Sent: March-14-11 6:17 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: RE: SOLR
>>>
>>> Hi Karl,
>>>
>>> I verified (via browser),
>>> http://localhost:8080/solr/update?commit=true
>>>
>>> And response from SOLR:
>>> <?xml version="1.0" encoding="UTF-8"?> <response> <lst
>>> name="responseHeader"><int name="status">0</int><int
>>> name="QTime">15</int></lst> </response>
>>>
>>> The problem root is
>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>
>>>
>>> Everything is fine except I can't understand why we get "HR" from
>>> SOLR; do we have any multithreading issues? I believe I connect to
>>> SOLR, port 8080 is configured via console... maybe somewhere else?
>>>
>>> I believe the default setting for "Update handler:" on the Connector screen is
>>> incorrect; it is /update/extract
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>> Sent: March-14-11 6:00 PM
>>> To: connectors-user@incubator.apache.org
>>> Subject: Re: SOLR
>>>
>>> This is because your solr setup is incorrect.  The post to "solr" is
>>> returning HTML, not XML, so you are not actually communicating with
>>> Solr at all.
>>>
>>> In order for the Solr connector to work, you need to have the solr
>>> extracting update request handler present and configured.  I am told
>>> that the latest release of Solr makes the jar with this code optional
>>> - it's a contrib jar that you have to separately download.  If you are
>>> building solr off of trunk, then this should not be a problem.
>>>
>>> Karl
>>>
>>> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>>> This is the exception; the XML contains encoded HTML, and it doesn't happen
>>>> with the standard Java 6 StAX parser:
>>>>
>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>> the matching end-tag "</HR>".
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>> error: The element type "HR" must be terminated by the matching
>>>> end-tag "</HR>".
>>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>>>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must
>>>> be terminated by the matching end-tag "</HR>".
>>>>        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>>>        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>>>        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>>>        ... 3 more
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>>>> Sent: March-14-11 5:37 PM
>>>> To: connectors-user@incubator.apache.org
>>>> Subject: RE: SOLR
>>>>
>>>> Thank you very much Karl,
>>>>
>>>> And I have a first problem,
>>>> Starting crawler...
>>>> [Fatal Error] :124:120: The element type "HR" must be terminated by
>>>> the matching end-tag "</HR>".
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing
>>>> error: The element type "HR" must be terminated by the matching
>>>> end-tag "</HR>".
>>>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>>>
>>>> I am using the RSS connector to crawl specific XML (containing
>>>> XML-encoded &gt;HR&lt; and other HTML tags). It doesn't happen with
>>>> the standard StAX parser (Java 6)...
>>>>
>>>>
>>>> Regarding (2), do you mean this interface method?
>>>>  /** View specification.
>>>>  * This method is called in the body section of a job's view page.
>>>>  * Its purpose is to present the output specification information to the
>>>>  * user.
>>>>  * The coder can presume that the HTML that is output from this
>>>>  * configuration will be within appropriate <html> and <body> tags.
>>>>  *@param out is the output to which any HTML should be sent.
>>>>  *@param os is the current output specification for this job.
>>>>  */
>>>>  public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>>>    throws ManifoldCFException, IOException
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Karl Wright [mailto:daddywri@gmail.com]
>>>> Sent: March-14-11 5:21 PM
>>>> To: connectors-user@incubator.apache.org
>>>> Subject: Re: SOLR
>>>>
>>>> Hi Fuad,
>>>>
>>>> (1) "Arguments" are indeed optional key/value pairs, which are sent
>>>> to solr as part of the URL.
>>>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that
>>>> all jobs have; (b) tabs related to the repository connector's
>>>> management of the document specification information; and (c) tabs
>>>> related to the output connector's output specification information.
>>>> The Solr output connector's output specification information includes
>>>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>>>> Hi, any sample of how to use SOLR connector?
>>>>>
>>>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>>>
>>>>>
>>>>>
>>>>> Some questions:
>>>>>
>>>>>
>>>>>
>>>>> 1.       Argument. Is it optional key=value pairs which can be sent
>>>>> to SOLR as part of an HTTP GET/POST request?
>>>>>
>>>>> 2.       I see code for “Connector”, and I see how to configure the SOLR
>>>>> Output Connection. But how does “Job” happen to know about the <metadata> to
>>>>> <solr> mapping; is it generic (without dependency on SOLR)?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Fuad
>>>>
>>>>
>>>
>>>
>>
>>
>
