manifoldcf-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: SOLR
Date Tue, 15 Mar 2011 02:34:26 GMT

It's not the trunk version; I use (different) trunk versions at a few production
sites... In SOLR, the path "/update" is defined in solrconfig.xml (usually the
user copies it from the "example" configuration and may modify it):

  <requestHandler name="/update" 
                  class="solr.XmlUpdateRequestHandler">
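
A handler registered this way accepts a POST of a SOLR XML update message. As a point of comparison, the sketch below builds such a message body (an <add> element wrapping one <doc>). The class and method names are hypothetical and come from neither ManifoldCF nor SOLR; the field names are only examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of ManifoldCF or SOLR. */
final class XmlUpdateSketch {

  /** Escape the characters that are special in XML text and attribute values. */
  static String escape(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
  }

  /** Build an <add><doc>...</doc></add> message from field name/value pairs. */
  static String buildAddMessage(Map<String, String> fields) {
    StringBuilder sb = new StringBuilder("<add><doc>");
    for (Map.Entry<String, String> e : fields.entrySet()) {
      sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
        .append(escape(e.getValue()))
        .append("</field>");
    }
    return sb.append("</doc></add>").toString();
  }

  public static void main(String[] args) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("id", "doc-1");                 // example field names only
    fields.put("title", "Rock & <HR> roll");   // markup in values gets escaped
    System.out.println(buildAddMessage(fields));
  }
}
```

The resulting string would be sent as the POST body with an XML content type; the HTTP call itself is left out so the sketch stays self-contained.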


And which kind of "update" handler does ManifoldCF expect?

That's why I suggest using the SOLRJ API instead... I noticed a lot of
low-level coding...



What kind of SOLR protocol is expected? It is definitely not a POST of XML
content:


  /** Write a field */
  protected static void writeField(OutputStream out, String fieldName,
    String fieldValue)
    throws IOException
  {
    writePreamble(out);
    writeBoundary(out,"text/plain; charset=UTF-8",fieldName,null);
    
    byte[] tmp = fieldValue.getBytes("UTF-8");
    out.write(tmp, 0, tmp.length);
    writePostamble(out);
  }
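
The writeBoundary call and the "text/plain; charset=UTF-8" content type suggest that each field is written as one part of a multipart/form-data body. That is an assumption: writePreamble, writeBoundary and writePostamble are not shown here. Under that assumption, a minimal sketch of what one field looks like on the wire could be the following; the class name, method signatures and boundary string are hypothetical and are not the HttpPoster code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the ManifoldCF HttpPoster implementation. */
final class MultipartFieldSketch {

  /** Write one text field as a single multipart/form-data part. */
  static void writeField(OutputStream out, String boundary,
                         String fieldName, String fieldValue) throws IOException {
    // Part header: delimiter line, disposition, content type, blank line.
    String header = "--" + boundary + "\r\n"
        + "Content-Disposition: form-data; name=\"" + fieldName + "\"\r\n"
        + "Content-Type: text/plain; charset=UTF-8\r\n"
        + "\r\n";
    out.write(header.getBytes(StandardCharsets.US_ASCII));
    // Part body: the field value as UTF-8 bytes, then the line break
    // that precedes the next delimiter.
    out.write(fieldValue.getBytes(StandardCharsets.UTF_8));
    out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
  }

  /** Write the closing delimiter that ends the whole multipart body. */
  static void writeEnd(OutputStream out, String boundary) throws IOException {
    out.write(("--" + boundary + "--\r\n").getBytes(StandardCharsets.US_ASCII));
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    String boundary = "sketch-boundary"; // arbitrary; must not occur in the data
    writeField(out, boundary, "title", "Hello");
    writeEnd(out, boundary);
    System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
  }
}
```

A body like this is a form upload rather than an XML update message, so it would need a handler that reads multipart form parts rather than one that parses the body as XML.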



Do you expect the "binary" handler on SOLR?
  <!-- Binary Update Request Handler
       http://wiki.apache.org/solr/javabin
    -->
  <requestHandler name="/update/javabin" 
                  class="solr.BinaryUpdateRequestHandler" />






-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com] 
Sent: March-14-11 7:58 PM
To: connectors-user@incubator.apache.org
Subject: Re: SOLR

The trunk version of Solr may have changed around how the extracting update
request handler works.  It changes daily, so there is no way I can keep up
with it.  Maybe it would be better to go back and use a known quantity.

Thanks,
Karl


On Mon, Mar 14, 2011 at 6:24 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>
> Default settings for ManifoldCF: /update/extract
> http://localhost:8080/solr/update/extract?commit=true
>
> And using a browser, I see SOLR respond with malformed HTML containing
> an unclosed <HR>...
>
> Fix:
> Update handler:  /update
>
>
> -Fuad
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: March-14-11 6:17 PM
> To: connectors-user@incubator.apache.org
> Subject: RE: SOLR
>
> Hi Karl,
>
> I verified (via browser),
> http://localhost:8080/solr/update?commit=true
>
> And response from SOLR:
> <?xml version="1.0" encoding="UTF-8"?> <response> <lst 
> name="responseHeader"><int name="status">0</int><int 
> name="QTime">15</int></lst> </response>
>
> The root of the problem is
> org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>
>
> Everything is fine except I can't understand why we get "HR" from
> SOLR. Do we have any multithreading issues? I believe I connect to
> SOLR; port 8080 is configured via the console... maybe somewhere else?
>
> I believe the default setting for "Update handler:" on the Connector screen
> is incorrect; it is /update/extract
>
>
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddywri@gmail.com]
> Sent: March-14-11 6:00 PM
> To: connectors-user@incubator.apache.org
> Subject: Re: SOLR
>
> This is because your solr setup is incorrect.  The post to "solr" is 
> returning HTML, not XML, so you are not actually communicating with 
> Solr at all.
>
> In order for the Solr connector to work, you need to have the solr 
> extracting update request handler present and configured.  I am told 
> that the latest release of Solr makes the jar with this code optional
> - it's a contrib jar that you have to separately download.  If you are 
> building solr off of trunk, then this should not be a problem.
>
> Karl
>
> On Mon, Mar 14, 2011 at 5:40 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>> This exception: the XML contains encoded HTML, and it doesn't happen with
>> the standard Java 6 StAX parser:
>>
>> [Fatal Error] :124:120: The element type "HR" must be terminated by the matching end-tag "</HR>".
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster.getResponse(HttpPoster.java:619)
>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster$CommitThread.run(HttpPoster.java:1658)
>> Caused by: org.xml.sax.SAXParseException: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:365)
>>        ... 3 more
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Fuad Efendi [mailto:fuad@efendi.ca]
>> Sent: March-14-11 5:37 PM
>> To: connectors-user@incubator.apache.org
>> Subject: RE: SOLR
>>
>> Thank you very much Karl,
>>
>> And I have a first problem:
>> Starting crawler...
>> [Fatal Error] :124:120: The element type "HR" must be terminated by the matching end-tag "</HR>".
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: The element type "HR" must be terminated by the matching end-tag "</HR>".
>>        at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:369)
>>        at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:317)
>>
>> I am using the RSS connector to crawl specific XML (containing
>> XML-encoded &lt;HR&gt; and other HTML tags). It doesn't happen with the
>> standard StAX parser (Java 6)...
>>
>>
>> Regarding (2), do you mean this interface method?
>>  /** View specification.
>>  * This method is called in the body section of a job's view page.  Its purpose is to present the output specification information to the user.
>>  * The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.
>>  *@param out is the output to which any HTML should be sent.
>>  *@param os is the current output specification for this job.
>>  */
>>  public void viewSpecification(IHTTPOutput out, OutputSpecification os)
>>    throws ManifoldCFException, IOException
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Karl Wright [mailto:daddywri@gmail.com]
>> Sent: March-14-11 5:21 PM
>> To: connectors-user@incubator.apache.org
>> Subject: Re: SOLR
>>
>> Hi Fuad,
>>
>> (1) "Arguments" are indeed optional key/value pairs, which are sent 
>> to solr as part of the URL.
>> (2) ManifoldCF presents tabs for a job of three kinds: (a) tabs that 
>> all jobs have; (b) tabs related to the repository connector's 
>> management of the document specification information; and (c) tabs 
>> related to the output connector's output specification information.
>> The Solr output connector's output specification information includes 
>> the metadata to solr mapping, so those tabs come from the Solr connector.
>>
>> Karl
>>
>>
>> On Mon, Mar 14, 2011 at 4:51 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>>> Hi, is there any sample of how to use the SOLR connector?
>>>
>>> http://incubator.apache.org/connectors/end-user-documentation.html#solroutputconnector
>>>
>>>
>>>
>>> Some questions:
>>>
>>>
>>>
>>> 1.       Argument. Are these optional key=value pairs that can be sent
>>> to SOLR as part of an HTTP GET/POST request?
>>>
>>> 2.       I see code for “Connector”, and I see how to configure the SOLR
>>> Output Connection. But how does a “Job” know about the <metadata> to
>>> <solr> mapping? Is it generic (without a dependency on SOLR)?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Fuad
>>
>>
>
>

