manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
Date Tue, 12 Aug 2014 15:24:17 GMT
OK, I've done some more experimentation, and confirmed that there is really
only ONE problem, and it is in SolrJ or Solr.  ManifoldCF is working perfectly.

The ticket I created, CONNECTORS-1008, will therefore be postponed to MCF
2.0.  The workaround is to use the extracting update handler even when the
content has already been extracted on the MCF side.  So we should open a
SOLR ticket, but there is no reason to respin the MCF release.
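
If the extracting update handler is not an option, one possible stopgap is to
strip characters that XML 1.0 does not allow before the extracted text is put
into a SolrInputDocument.  U+FFFF, the character reported in the trace, is one
of them.  This is only a minimal sketch, not what MCF or SolrJ does: the class
and method names (XmlCharFilter, isXmlChar, stripInvalidXmlChars) are
hypothetical, and it assumes the bad characters are already present in the
extracted Java string, which has not been confirmed here.

```java
public class XmlCharFilter {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every non-XML code point dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which isXmlChar rejects, so those are dropped too.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The filtered string would then be used as the content field value in place of
the raw extracted text.  Dropping characters silently changes the indexed
content, so replacing them with a space may be preferable in some setups.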

Karl



On Tue, Aug 12, 2014 at 10:18 AM, Karl Wright <daddywri@gmail.com> wrote:

> So there are two problems.  One problem is that the Tika Extractor is not
> doing the right thing (I think).  The second problem is that valid
> characters are not being sent to Solr when SolrInputDocument is used.
>
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 10:15 AM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com> wrote:
>
>> Thanks Karl,
>>
>> When posting MCF's end-user-documentation.pdf (English) via the standard
>> update handler, Solr throws an exception. This is a problem, but I'm not
>> sure why it happens.
>> It works when I leave Tika in my pipeline and use the extracting update
>> handler.
>> Solr's Tika version matches MCF's (1.5).
>>
>>
>>
>> 2014-08-12 23:10 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>>
>> > It looks like the Tika content extraction is not actually producing valid
>> > utf-8.  I'm not sure what it is producing, but that is the underlying
>> > problem.
>> >
>> > I'll create a ticket and look into it.
>> >
>> > Karl
>> >
>> >
>> >
>> > On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <daddywri@gmail.com> wrote:
>> >
>> > > Hi Abe-san,
>> > >
>> > > It looks to me like SolrJ when it uses SolrInputDocument cannot correctly
>> > > post some kinds of characters.  The exception is coming from inside Solr
>> > > itself -- not SolrJ.  So I think a Solr ticket would be the right thing to
>> > > do here.
>> > >
>> > > Can you try leaving your pipeline to include Tika, but changing your Solr
>> > > connection to go back to using the extracting update handler?  If that
>> > > works, then I think we have correctly diagnosed the problem.
>> > >
>> > > Thanks,
>> > > Karl
>> > >
>> > >
>> > >
>> > > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <
>> > > shinichiro.abe.1@gmail.com> wrote:
>> > >
>> > >> Hi Karl,
>> > >>
>> > >> The content field was garbled when posting via /update with the Tika
>> > >> connector.
>> > >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
>> > >> My MCF job crawled Japanese PDF and XLS files from the filesystem into Solr.
>> > >>
>> > >> I was surprised that Solr threw an exception when the en_US
>> > >> end-user-documentation.pdf was posted via the Tika connector. Files posted
>> > >> via /update/extract were not garbled and did not cause exceptions.
>> > >> Could you reproduce this?
>> > >>
>> > >> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter
>> > >>  – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>> > >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>> > >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> > >> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
>> > >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> > >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>> > >> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> > >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> > >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> > >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> > >> ...
>> > >> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>> > >> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
>> > >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
>> > >> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
>> > >> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
>> > >> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
>> > >> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
>> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
>> > >> ... 36 more
>> > >>
>> > >> Thanks,
>> > >> Shinichiro Abe
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> 2014-08-12 22:24 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>> > >>
>> > >> > I ran "ant rat-sources", and inspected the packages.  All looks good.
>> > >> > The only comment is that the connector-lib area has grown by about
>> > >> > 18MB this cycle, and of course all the images for the Chinese
>> > >> > documentation add another 5MB, so our binary packages are now just
>> > >> > about 200MB.  I don't think this is something we can do a lot about,
>> > >> > though, except maybe by repackaging so we release connectors
>> > >> > independently of the framework.
>> > >> >
>> > >> > I'll give a final vote after I hear more back from Erlend and Abe-san.
>> > >> >
>> > >> > Thanks,
>> > >> > Karl
>> > >> >
>> > >> >
>> > >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > >> >
>> > >> > > I request that the vote be left open at least until 8/21/2014, since
>> > >> > > 1.7 is a major release and we want as many people to try it out as
>> > >> > > possible before declaring it complete.  Thanks!
>> > >> > >
>> > >> > > Karl
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <
>> > >> > > shinichiro.abe.1@gmail.com> wrote:
>> > >> > >
>> > >> > >> Hi,
>> > >> > >>
>> > >> > >> +1 from me.
>> > >> > >>
>> > >> > >> -Checked signatures and checksums by running check_signatures.sh.
>> > >> > >> -Checked that Mingchun's code signing key is available online.
>> > >> > >>
>> > >> > >> Shinichiro Abe
>> > >> > >>
>> > >> > >> On 2014/08/12, at 12:13, Mingchun Zhao <mingchun.zhao.2@gmail.com> wrote:
>> > >> > >>
>> > >> > >> > Hi all,
>> > >> > >> >
>> > >> > >> > Please vote on whether to release the ManifoldCF, version 1.7, RC0.
>> > >> > >> >
>> > >> > >> > You can find the artifact at:
>> > >> > >> >
>> > >> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
>> > >> > >> >
>> > >> > >> > There is also a tag at:
>> > >> > >> >
>> > >> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
>> > >> > >> >
>> > >> > >> > Vote will remain open at least 72 hours.
>> > >> > >> >
>> > >> > >> > Thanks!
>> > >> > >> > Mingchun Zhao
>> > >> > >>
>> > >> > >>
>> > >> > >
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> > >> Shinichiro Abe
>> > >> 阿部 慎一朗
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Shinichiro Abe
>> 阿部 慎一朗
>>
>
>
