manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
Date Tue, 12 Aug 2014 14:10:56 GMT
It looks like the Tika content extraction is not actually producing valid
UTF-8.  I'm not sure what it is producing, but that is the underlying
problem.

I'll create a ticket and look into it.

Karl
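
[Editorial sketch, not part of the original message.] The character named
in the stack trace quoted below, 0xffff, is outside the XML 1.0 "Char"
range, so an XML update request that carries it can be rejected by the
parser on the Solr side. One possible workaround, assuming the extracted
text is available as a Java String before it is posted, is to drop such
code points first. The class and method names here (XmlCharFilter,
isXmlChar, stripInvalidXmlChars) are hypothetical and are not ManifoldCF,
Tika, or SolrJ APIs; this is not necessarily the fix the project adopted.

```java
// Hypothetical helper: removes code points that XML 1.0 does not allow.
class XmlCharFilter {
    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every disallowed code point dropped.
    // Unpaired surrogates are dropped too, because codePointAt returns the
    // surrogate value itself, which falls outside the allowed ranges.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulated extraction output containing U+FFFF (an assumption
        // made for illustration; the real PDF content is not reproduced).
        String extracted = "abc\uFFFFdef";
        System.out.println(stripInvalidXmlChars(extracted)); // prints "abcdef"
    }
}
```

Dropping the characters silently loses information about where the
extraction went wrong, so replacing them with a space or logging the
offset may be preferable while the root cause is being investigated.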



On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Abe-san,
>
> It looks to me like SolrJ when it uses SolrInputDocument cannot correctly
> post some kinds of characters.  The exception is coming from inside Solr
> itself -- not SolrJ.  So I think a Solr ticket would be the right thing to
> do here.
>
> Can you try leaving your pipeline to include Tika, but changing your Solr
> connection to go back to using the extracting update handler?  If that
> works, then I think we have correctly diagnosed the problem.
>
> Thanks,
> Karl
>
>
>
> On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com> wrote:
>
>> Hi Karl,
>>
>> The content field was garbled when posting via /update with the Tika
>> connector.
>> Sample docs: http://www.rondhuit.com/download.html#whitepaper
>> My MCF job crawled the file system (Japanese PDF and XLS files) and
>> output to Solr.
>>
>> I was surprised that Solr threw an exception when the
>> en_US end-user-documentation.pdf
>> was posted via the Tika connector. Files posted via /update/extract were
>> not garbled and did not throw exceptions.
>> Could you reproduce this?
>>
>> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>> ...
>> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
>> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
>> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
>> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
>> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
>> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
>> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
>> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
>> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
>> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
>> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
>> ... 36 more
>>
>> Thanks,
>> Shinichiro Abe
>>
>>
>>
>>
>> 2014-08-12 22:24 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>>
>> > I ran "ant rat-sources", and inspected the packages.  All looks good.  The
>> > only comment is that the connector-lib area has grown by about 18MB this
>> > cycle, and of course all the images for the Chinese documentation add
>> > another 5MB, so our binary packages are now just about 200MB.  I don't
>> > think this is something we can do a lot about, though, except maybe by
>> > repackaging so we release connectors independently of the framework.
>> >
>> > I'll give a final vote after I hear more back from Erlend and Abe-san.
>> >
>> > Thanks,
>> > Karl
>> >
>> >
>> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <daddywri@gmail.com> wrote:
>> >
>> > > I request that the vote be left open at least until 8/21/2014, since 1.7
>> > > is a major release and we want as many people to try it out as possible
>> > > before declaring it complete.  Thanks!
>> > >
>> > > Karl
>> > >
>> > >
>> > >
>> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <
>> > > shinichiro.abe.1@gmail.com> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> +1 from me.
>> > >>
>> > >> -Checked SIGS, checksum by running check_signatures.sh.
>> > >> -Checked that the code signing Key of Mingchun is available online.
>> > >>
>> > >> Shinichiro Abe
>> > >>
>> > >> On 2014/08/12, at 12:13, Mingchun Zhao <mingchun.zhao.2@gmail.com> wrote:
>> > >>
>> > >> > Hi all,
>> > >> >
>> > >> > Please vote on whether to release ManifoldCF, version 1.7, RC0.
>> > >> >
>> > >> > You can find the artifact at:
>> > >> >
>> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
>> > >> >
>> > >> > There is also a tag at:
>> > >> >
>> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
>> > >> >
>> > >> > Vote will remain open at least 72 hours.
>> > >> >
>> > >> > Thanks!
>> > >> > Mingchun Zhao
>> > >>
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Shinichiro Abe
>> 阿部 慎一朗
>>
>
>
