manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
Date Tue, 12 Aug 2014 16:25:10 GMT
bq. I have a question.
What is this? -> hard-coded MIME type checks, "text/plain;charset=utf-8".
For what? This seems to be unnecessary.
http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99


Hi Abe-san,

The idea is that the Tika extractor always confirms that the downstream
pipeline accepts text/plain;charset=utf-8 because that is what it always
outputs.  On the upstream side, we should technically only accept documents
that Tika knows how to extract.  Right now, we accept all kinds, because I
don't know what that list is.
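
A rough sketch of that check, for illustration only. The names below (MimeTypeCheckSketch, Downstream, acceptsMimeType) are hypothetical stand-ins and not the actual ManifoldCF interfaces or the code in TikaExtractor.java; the only detail taken from the connector is the fixed output type "text/plain;charset=utf-8".

```java
// Illustrative sketch only: these types are NOT the real ManifoldCF API.
class MimeTypeCheckSketch {

    /** Hypothetical stand-in for whatever pipeline stage sits downstream of the extractor. */
    public interface Downstream {
        boolean acceptsMimeType(String mimeType);
    }

    /** The extractor always emits plain text encoded as UTF-8. */
    public static final String OUTPUT_MIME_TYPE = "text/plain;charset=utf-8";

    /**
     * Decide whether a document of the given incoming type should enter the extractor.
     * The incoming type is deliberately ignored (every kind is accepted upstream),
     * so the answer depends only on whether downstream takes the fixed output type.
     */
    public static boolean checkMimeTypeIndexable(String incomingMimeType, Downstream downstream) {
        return downstream.acceptsMimeType(OUTPUT_MIME_TYPE);
    }

    public static void main(String[] args) {
        // A downstream stage that only indexes plain text.
        Downstream textOnly = type -> type.startsWith("text/plain");
        System.out.println(checkMimeTypeIndexable("application/pdf", textOnly));
    }
}
```

If the list of types Tika can extract were known, the ignored incomingMimeType argument is where an upstream filter would go.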

Karl




On Tue, Aug 12, 2014 at 12:20 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:

> Hi Karl,
>
> I also tested with the SJIS file attached to CONNECTORS-613: the file was
> not garbled, and the Tika connector extracted its content and metadata
> properly.
> So we currently don't need to respin the RC.
>
> I have a question.
> What is this? -> hard-coded MIME type checks, "text/plain;charset=utf-8".
> For what? This seems to be unnecessary.
>
> http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
>
> http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
>
> Thanks,
> Shinichiro Abe
>
>
> 2014-08-13 1:09 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>
> > Ok, I closed the ticket.
> >
> > So thanks, I think I'm now ready to vote +1.
> >
> > Karl
> >
> >
> >
> > On Tue, Aug 12, 2014 at 11:38 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> >
> > > I apologize for the mistake: I forgot to configure the Tika connector in
> > > the job. I configured the documentFilter and Metadata adjuster only.
> > > It works once the Tika connector is added, so there is no problem. English
> > > PDF and Japanese PDF/XLS files are not garbled!
> > > I'm sorry! So we don't have to fix CONNECTORS-1008.
> > >
> > > Shinichiro Abe
> > >
> > >
> > > 2014-08-13 0:24 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > >
> > > > Ok, I've done some more experimentation, and confirmed that there is
> > > > really only ONE problem, and it is in SolrJ or Solr.  ManifoldCF is
> > > > working perfectly.
> > > >
> > > > The ticket I created, CONNECTORS-1008, will therefore be postponed to
> > > > MCF 2.0.  The workaround is to use the extracting update handler even
> > > > when the content has already been extracted on the MCF side.  So we
> > > > should open a SOLR ticket, but there is no reason to respin the MCF
> > > > release.
> > > >
> > > > Karl
> > > >
> > > >
> > > >
> > > > On Tue, Aug 12, 2014 at 10:18 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > >
> > > > > > So there are two problems.  One problem is that the Tika Extractor is
> > > > > > not doing the right thing (I think).  The second problem is that valid
> > > > > > characters are not being sent to Solr when SolrInputDocument is used.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Aug 12, 2014 at 10:15 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > > >
> > > > >> Thanks Karl,
> > > > >>
> > > > > >> When posting MCF's end-user-documentation.pdf (English) via the
> > > > > >> standard update handler, Solr throws an exception. This is a problem,
> > > > > >> and I'm not sure why.
> > > > > >> It works if I leave Tika in my pipeline and use the extracting
> > > > > >> update handler.
> > > > > >> Solr's Tika version matches MCF's Tika version (1.5).
> > > > >>
> > > > >>
> > > > >>
> > > > >> 2014-08-12 23:10 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > > > >>
> > > > > >> > It looks like the Tika content extraction is not actually producing
> > > > > >> > valid utf-8.  I'm not sure what it is producing, but that is the
> > > > > >> > underlying problem.
> > > > >> >
> > > > >> > I'll create a ticket and look into it.
> > > > >> >
> > > > >> > Karl
> > > > >> >
> > > > >> >
> > > > >> >
> > > > > >> > On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi Abe-san,
> > > > >> > >
> > > > > >> > > It looks to me like SolrJ when it uses SolrInputDocument cannot
> > > > > >> > > correctly post some kinds of characters.  The exception is coming
> > > > > >> > > from inside Solr itself -- not SolrJ.  So I think a Solr ticket
> > > > > >> > > would be the right thing to do here.
> > > > > >> > >
> > > > > >> > > Can you try leaving your pipeline to include Tika, but changing
> > > > > >> > > your Solr connection to go back to using the extracting update
> > > > > >> > > handler?  If that works, then I think we have correctly diagnosed
> > > > > >> > > the problem.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Karl
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > > >> > > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > > >> > >
> > > > >> > >> Hi Karl,
> > > > >> > >>
> > > > > >> > >> The content field was garbled when posting via /update with the
> > > > > >> > >> Tika connector.
> > > > > >> > >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
> > > > > >> > >> My MCF job went from a filesystem (Japanese PDF, XLS) to Solr.
> > > > > >> > >>
> > > > > >> > >> I was surprised that Solr threw an exception when the en_US
> > > > > >> > >> end-user-documentation.pdf was posted via the Tika connector.
> > > > > >> > >> Files posted via /update/extract were not garbled and did not
> > > > > >> > >> throw exceptions.
> > > > > >> > >> Could you reproduce this?
> > > > >> > >>
> > > > > >> > >> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter
> > > > > >> > >>  – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > >> > >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> > > > > >> > >> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> > > > > >> > >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > > > > >> > >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > > >> > >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> > > > > >> > >> ...
> > > > > >> > >> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> > > > > >> > >> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> > > > > >> > >> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> > > > > >> > >> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> > > > > >> > >> ... 36 more
> > > > >> > >>
> > > > >> > >> Thanks,
> > > > >> > >> Shinichiro Abe
> > > > >> > >>
> > > > >> > >>
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> 2014-08-12 22:24 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > > > >> > >>
> > > > > >> > >> > I ran "ant rat-sources", and inspected the packages.  All looks
> > > > > >> > >> > good.  The only comment is that the connector-lib area has grown
> > > > > >> > >> > by about 18MB this cycle, and of course all the images for the
> > > > > >> > >> > Chinese documentation add another 5MB, so our binary packages are
> > > > > >> > >> > now just about 200MB.  I don't think this is something we can do
> > > > > >> > >> > a lot about, though, except maybe by repackaging so we release
> > > > > >> > >> > connectors independently of the framework.
> > > > > >> > >> >
> > > > > >> > >> > I'll give a final vote after I hear more back from Erlend and
> > > > > >> > >> > Abe-san.
> > > > >> > >> > Thanks,
> > > > >> > >> > Karl
> > > > >> > >> >
> > > > >> > >> >
> > > > > >> > >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > > >> > >> >
> > > > > >> > >> > > I request that the vote be left open at least until 8/21/2014,
> > > > > >> > >> > > since 1.7 is a major release and we want as many people to try
> > > > > >> > >> > > it out as possible before declaring it complete.  Thanks!
> > > > >> > >> > >
> > > > >> > >> > > Karl
> > > > >> > >> > >
> > > > >> > >> > >
> > > > >> > >> > >
> > > > > >> > >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > > >> > >> > >
> > > > >> > >> > >> Hi,
> > > > >> > >> > >>
> > > > >> > >> > >> +1 from me.
> > > > >> > >> > >>
> > > > > >> > >> > >> -Checked SIGS, checksum by running check_signatures.sh.
> > > > > >> > >> > >> -Checked that the code signing Key of Mingchun is available online.
> > > > >> > >> > >>
> > > > >> > >> > >> Shinichiro Abe
> > > > >> > >> > >>
> > > > > >> > >> > >> On 2014/08/12, at 12:13, Mingchun Zhao <mingchun.zhao.2@gmail.com> wrote:
> > > > >> > >> > >>
> > > > >> > >> > >> > Hi all,
> > > > >> > >> > >> >
> > > > > >> > >> > >> > Please vote on whether to release ManifoldCF, version 1.7, RC0.
> > > > >> > >> > >> >
> > > > >> > >> > >> > You can find the artifact at:
> > > > >> > >> > >> >
> > > > >> > >> > >> >
> > > > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
> > > > >> > >> > >> >
> > > > >> > >> > >> > There is also a tag at:
> > > > >> > >> > >> >
> > > > >> > >> > >> >
> > > > >> https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
> > > > >> > >> > >> >
> > > > > >> > >> > >> > Vote will remain open at least 72 hours.
> > > > >> > >> > >> >
> > > > >> > >> > >> > Thanks!
> > > > >> > >> > >> > Mingchun Zhao
> > > > >> > >> > >>
> > > > >> > >> > >>
> > > > >> > >> > >
> > > > >> > >> >
> > > > >> > >>
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> --
> > > > >> > >> - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - -
> - -
> > > > >> > >> Shinichiro Abe
> > > > >> > >> 阿部 慎一朗
> > > > >> > >>
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
> > > > >> Shinichiro Abe
> > > > >> 阿部 慎一朗
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > > Shinichiro Abe
> > > 阿部 慎一朗
> > >
> >
>
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Shinichiro Abe
> 阿部 慎一朗
>
