manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7 RC0
Date Tue, 12 Aug 2014 17:15:12 GMT
Hi Abe-san,

Actually, for the Tika transformation connector, there are TWO different
mime types.  One mime type represents what the connector generates.  The
other represents what the connector can accept.  This is true of all
transformation connectors.
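
As a rough illustration of the two-sided check, here is a minimal sketch. It does not use the real ManifoldCF interfaces: the names Downstream, ExtractorStage and acceptsMimeType are hypothetical. It assumes only what the thread describes: the stage always emits text/plain;charset=utf-8 and, for now, accepts any input type.

```java
public class TransformationMimeTypes {
    /** Hypothetical stand-in for whatever sits downstream in the pipeline. */
    interface Downstream {
        boolean acceptsMimeType(String mimeType);
    }

    /** Hypothetical extractor-like stage: one mime type out, any mime type in. */
    static final class ExtractorStage {
        // The mime type this stage generates, as described in the thread.
        static final String OUTPUT_MIME_TYPE = "text/plain;charset=utf-8";

        private final Downstream downstream;

        ExtractorStage(Downstream downstream) {
            this.downstream = downstream;
        }

        /**
         * The input mime type is deliberately ignored here (all kinds are
         * accepted); the only check is whether the downstream stage takes
         * what this stage will emit.
         */
        boolean acceptsMimeType(String inputMimeType) {
            return downstream.acceptsMimeType(OUTPUT_MIME_TYPE);
        }
    }

    public static void main(String[] args) {
        Downstream textOnly = mt -> mt.equals(ExtractorStage.OUTPUT_MIME_TYPE);
        Downstream pdfOnly = mt -> mt.equals("application/pdf");

        // Downstream takes plain text, so a PDF can enter the extractor stage.
        System.out.println(new ExtractorStage(textOnly).acceptsMimeType("application/pdf"));
        // Downstream refuses plain text, so the extractor stage refuses the PDF too.
        System.out.println(new ExtractorStage(pdfOnly).acceptsMimeType("application/pdf"));
    }
}
```

The hard-coded string is the "generates" side; a list of types that Tika can actually extract would be the "accepts" side.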

Hope that helps.
Karl



On Tue, Aug 12, 2014 at 12:59 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:

> Ok, I understand we specify 'text/plain;charset=utf-8' string temporarily
> so that we accept all kinds of mime types.
>
> Thanks,
> Shinichiro Abe
>
>
>
> 2014-08-13 1:25 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>
> > bq. I have a question.
> > What is this? -> hard-coded mime type checks, "text/plain;charset=utf-8".
> > For what? This seems to be unnecessary.
> >
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
> >
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
> >
> >
> > Hi Abe-san,
> >
> > The idea is that the Tika extractor always confirms that the downstream
> > pipeline accepts text/plain;charset=utf-8 because that is what it always
> > outputs.  On the upstream side, we should technically only accept documents
> > that Tika knows how to extract.  Right now, we accept all kinds, because I
> > don't know what that list is.
> >
> > Karl
> >
> >
> >
> >
> > On Tue, Aug 12, 2014 at 12:20 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> >
> > > Hi Karl,
> > >
> > > > I also tested with the SJIS file attached to CONNECTORS-613:
> > > > the file was not garbled, and the Tika connector extracted its content
> > > > and metadata properly.
> > > > Therefore we currently don't need to respin the RC.
> > >
> > > > I have a question.
> > > > What is this? -> hard-coded mime type checks, "text/plain;charset=utf-8".
> > > > For what? This seems to be unnecessary.
> > > >
> > > > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
> > > >
> > > > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
> > >
> > > Thanks,
> > > Shinichiro Abe
> > >
> > >
> > > 2014-08-13 1:09 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > >
> > > > Ok, I closed the ticket.
> > > >
> > > > > So thanks, I think I'm now ready to vote +1.
> > > >
> > > > Karl
> > > >
> > > >
> > > >
> > > > > On Tue, Aug 12, 2014 at 11:38 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > >
> > > > > > I apologize for the mistake: I forgot to configure the Tika connector in
> > > > > > the job. I had configured only the documentFilter and the Metadata adjuster.
> > > > > > It works after adding the Tika connector; there is no problem. English pdf,
> > > > > > Japanese pdf/xls are not garbled!
> > > > > > I'm sorry! So we don't have to fix CONNECTORS-1008.
> > > > >
> > > > > Shinichiro Abe
> > > > >
> > > > >
> > > > > 2014-08-13 0:24 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > > > >
> > > > > > > Ok, I've done some more experimentation, and confirmed that there is really
> > > > > > > only ONE problem: in SolrJ or Solr.  ManifoldCF is working perfectly.
> > > > > >
> > > > > > > The ticket I created, CONNECTORS-1008, will therefore be postponed to MCF
> > > > > > > 2.0.  The workaround is to use the extracting update handler even when the
> > > > > > > content has already been extracted on the MCF side.  So we should open a
> > > > > > > SOLR ticket, but there is no reason to respin the MCF release.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Tue, Aug 12, 2014 at 10:18 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > > > >
> > > > > > > > So there are two problems.  One problem is that the Tika Extractor is not
> > > > > > > > doing the right thing (I think).  The second problem is that valid
> > > > > > > > characters are not being sent to Solr when SolrInputDocument is used.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 12, 2014 at 10:15 AM, Shinichiro Abe <
> > > > > > > shinichiro.abe.1@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks Karl,
> > > > > > >>
> > > > > > > >> When posting MCF's end-user-documentation.pdf (English) via the standard
> > > > > > > >> update handler, Solr throws an exception. This is a problem, and I'm not
> > > > > > > >> sure why.
> > > > > > > >> It works when I leave Tika in my pipeline and use the extracting update
> > > > > > > >> handler.
> > > > > > > >> Solr's Tika version matches MCF's Tika version (1.5).
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> 2014-08-12 23:10 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> > > > > > >>
> > > > > > > >> > It looks like the Tika content extraction is not actually producing valid
> > > > > > > >> > utf-8.  I'm not sure what it is producing, but that is the underlying
> > > > > > > >> > problem.
> > > > > > >> >
> > > > > > >> > I'll create a ticket and look into it.
> > > > > > >> >
> > > > > > >> > Karl
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > > >> > On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hi Abe-san,
> > > > > > >> > >
> > > > > > > >> > > It looks to me like SolrJ when it uses SolrInputDocument cannot correctly
> > > > > > > >> > > post some kinds of characters.  The exception is coming from inside Solr
> > > > > > > >> > > itself -- not SolrJ.  So I think a Solr ticket would be the right thing to
> > > > > > > >> > > do here.
> > > > > > > >> > >
> > > > > > > >> > > Can you try leaving your pipeline to include Tika, but changing your Solr
> > > > > > > >> > > connection to go back to using the extracting update handler?  If that
> > > > > > > >> > > works, then I think we have correctly diagnosed the problem.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Karl
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > > >> > > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > >> Hi Karl,
> > > > > > >> > >>
> > > > > > > >> > >> The content field was garbled when posting via /update and the Tika connector.
> > > > > > > >> > >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
> > > > > > > >> > >> My mcf-job was from filesystem:Japanese PDF,XLS to Solr.
> > > > > > > >> > >>
> > > > > > > >> > >> I was surprised that Solr threw an exception when
> > > > > > > >> > >> en_US end-user-documentation.pdf
> > > > > > > >> > >> was posted via the Tika connector. Files posted via /update/extract were
> > > > > > > >> > >> not garbled and did not throw exceptions.
> > > > > > > >> > >> Could you reproduce this?
> > > > > > >> > >>
> > > > > > >> > >> 2268394 [qtp1224864813-14] ERROR
> > > > > > >> > >> org.apache.solr.servlet.SolrDispatchFilter
> > > > > > >> > >>  – null:java.lang.RuntimeException:
[was class
> > > > > > >> > >> java.io.CharConversionException] Invalid
UTF-8 character
> > > 0xffff
> > > > > at
> > > > > > >> char
> > > > > > >> > >> #112515, byte #184319)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> > > > > > >> > >> at
> > > > > > >>
> > > com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> > > > > > >> > >> at
> > > > > > >>
> > > > com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> > > > > > >> > >> at
> > > > > > >>
> > > org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> > > > > > >> > >> at
> > > > > > org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > > > >> > >> at
> > org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> > > > > > >> > >> ...
> > > > > > >> > >> Caused by: java.io.CharConversionException:
Invalid UTF-8
> > > > > character
> > > > > > >> > 0xffff
> > > > > > >> > >> at char #112515, byte #184319)
> > > > > > >> > >> at
> > > > com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> > > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> > > > > > >> > >> at
> com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> > > > > > >> > >> at
> > > com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> > > > > > >> > >> at
> > > > com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> > > > > > >> > >> at
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> > > > > > >> > >> ... 36 more
> > > > > > >> > >>
> > > > > > >> > >> Thanks,
> > > > > > >> > >> Shinichiro Abe
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >> 2014-08-12 22:24 GMT+09:00 Karl Wright
<
> daddywri@gmail.com
> > >:
> > > > > > >> > >>
> > > > > > > >> > >> > I ran "ant rat-sources", and inspected the packages.  All looks good.  The
> > > > > > > >> > >> > only comment is that the connector-lib area has grown by about 18MB this
> > > > > > > >> > >> > cycle, and of course all the images for the Chinese documentation add
> > > > > > > >> > >> > another 5MB, so our binary packages are now just about 200MB.  I don't
> > > > > > > >> > >> > think this is something we can do a lot about, though, except maybe by
> > > > > > > >> > >> > repackaging so we release connectors independently of the framework.
> > > > > > > >> > >> >
> > > > > > > >> > >> > I'll give a final vote after I hear more back from Erlend and Abe-san.
> > > > > > >> > >> >
> > > > > > >> > >> > Thanks,
> > > > > > >> > >> > Karl
> > > > > > >> > >> >
> > > > > > >> > >> >
> > > > > > > >> > >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <daddywri@gmail.com> wrote:
> > > > > > >> > >> >
> > > > > > > >> > >> > > I request that the vote be left open at least until 8/21/2014, since 1.7
> > > > > > > >> > >> > > is a major release and we want as many people to try it out as possible
> > > > > > > >> > >> > > before declaring it complete.  Thanks!
> > > > > > >> > >> > >
> > > > > > >> > >> > > Karl
> > > > > > >> > >> > >
> > > > > > >> > >> > >
> > > > > > >> > >> > >
> > > > > > > >> > >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > > > > > >> > >> > >
> > > > > > >> > >> > >> Hi,
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> +1 from me.
> > > > > > >> > >> > >>
> > > > > > > >> > >> > >> -Checked SIGS, checksum by running check_signatures.sh.
> > > > > > > >> > >> > >> -Checked that the code signing Key of Mingchun is available online.
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> Shinichiro Abe
> > > > > > >> > >> > >>
> > > > > > > >> > >> > >> On 2014/08/12, at 12:13, Mingchun Zhao <mingchun.zhao.2@gmail.com> wrote:
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> > Hi all,
> > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > Please vote on whether to release the ManifoldCF, version 1.7, RC0.
> > > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > You can find the artifact at:
> > > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
> > > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > There is also a tag at:
> > > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
> > > > > > > >> > >> > >> >
> > > > > > > >> > >> > >> > Vote will remain open at least 72 hours.
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > Thanks!
> > > > > > >> > >> > >> > Mingchun Zhao
> > > > > > >> > >> > >>
> > > > > > >> > >> > >>
> > > > > > >> > >> > >
> > > > > > >> > >> >
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >>
> > > > > > >> > >> --
> > > > > > >> > >> - - - - - - - - - - - - - - - - - - -
- - - - - - - - -
> - -
> > > - -
> > > > > > >> > >> Shinichiro Abe
> > > > > > >> > >> 阿部 慎一朗
> > > > > > >> > >>
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - -
> -
> > > > > > >> Shinichiro Abe
> > > > > > >> 阿部 慎一朗
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > Shinichiro Abe
> > > > > 阿部 慎一朗
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > > Shinichiro Abe
> > > 阿部 慎一朗
> > >
> >
>
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Shinichiro Abe
> 阿部 慎一朗
>
