manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Solr Extracting request handler
Date Wed, 18 Jun 2014 02:26:16 GMT
Hi Abe-san,

Near as I can tell, the major consumer of disk space is the Maven target
directories.  This is generating many tens of megabytes of temporary disk
usage for every connector.  Luckily if you use ant, this is not a problem.

Karl

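One way to check how much disk the Maven target directories take up is to total the size of every directory named "target" under the checkout. The following is only a minimal sketch and is not part of ManifoldCF; the function name target_dir_sizes and the default directory name are assumptions made for illustration.

```python
import os


def target_dir_sizes(root, dir_name="target"):
    """Map each directory called dir_name under root to its total file size in bytes."""
    sizes = {}
    for current, dirs, _files in os.walk(root):
        if dir_name in dirs:
            path = os.path.join(current, dir_name)
            total = 0
            # Sum every regular file below the matched directory.
            for sub, _subdirs, files in os.walk(path):
                for name in files:
                    full = os.path.join(sub, name)
                    if not os.path.islink(full):
                        total += os.path.getsize(full)
            sizes[path] = total
            # Do not descend again into the directory that was just measured.
            dirs.remove(dir_name)
    return sizes


def report(root):
    """Print the measured directories, largest first, with sizes in megabytes."""
    sizes = target_dir_sizes(root)
    for path, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

Calling report() on the root of a source checkout would list one line per connector build directory, which makes it easy to compare a Maven build against an ant build of the same tree.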

On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Abe-san,
>
> Tika jars are not very big:
>
> C:\wip\mcf\trunk\lib>dir tika*
>  Volume in drive C has no label.
>  Volume Serial Number is 002E-D1F0
>
>  Directory of C:\wip\mcf\trunk\lib
>
> 06/05/2014  08:21 AM           493,374 tika-core.jar
> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>                2 File(s)      1,017,051 bytes
>                0 Dir(s)  140,792,315,904 bytes free
>
> The entire lib directory is 85M:
>
> 85,156,330 bytes
>
> The built binary image is still about 185MB, I believe.  So I don't know
> why you think it is >1GB?  Temporary class files?  I don't think we can
> avoid those.
>
> I'd rather not make things more complicated than they need to be by adding
> a new required service - even though it would fit naturally with the
> connector arrangement.
>
> Karl
>
>
>
>
>
> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> shinichiro.abe.1@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Okay, I assumed Tika connector outputs files.
>> If we post the character data and metadata obtained from Tika, the
>> "/update/extract" handler can handle this (by providing params such as
>> literal.content=value&literal.metaField=foobar
>> and using a NullInputStream for the binary data, as in CONNECTORS-936).
>>
>> BTW, the trunk build size is now too big (1G+), maybe because the
>> CloudSearch connector uses Tika jars.
>> The Tika connector and the CloudSearch connector should extract text via
>> tika-server [1], and MCF should not bundle many Tika jars. What do you think?
>>
>> [1]
>> http://wiki.apache.org/tika/TikaJAXRS
>>
>> Thanks,
>> Shinichiro Abe
>>
>> On 2014/06/18, at 9:45, Karl Wright <daddywri@gmail.com> wrote:
>>
>> > Hi Abe-san,
>> >
>> > It sounds like you might be thinking that transformation connectors are
>> > like output connectors.  Just so we are clear, transformation connectors
>> > in 1.7 receive a RepositoryDocument as input, and then pass a
>> > RepositoryDocument on to the next connector in the chain.  So I don't
>> > know why .xml files would be involved.  I'd expect the Tika connector to
>> > read a binary file from one RepositoryDocument object and convert its
>> > contents to another RepositoryDocument object which would have character
>> > data and metadata only.  Would this work for your case, do you think?
>> >
>> > Karl
>> >
>> >
>> >
>> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe
>> > <shinichiro.abe.1@gmail.com> wrote:
>> >
>> >> Hi Karl,
>> >>
>> >> Yes. I thought the standard update handler met that requirement.
>> >> For instance, the Tika extractor transformation connector creates two
>> >> files:
>> >> 1. addtoSolr.xml for add and update
>> >> 2. deletetoSolr.xml for delete
>> >> The File connector ingests these XML files, then the Solr connector posts
>> >> them via the "/update" handler.
>> >>
>> >> In the Solr connector, update handler functions other than the
>> >> "/update" handler might not be necessary.
>> >>
>> >> Thanks,
>> >> Shinichiro Abe
>> >>
>> >> On 2014/06/18, at 8:02, Karl Wright <daddywri@gmail.com> wrote:
>> >>
>> >>> Hi Abe-san,
>> >>>
>> >>> So just to be sure -- you believe that no changes at all are required to
>> >>> the Solr Connector as it stands now, other than to use the update handler
>> >>> rather than the /update/extract handler?
>> >>>
>> >>> Karl
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe
>> >>> <shinichiro.abe.1@gmail.com> wrote:
>> >>>
>> >>>>> As for changing the Solr connector so that it doesn't go to the
>> >>>>> extracting update handler
>> >>>>
>> >>>> I don't think we need to change the Solr connector with a new checkbox,
>> >>>> because currently we can change "/update/extract" into "/update" at
>> >>>> 'Update Handler' on the Paths tab in the Solr connector UI. I confirmed
>> >>>> I could post CSV, JSON and XML files to Solr by changing that and using
>> >>>> the File connector. So I wish we would allow the Tika extractor
>> >>>> transformation connector to create XML files that Solr expects to see.
>> >>>>
>> >>>> Regards,
>> >>>> Shinichiro Abe
>> >>>>
>> >>>>
>> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>> >>>>
>> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>> >>>>> contribute a Tika extractor transformation connector - and if they
>> >>>>> don't get around to that in a month or so, I may take a crack at it
>> >>>>> myself.
>> >>>>>
>> >>>>> As for changing the Solr connector so that it doesn't go to the
>> >>>>> extracting update handler, it would be great if:
>> >>>>> (1) Someone created a ticket for this, and
>> >>>>> (2) A patch was provided that maintains backwards compatibility with
>> >>>>> previous versions of the connector (so a checkbox would probably need
>> >>>>> to go into the UI somewhere).  Do either of you want to start this
>> >>>>> process?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Karl
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
>> >>>>>> and is expected to have a Tika extractor as a transformation
>> >>>>>> connector.
>> >>>>>>
>> >>>>>> Karl
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla
>> >>>>>> <m.grolla@sourcesense.com> wrote:
>> >>>>>>
>> >>>>>>> Thanks Alessandro,
>> >>>>>>>       that explains the situation clearly.
>> >>>>>>> And I agree that sending all the metadata as GET parameters can be
>> >>>>>>> problematic.
>> >>>>>>>
>> >>>>>>> Cheers
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Matteo Grolla
>> >>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>> http://www.sourcesense.com
>> >>>>>>>
>> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>> >>>>>>>
>> >>>>>>>> mmmm, the point is that right now ManifoldCF has no extractors.
>> >>>>>>>> The repository connectors extract the binary directly and there is
>> >>>>>>>> no "Extractor Processor" yet.
>> >>>>>>>> But recently a pipeline processor architecture has been designed
>> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
>> >>>>>>>> so it can fit there.
>> >>>>>>>>
>> >>>>>>>> Cheers
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
>> >>>>>>>>
>> >>>>>>>>> Since the Solr extracting request handler takes the binary and
>> >>>>>>>>> extracts text, what is the point of not using the Manifold
>> >>>>>>>>> extractor and sending text and binaries to Solr?
>> >>>>>>>>> I mean the end result is the same: Solr indexes text and stores
>> >>>>>>>>> text.
>> >>>>>>>>> So if Manifold supports text extraction, it seems to me this is the
>> >>>>>>>>> place where it should be done.
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Matteo Grolla
>> >>>>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>>>> http://www.sourcesense.com
>> >>>>>>>>>
>> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Matteo
>> >>>>>>>>>>
>> >>>>>>>>>> Manifold already handles the extraction, but the only way to send
>> >>>>>>>>>> binary content and document metadata to Solr is using the
>> >>>>>>>>>> update/extract handler, where the metadata is sent as query
>> >>>>>>>>>> parameters and the binary content is sent in the body of the
>> >>>>>>>>>> requests, allowing Solr to use Tika to obtain the raw content to
>> >>>>>>>>>> be stored in Solr.
>> >>>>>>>>>>
>> >>>>>>>>>> Regards
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla
>> >>>>>>>>>> <m.grolla@sourcesense.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi, during my first indexing I noticed that Manifold uses the
>> >>>>>>>>>>> Solr extracting request handler to extract the content of an XML
>> >>>>>>>>>>> file.
>> >>>>>>>>>>> For performance reasons it would be better if Manifold handled
>> >>>>>>>>>>> the extraction, letting Solr act as the search engine.
>> >>>>>>>>>>> Is this because of the connector design, the framework design,
>> >>>>>>>>>>> or is it just still to be done?
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Matteo Grolla
>> >>>>>>>>>>> Sourcesense - making sense of Open Source
>> >>>>>>>>>>> http://www.sourcesense.com
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> --------------------------
>> >>>>>>>>
>> >>>>>>>> Benedetti Alessandro
>> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>> >>>>>>>>
>> >>>>>>>> "Tyger, tyger burning bright
>> >>>>>>>> In the forests of the night,
>> >>>>>>>> What immortal hand or eye
>> >>>>>>>> Could frame thy fearful symmetry?"
>> >>>>>>>>
>> >>>>>>>> William Blake - Songs of Experience -1794 England
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> >>>> Shinichiro Abe
>> >>>> 阿部 慎一朗
>> >>>>
>> >>
>> >>
>>
>>
>
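The add and delete files proposed earlier in the thread (addtoSolr.xml and deletetoSolr.xml) would need to follow the XML message format accepted by Solr's "/update" handler. The following is a minimal sketch of building such messages; the function names and sample field names are assumptions for illustration, and this is not code from any ManifoldCF connector.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(fields):
    """Build an <add> message containing one <doc> with the given field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes <, > and & when serializing
    return ET.tostring(add, encoding="unicode")


def solr_delete_xml(doc_id):
    """Build a <delete> message that removes a single document by id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


add_message = solr_add_xml({"id": "doc-1", "content": "a < b"})
delete_message = solr_delete_xml("doc-1")
print(add_message)
print(delete_message)
```

Either string could then be posted to the "/update" handler with a Content-Type of text/xml, which is what the File connector plus Solr connector arrangement described above would effectively do.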

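The literal.* parameters discussed in the thread are how the "/update/extract" handler receives metadata alongside the document body. The following is a minimal sketch of building such a request URL; the base URL, core name, and field names are hypothetical, and the function build_extract_url is not part of ManifoldCF or Solr.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_extract_url(base_url, doc_id, metadata):
    """Return an /update/extract URL carrying metadata as literal.* parameters."""
    params = [("literal.id", doc_id)]
    for field, value in sorted(metadata.items()):
        params.append((f"literal.{field}", value))
    # urlencode percent-encodes values, so spaces and '&' in metadata are safe.
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/collection1",  # hypothetical Solr core URL
    "doc-1",
    {"content": "hello world", "metaField": "foobar"},
)
print(url)
# Decode the query string again to show what the handler would receive.
print(parse_qs(urlsplit(url).query))
```

Because every metadata value ends up in the query string, a document with a lot of metadata produces a very long URL, which is one reason sending all the metadata as GET parameters was called problematic in the thread.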