manifoldcf-dev mailing list archives

From Matteo Grolla <m.gro...@sourcesense.com>
Subject Re: Solr Extracting request handler
Date Wed, 18 Jun 2014 14:20:58 GMT
Hi Alessandro,
	ideally, I think that text extraction from rich documents should be Manifold's responsibility,
not Solr's.
So the ideal place to implement it would be in the new document processing pipeline (using
Tika).
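
To make the idea concrete, here is a minimal sketch of extraction as a pipeline stage. It is only an illustration under stated assumptions: `Document`, `extract_text`, `run_pipeline` and `to_index_fields` are hypothetical names, not ManifoldCF, Tika or SolrJ APIs, and the extraction stage simply decodes UTF-8 where a real stage would call Tika. The `content` field name is borrowed from the `literal.content` parameter mentioned further down the thread. It is written in Python only to keep it short; the real connectors are Java.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a pipeline document (not the RepositoryDocument API)."""
    metadata: Dict[str, str] = field(default_factory=dict)
    binary: Optional[bytes] = None  # raw content as fetched from the repository
    text: Optional[str] = None      # character data, once a stage has extracted it


def extract_text(doc: Document) -> Document:
    """Placeholder extraction stage: decodes UTF-8 where a real stage would call Tika."""
    if doc.binary is None:
        return doc  # nothing to extract, pass the document through unchanged
    # The output document carries character data and metadata only, no binary.
    return Document(metadata=dict(doc.metadata),
                    text=doc.binary.decode("utf-8", errors="replace"))


def run_pipeline(doc: Document,
                 stages: List[Callable[[Document], Document]]) -> Document:
    """Pass the document through each transformation stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


def to_index_fields(doc: Document) -> Dict[str, str]:
    """Flatten an extracted document into the field/value map an output stage could send."""
    if doc.text is None:
        raise ValueError("no character data; run an extraction stage first")
    fields = dict(doc.metadata)
    fields["content"] = doc.text  # assumed field name, see the note above
    return fields
```

With `Document(metadata={"id": "doc-1"}, binary=b"hello")`, `run_pipeline(doc, [extract_text])` yields a document whose `text` is `"hello"` and whose `binary` is `None`, which `to_index_fields` can then flatten for a plain update request.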

-- 
Matteo Grolla
Sourcesense - making sense of Open Source
http://www.sourcesense.com

On 18 Jun 2014, at 16:16, Alessandro Benedetti wrote:

> Hello Karl,
> What I was thinking is this:
> assuming we have the Tika connector, the responsibility for extracting content
> will pass from Solr to the Tika processor.
> 
> So we can change the part of the Solr connector that builds
> the request sent to the extracting update handler.
> In particular, that part would switch to the classic approach: build a
> SolrInputDocument in SolrJ and then add it to the SolrServer.
> 
> Why should we provide backwards compatibility from the Solr connector's point of view?
> From the user's point of view, a job will be configured with the Tika connector
> in the pipeline, so we are providing the identical feature.
> One option is to include the Tika processor connector in the pipeline by
> default, and let users deactivate it only if needed.
> 
> Cheers
> 
> 
> 
> 2014-06-18 14:32 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> 
>> Hi Alessandro,
>> What is your concrete proposal to change the Solr connector?  Bear in mind
>> that we do need to maintain backwards compatibility.  If you list your
>> specific changes, not in any huge detail, but with enough detail that we
>> understand your proposal, that would help.  What happens to the UI?  What
>> happens to the internals?
>> 
>> Thanks,
>> Karl
>> 
>> 
>> 
>> On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <benedetti.alex85@gmail.com> wrote:
>> 
>>> But guys, why not simply switch to classic SolrJ SolrInputDocument creation
>>> and ingestion into the Solr server? Easy and straightforward!
>>> 
>>> In the end, at that point the RepositoryDocument will be only a map of
>>> metadata and values.
>>> Content will be part of that, so I guess the conversion to a SolrInputDocument
>>> will be immediate.
>>> 
>>> Cheers
>>> 
>>> 
>>> 2014-06-18 3:26 GMT+01:00 Karl Wright <daddywri@gmail.com>:
>>> 
>>>> Hi Abe-san,
>>>> 
>>>> Near as I can tell, the major consumer of disk space is the Maven target
>>>> directories.  This is generating many tens of megabytes of temporary disk
>>>> usage for every connector.  Luckily if you use ant, this is not a problem.
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>> 
>>>>> Hi Abe-san,
>>>>> 
>>>>> Tika jars are not very big:
>>>>> 
>>>>> C:\wip\mcf\trunk\lib>dir tika*
>>>>> Volume in drive C has no label.
>>>>> Volume Serial Number is 002E-D1F0
>>>>> 
>>>>> Directory of C:\wip\mcf\trunk\lib
>>>>> 
>>>>> 06/05/2014  08:21 AM           493,374 tika-core.jar
>>>>> 06/05/2014  08:21 AM           523,677 tika-parsers.jar
>>>>>               2 File(s)      1,017,051 bytes
>>>>>               0 Dir(s)  140,792,315,904 bytes free
>>>>> 
>>>>> The entire lib directory is 85M:
>>>>> 
>>>>> 85,156,330 bytes
>>>>> 
>>>>> The built binary image is still about 185MB, I believe.  So I don't know
>>>>> why you think it is >1GB?  Temporary class files?  I don't think we can
>>>>> avoid those.
>>>>> 
>>>>> I'd rather not make things more complicated than they need to be by adding
>>>>> a new required service - even though it would fit naturally with the
>>>>> connector arrangement.
>>>>> 
>>>>> Karl
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
>>>>> 
>>>>>> Hi Karl,
>>>>>> 
>>>>>> Okay, I had assumed the Tika connector outputs files.
>>>>>> If we post the character data and metadata obtained from Tika, the
>>>>>> "/update/extract" handler can handle this (by providing the params
>>>>>> literal.content=value&literal.metaField=foobar
>>>>>> and using a NullInputStream for the binary data, as in CONNECTORS-936).
>>>>>> 
>>>>>> BTW, the trunk build size is now too big (1G+), maybe because the CloudSearch
>>>>>> connector uses Tika jars.
>>>>>> Should the Tika connector and CloudSearch connector extract text via
>>>>>> tika-server [1], so that MCF does not need to carry many Tika jars? What do you think?
>>>>>> 
>>>>>> [1] http://wiki.apache.org/tika/TikaJAXRS
>>>>>> 
>>>>>> Thanks,
>>>>>> Shinichiro Abe
>>>>>> 
>>>>>> On 2014/06/18, at 9:45, Karl Wright <daddywri@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Abe-san,
>>>>>>> 
>>>>>>> It sounds like you might be thinking that transformation connectors are
>>>>>>> like output connectors.  Just so we are clear, transformation connectors in
>>>>>>> 1.7 receive a RepositoryDocument as input, and then pass a
>>>>>>> RepositoryDocument on to the next connector in the chain.  So I don't know
>>>>>>> why .xml files would be involved.  I'd expect the Tika connector to read a
>>>>>>> binary file from one RepositoryDocument object and convert its contents to
>>>>>>> another RepositoryDocument object which would have character data and
>>>>>>> metadata only.  Would this work for your case, do you think?
>>>>>>> 
>>>>>>> Karl
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Karl,
>>>>>>>> 
>>>>>>>> Yes. I thought the standard update handler met that requirement.
>>>>>>>> For instance, the Tika extractor transformation connector creates two files:
>>>>>>>> 1. addtoSolr.xml for add and update
>>>>>>>> 2. deletetoSolr.xml for delete
>>>>>>>> The File connector ingests these XML files, then the Solr connector posts
>>>>>>>> these files to the "/update" handler.
>>>>>>>> 
>>>>>>>> In the Solr connector, update handler functionality other than
>>>>>>>> the "/update" handler might not be necessary.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Shinichiro Abe
>>>>>>>> 
>>>>>>>> On 2014/06/18, at 8:02, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Abe-san,
>>>>>>>>> 
>>>>>>>>> So just to be sure -- you believe that no changes at all are required to
>>>>>>>>> the Solr Connector as it stands now, other than to use the update handler
>>>>>>>>> rather than the /update/extract handler?
>>>>>>>>> 
>>>>>>>>> Karl
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>>>>>> update handler
>>>>>>>>>> 
>>>>>>>>>> I don't think the Solr connector needs to change with a new checkbox, because
>>>>>>>>>> currently we can change "/update/extract" into "/update" under 'Update
>>>>>>>>>> Handler' on the Paths tab in the Solr connector UI. I confirmed I could post CSV,
>>>>>>>>>> JSON and XML files to Solr by changing that and using the File connector. So I
>>>>>>>>>> wish we would allow the Tika extractor transformation connector to create XML
>>>>>>>>>> files that Solr expects to see.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
>>>>>>>>>>> contribute a Tika extractor transformation connector - and if they don't
>>>>>>>>>>> get around to that in a month or so, I may take a crack at it myself.
>>>>>>>>>>> 
>>>>>>>>>>> As for changing the Solr connector so that it doesn't go to the extracting
>>>>>>>>>>> update handler, it would be great if:
>>>>>>>>>>> (1) Someone created a ticket for this, and
>>>>>>>>>>> (2) A patch was provided that maintains backwards compatibility with
>>>>>>>>>>> previous versions of the connector (so a checkbox would probably need to go
>>>>>>>>>>> into the UI somewhere).  Do either of you want to start this process?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Karl
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>> 
>>>>>>>>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
>>>>>>>>>>>> is expected to have a Tika extractor as a transformation connector.
>>>>>>>>>>>> 
>>>>>>>>>>>> Karl
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m.grolla@sourcesense.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks Alessandro,
>>>>>>>>>>>>>      that explains the situation clearly.
>>>>>>>>>>>>> And I agree that sending all the metadata as GET parameters can be
>>>>>>>>>>>>> problematic.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hmm, the point is that right now ManifoldCF has no extractors.
>>>>>>>>>>>>>> The repository connectors extract the binary directly, and there is no
>>>>>>>>>>>>>> "Extractor Processor" yet.
>>>>>>>>>>>>>> But recently a pipeline processor architecture has been proposed
>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
>>>>>>>>>>>>>> so it can fit there.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Since the Solr extracting request handler takes the binary and extracts text,
>>>>>>>>>>>>>>> what is the point of not using the Manifold extractor and sending text and
>>>>>>>>>>>>>>> binaries to Solr?
>>>>>>>>>>>>>>> I mean, the end result is the same: Solr indexes text and stores text.
>>>>>>>>>>>>>>> So if Manifold supports text extraction, it seems to me this is the place
>>>>>>>>>>>>>>> where it should be done.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Matteo
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
>>>>>>>>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
>>>>>>>>>>>>>>>> where the metadata is sent as query parameters and the binary content is
>>>>>>>>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
>>>>>>>>>>>>>>>> raw content to be stored in Solr.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi. During my first indexing I noticed that Manifold uses the Solr extracting
>>>>>>>>>>>>>>>>> request handler to extract the content of an XML file.
>>>>>>>>>>>>>>>>> For performance reasons it would be better if Manifold handled the
>>>>>>>>>>>>>>>>> extraction, letting Solr do the search engine work.
>>>>>>>>>>>>>>>>> Is this because of the connector design, the framework design, or is it just
>>>>>>>>>>>>>>>>> still to be done?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Matteo Grolla
>>>>>>>>>>>>>>>>> Sourcesense - making sense of Open Source
>>>>>>>>>>>>>>>>> http://www.sourcesense.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> --------------------------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Benedetti Alessandro
>>>>>>>>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> "Tyger, tyger burning bright
>>>>>>>>>>>>>> In the forests of the night,
>>>>>>>>>>>>>> What immortal hand or eye
>>>>>>>>>>>>>> Could frame thy fearful symmetry?"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>>>>> Shinichiro Abe
>>>>>>>>>> 阿部 慎一朗
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 

