lucene-solr-user mailing list archives

From Varun Thacker <varunthacker1...@gmail.com>
Subject Re: How to run a subsequent update query to documents indexed from a dataimport query
Date Mon, 27 Jan 2014 05:59:22 GMT
Hi Dileepa,

If I understand correctly, this is what happens in your system:

1. DIH sends data to Solr.
2. You have written a custom update processor
(http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
Stanbol server for metadata, adds it to the document and then indexes it.

It's the part where you query the Stanbol server and wait for the response
that takes time, and you want to reduce this.

In my view, instead of waiting for the response from the Stanbol server
and then indexing, you could send the required field data from the doc to
your Stanbol server and continue. Once Stanbol has enriched the document,
you re-index the document and update it with the metadata.

This method makes you re-index the document, but the changes from your
client would be visible faster.
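
A minimal Java sketch of that flow, in case it helps. It is only an
illustration under assumptions: the Stanbol call is represented by a
hypothetical function passed in by the caller, the field name "persons" is
made up, and actually posting the JSON to /update is left out. It relies on
Solr atomic updates (Solr 4.0 or later), which need the update log enabled
and the other fields stored.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Sketch: index first, enrich in the background, then update the same id. */
class DeferredEnricher implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    // Hypothetical stand-in for the Stanbol request: content in, entity names out.
    private final Function<String, List<String>> stanbol;

    DeferredEnricher(Function<String, List<String>> stanbol) {
        this.stanbol = stanbol;
    }

    /** Returns immediately; the future completes with a JSON body for /update. */
    CompletableFuture<String> enrichLater(String id, String content) {
        return CompletableFuture.supplyAsync(
                () -> atomicUpdate(id, "persons", stanbol.apply(content)), pool);
    }

    /** Builds an atomic "set" update for one field of one document. */
    static String atomicUpdate(String id, String field, List<String> values) {
        String joined = values.stream()
                .map(DeferredEnricher::quote)
                .collect(Collectors.joining(","));
        return "[{\"id\":" + quote(id) + "," + quote(field)
                + ":{\"set\":[" + joined + "]}}]";
    }

    // Escapes only backslashes and double quotes; enough for a sketch.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

The /update request that carries this JSON should go through a chain that
does not contain the Stanbol processor again, otherwise the enrichment
would run twice.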

Alternatively, you could do the same thing at the DIH level by writing a
custom Transformer
(http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers).
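
A rough sketch of what such a transformer could look like. As far as I
recall from the wiki page above, a DIH transformer can be a plain class
with a transformRow(Map<String, Object>) method, so nothing here depends
on Solr classes. The column names "id" and "content" and the static queue
are assumptions for illustration; a separate background worker (not shown)
would drain the queue and talk to Stanbol.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of a DIH transformer that queues rows for later enrichment
 * instead of blocking the import. In a real deployment the class would be
 * declared public, in its own package, so that DIH can instantiate it.
 */
class EnrichmentQueueTransformer {
    // Hypothetical hand-off point: each entry is {id, content}.
    static final BlockingQueue<String[]> PENDING = new LinkedBlockingQueue<>();

    // Called by DIH once per row read from the data source.
    public Object transformRow(Map<String, Object> row) {
        Object id = row.get("id");           // assumed column name
        Object content = row.get("content"); // assumed column name
        if (id != null && content != null) {
            PENDING.offer(new String[] { id.toString(), content.toString() });
        }
        return row; // the row is indexed unchanged, so the import is not delayed
    }
}
```

The class would then be named in the transformer attribute of the entity
in data-config.xml, using whatever package you put it in.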


On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody <
dileepajayakody@gmail.com> wrote:

> Hi Ahmet,
>
>
>
> On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan <iorixxx@yahoo.com> wrote:
>
> > Hi,
> >
> > Here is what I understand from your question.
> >
> > You have a custom update processor that runs with DIH, but it is slow. You
> > want to run that text enhancement component after DIH. How would this help
> > to speed things up?
> >
> > In this approach you will read/query/search already indexed and committed
> > Solr documents and run the text enhancement on them. This process will
> > probably add new fields. And then you will update these Solr documents?
> >
> > Did I understand your use case correctly?
> >
>
> Yes, that is exactly what I want to achieve.
> I want to separate out the enhancement process from the dataimport process.
> The dataimport process will be invoked by a client when new data is
> added to or updated in the MySQL database. Therefore the documents, with
> their mandatory fields, should be indexed as soon as possible.
> Mandatory fields are mapped to the data table columns in the
> data-config.xml and the normal /dataimport process doesn't take much time.
> The enhancements are done in my custom processor by sending the content
> field of the document to an external Stanbol[1] server to detect NLP
> enhancements. Then new NLP fields are added to the document (detected
> persons, organizations, places in the content) in the custom update
> processor and if this is executed during the dataimport process, it takes a
> lot of time.
>
> The NLP fields are not mandatory for the primary usage of the application
> which is to query documents with mandatory fields. The NLP fields are
> required for custom queries for Person, Organization entities. Therefore
> the NLP update process should be run as a background process detached from
> the primary /dataimport process. It should not slow down the existing
> /dataimport process.
>
> That's why I am looking for the best way to achieve my objective. I want to
> implement a way to separately update the documents imported from
> /dataimport to detect NLP enhancements. Currently my idea is to adopt a
> timestamp-based approach: trigger an /update query for all documents
> imported after the last_index_time in dataimport.properties and update
> them with NLP fields.
>
> Hope my requirement is clear :). Appreciate your suggestions.
>
> [1] http://stanbol.apache.org/
>
> >
> >
> >
> >
> > On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody <
> > dileepajayakody@gmail.com> wrote:
> > Hi all,
> >
> > Any ideas on how to run a reindex update process for all the imported
> > documents from a /dataimport query?
> > Appreciate your help.
> >
> >
> > Thanks,
> > Dileepa
> >
> >
> >
> > On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody <
> > dileepajayakody@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I did some research on this and found some alternatives useful to my
> > > use case. Please give your ideas.
> > >
> > > Can I update all documents indexed after a /dataimport query using the
> > > last_index_time in dataimport.properties?
> > > If so, can anyone please give me some pointers?
> > > What I currently have in mind is something like below;
> > >
> > > 1. Store the indexing timestamp of the document as a field, e.g.:
> > > <field name="timestamp" type="date" indexed="true" stored="true"
> > >        default="NOW" multiValued="false"/>
> > >
> > > 2. Read the last_index_time from the dataimport.properties
> > >
> > > 3. Query all document ids indexed after the last_index_time and send them
> > > through the Stanbol update processor.
> > >
> > > But I have a question here:
> > > Does the last_index_time refer to when the dataimport is started
> > > (onImportStart) or when it is finished (onImportEnd)?
> > > If it's the onImportEnd timestamp, then this solution won't work, because
> > > the timestamp indexed in the document field will satisfy:
> > > onImportStart < doc-index-timestamp < onImportEnd.
> > >
> > >
> > > Another alternative I can think of is to trigger an update chain via an
> > > EventListener configured to run after a dataimport is processed
> > > (onImportEnd).
> > > In this case, can the context in DIH give the list of document ids
> > > processed in the /dataimport request? If so, I can send those doc ids with
> > > an /update query to run the Stanbol update process.
> > >
> > > Please give me your ideas and suggestions.
> > >
> > > Thanks,
> > > Dileepa
> > >
> > >
> > >
> > >
> > > On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody <
> > > dileepajayakody@gmail.com> wrote:
> > >
> > >> Hi All,
> > >>
> > >> I have a Solr requirement to send all the documents imported from a
> > >> /dataimport query to go through another update chain as a separate
> > >> background process.
> > >>
> > > >> Currently I have configured my custom update chain in the /dataimport
> > > >> handler itself. But since my custom update process needs to connect to an
> > > >> external enhancement engine (Apache Stanbol) to enhance the documents with
> > > >> some NLP fields, it has a negative impact on the /dataimport process.
> > > >> The solution will be to have a separate update process running to enhance
> > > >> the content of the documents imported from /dataimport.
> > >>
> > >> Currently I have configured my custom Stanbol Processor as below in my
> > >> /dataimport handler.
> > >>
> > > >> <requestHandler name="/dataimport" class="solr.DataImportHandler">
> > > >>   <lst name="defaults">
> > > >>     <str name="config">data-config.xml</str>
> > > >>     <str name="update.chain">stanbolInterceptor</str>
> > > >>   </lst>
> > > >> </requestHandler>
> > > >>
> > > >> <updateRequestProcessorChain name="stanbolInterceptor">
> > > >>   <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
> > > >>   <processor class="solr.RunUpdateProcessorFactory" />
> > > >> </updateRequestProcessorChain>
> > >>
> > >>
> > > >> What I need now is to separate the two processes of dataimport and
> > > >> Stanbol enhancement.
> > > >> So this is like running a separate re-indexing process periodically over
> > > >> the documents imported from /dataimport for Stanbol fields.
> > > >>
> > > >> The question is how to trigger my Stanbol update process for the documents
> > > >> imported from /dataimport.
> > > >> In Solr, to trigger an /update query we need to know the id and the fields
> > > >> of the document to be updated. In my case I need to run all the documents
> > > >> imported from the previous /dataimport process through a Stanbol
> > > >> update.chain.
> > >>
> > > >> Is there a way to keep track of the document ids imported from
> > > >> /dataimport?
> > >> Any advice or pointers will be really helpful.
> > >>
> > >> Thanks,
> > >> Dileepa
> > >>
> > >
> > >
> >
> >
>



-- 


Regards,
Varun Thacker
http://www.vthacker.in/
