lucene-java-user mailing list archives

From software visualization <softwarevisualizat...@gmail.com>
Subject Re: Using Lucene to search live, being-edited documents
Date Sat, 22 Jan 2011 15:55:09 GMT
Lance, Umesh thank you.

Lance I will look into this and report results when I try it out. Thank you
very much!

Umesh: Just thinking along these lines, when a user saves the document,
that event may have a semantic meaning that developers aren't privy to.
The user might be experimenting with the document in some way as a means of
gaining useful knowledge, but specifically and deliberately NOT saving it.
Users surprise you, no?

For this reason, I am not inclined to create a dependency between search and
save; users should be able to search at any time, irrespective of whether
they've saved.

But what's to stop us from hooking this up to a timer, say, and indexing it
every so often? It's not perfect, since indexing = I/O = possibly
noticeable delays (but the CPU can carry on doing useful things, of course).
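
To make the timer idea concrete, here is a rough sketch in plain Java. It is
only an illustration under assumptions: `PeriodicReindexer`, `currentText` and
`indexer` are hypothetical names, and the `indexer` callback stands in for
whatever actually rebuilds the Lucene index from the editor buffer. It skips
the rebuild when nothing has been edited since the last tick, so an idle
document costs no I/O.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Re-indexes the in-memory document on a timer, but only when it has changed. */
final class PeriodicReindexer implements AutoCloseable {
    private final Supplier<String> currentText; // hypothetical: returns the editor buffer
    private final Consumer<String> indexer;     // hypothetical: rebuilds the index from text
    private final AtomicLong editCount = new AtomicLong(); // bumped on every edit
    private long indexedEditCount = -1;         // edit count seen at the last rebuild
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicReindexer(Supplier<String> currentText, Consumer<String> indexer) {
        this.currentText = currentText;
        this.indexer = indexer;
    }

    /** The editor calls this whenever the buffer changes. */
    void onEdit() {
        editCount.incrementAndGet();
    }

    /** One timer tick; returns true if the index was rebuilt. */
    synchronized boolean reindexIfStale() {
        long seen = editCount.get();
        if (seen == indexedEditCount) {
            return false; // nothing changed since the last rebuild
        }
        indexer.accept(currentText.get());
        indexedEditCount = seen;
        return true;
    }

    /** Starts ticking off the UI thread, so typing is not blocked by indexing. */
    void start(long period, TimeUnit unit) {
        timer.scheduleWithFixedDelay(() -> {
            try {
                reindexIfStale();
            } catch (RuntimeException e) {
                // Swallow so one failed rebuild does not cancel all future ticks.
            }
        }, period, period, unit);
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

The editor would call `onEdit()` on every change and `start(5, TimeUnit.SECONDS)`
once when the document is opened; the period is a guess to be tuned against how
noticeable the indexing delay turns out to be.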


The only problem in this scenario, besides the not-really-real-time aspect of
it, is if the user decides (again, for their own good reasons) not to save
but rather to back out of the changes they have in memory; is the index now
"ahead" of the document? I suppose in that case I blow away the existing
index and rebuild it from the version of the document the user did decide to
save, and there's no discontinuity.
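
A small sketch of that revert handling, again in plain Java and only as an
illustration: `EditSession` and the `rebuildIndex` callback are hypothetical
names, with the callback standing in for a full rebuild of the index from the
given text. The point is that a revert rebuilds from the last saved text, so
the index can never stay ahead of the document.

```java
import java.util.function.Consumer;

/** Tracks the saved and working copies of one document and keeps the index in step. */
final class EditSession {
    private final Consumer<String> rebuildIndex; // hypothetical: replaces the whole index
    private String savedText;    // what the user last chose to save
    private String workingText;  // what is currently in the editor

    EditSession(String savedText, Consumer<String> rebuildIndex) {
        this.rebuildIndex = rebuildIndex;
        this.savedText = savedText;
        this.workingText = savedText;
        rebuildIndex.accept(savedText); // initial index matches the saved document
    }

    /** The user typed something; the index is refreshed later, not here. */
    void edit(String newText) {
        workingText = newText;
    }

    /** Called periodically: index the unsaved working copy so search sees live edits. */
    void reindexWorkingCopy() {
        rebuildIndex.accept(workingText);
    }

    /** Saving does not touch the index; search never depended on save. */
    void save() {
        savedText = workingText;
    }

    /** Backing out: discard edits and blow away the index built from them. */
    void revert() {
        workingText = savedText;
        rebuildIndex.accept(savedText);
    }
}
```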



Just thinking aloud now for anyone interested in the same problem.

On Sat, Jan 22, 2011 at 2:24 AM, Umesh Prasad <umesh.iitk@gmail.com> wrote:

> Nope. That won't always be the case. Users will not always be editing
> the document. They will edit the document, then save, and the saved
> version will be persisted in the DB. You can use DB triggers to push it
> into an indexing queue, from which the indexer can regularly pick up the
> document for indexing. You can schedule your indexer so that it picks up
> the indexing jobs every minute or so.
>    Unless you have a mission-critical system, this approach should be
> more than sufficient.
>
>
>
> On Sat, Jan 22, 2011 at 10:58 AM, software visualization
> <softwarevisualization@gmail.com> wrote:
> > If I understand you correctly, I think that this:
> >
> > If T2 < T1, Skip the result.
> >
> > will always be the case. The live, being-edited document is always "later"
> > in time than the indexed information about it.
> >
> >
> >
> > On Fri, Jan 21, 2011 at 9:11 PM, Umesh Prasad <umesh.iitk@gmail.com> wrote:
> >
> >> Hi,
> >>   One workaround would be to version the documents and store the
> >> version as well as the timestamp of the indexed document in the index.
> >>
> >> Reading between the lines, I assume that the document is
> >> a) stored in some DB/file
> >> b) indexed in a Lucene index
> >>
> >> The user searches on b) and gets document IDs,
> >> but documents are displayed to the user after retrieving them from a).
> >>
> >> Now, I do not know a way to keep a) and b) completely in
> >> sync in real time, as the indexing operation itself (a --> b)
> >> takes some time.
> >>
> >> Instead, we can do the following.
> >> a) Stored: Document ID + Document Text + Document Version +
> >> Modification Time Stamp (T1)
> >> b) Indexed: Document ID + Document Text + Document Version +
> >> Modification Time Stamp (T2) (when indexed) (broken into date + hour +
> >> mins + sec to minimize the number of terms)
> >>
> >> The user searches b).
> >> The search system gets Document ID + Modification Time Stamp (T2) and gives
> >> them to the presentation layer, which compares T1 and T2.
> >> If T2 < T1, Skip the result.
> >>
> >> Assumption : Stored document is always in sync. Documents are
> >> persisted somewhere and not served from memory.
> >>
> >> Thanks & Regards
> >> Umesh Prasad
> >>
> >>
> >>
> >> On Sat, Jan 22, 2011 at 1:29 AM, software visualization
> >> <softwarevisualization@gmail.com> wrote:
> >> > Hi sorry for the long delay.
> >> >
> >> > The idea is that a single user is editing a single document. As they edit,
> >> > any indexes built against the document become stale, actually wrong.
> >> > Example: references to specific localities within this document are all
> >> > instantly wrong the first time a user types a new beginning character;
> >> > they're all off by one. Deleting words is of course disastrous, etc.
> >> > So our story is: we used to have this document nicely indexed and now we
> >> > have nothing useful.
> >> >
> >> > Considering what Lucene does prior to indexing, stemming for instance, I am
> >> > not sure, no, I am quite sure I can't recreate the same powerful indexing
> >> > functionality.
> >> >
> >> > But it seems wrong to lure our users into opening this document with
> >> > promises that this, that and the other thing has been located for them,
> >> > only to invalidate all that just because they began to edit the document. I
> >> > understand why that happens, but my users are perhaps not as tech savvy, and
> >> > I think it will just feel "wrong" to them.
> >> >
> >> > So I am looking for a way around this.
> >> >
> >> >
> >> >
> >> > On Tue, Jan 4, 2011 at 1:25 PM, adasal <adam.saltiel@gmail.com> wrote:
> >> >
> >> >> I would think this is more like it.
> >> >> But the essential thing, so it seems to me, is whether there is a
> >> >> requirement for a serialised index, i.e. a more permanent record, aside
> >> >> from the saved document.
> >> >> Then, if there is a penalty to creating the index compared to regex,
> >> >> stringsearch or so, it is justified on other grounds.
> >> >> I think it is an interesting question: when does that requirement emerge?
> >> >> There is size of document.
> >> >> But there would also be field types. I think I have this right. This is
> >> >> really a classification system, so more than bare regex.
> >> >> There must be other criteria that apply to this use case, too?
> >> >>
> >> >> Adam
> >> >>
> >> >> p.s. we (in my work project) are just beginning to use Lucene for geometry
> >> >> objects and I am looking forward to understanding its use better, including,
> >> >> possibly, expanding it to other use cases apart from geo objects.
> >> >>
> >> >> On 3 January 2011 15:31, Robert Muir <rcmuir@gmail.com> wrote:
> >> >>
> >> >> > On Mon, Jan 3, 2011 at 10:16 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> >> >> > > There is also the MemoryIndex, which is in contrib and is designed for
> >> >> > > one document at a time.  That being said, basic grep/regex is probably
> >> >> > > fast enough.
> >> >> > >
> >> >> >
> >> >> > In cases where you are doing a 'find' in a document similar to what a
> >> >> > wordprocessor would do (especially if you want to iterate
> >> >> > forwards/backwards through matches etc), you might want to consider
> >> >> > something like
> >> >> > http://icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html
> >> >> >
> >> >> >
> >> >> > ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >
> >> >> >
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> ---
> >> Thanks & Regards
> >> Umesh Prasad
> >>
> >>
> >>
> >
>
>
>
> --
> ---
> Thanks & Regards
> Umesh Prasad
>
>
>
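
For anyone following along, the T1/T2 check Umesh describes in the quoted
thread could look roughly like the sketch below. It is plain Java with no
Lucene calls; `StaleHitFilter`, `Hit` and the `modifiedAt` map are
hypothetical stand-ins for the search results and the document store. Dropping
hits whose document is missing from the store is my own assumption, not
something stated in the thread.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Filters out search hits whose index entry is older than the stored document. */
final class StaleHitFilter {

    /** A search hit: the document ID plus the timestamp stored in the index (T2). */
    static final class Hit {
        final String docId;
        final Instant indexedAt;

        Hit(String docId, Instant indexedAt) {
            this.docId = docId;
            this.indexedAt = indexedAt;
        }
    }

    /**
     * Keeps only hits that are not stale.
     *
     * @param hits       results returned by the search system
     * @param modifiedAt hypothetical lookup of each stored document's
     *                   modification timestamp (T1), keyed by document ID
     */
    static List<Hit> dropStale(List<Hit> hits, Map<String, Instant> modifiedAt) {
        List<Hit> fresh = new ArrayList<>();
        for (Hit hit : hits) {
            Instant t1 = modifiedAt.get(hit.docId);
            // "If T2 < T1, Skip the result." Hits for documents that are no
            // longer in the store are skipped too (an assumption).
            if (t1 != null && !hit.indexedAt.isBefore(t1)) {
                fresh.add(hit);
            }
        }
        return fresh;
    }
}
```

This only helps when the stored copy is the source of truth, which is the
assumption Umesh states; for a document that lives in memory while it is being
edited, T1 would have to come from the editor buffer instead.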
