lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: stream of events never to know when it ends? how to index such things & search
Date Thu, 19 Feb 2009 15:04:08 GMT
My indexes have been much more static than yours, so I'll
defer indexing event logging recommendations to others. But as I
remember, the issue of indexing log files has been discussed
on the list before, a search of logfiles or log files in the
searchable archive might be useful.

Your problem is additionally complicated by the fact (I assume)
that you have two different indexes to worry about, the current
day in RAM and past days in FSDir? Or are you only really
worried about events for one day?

But you're right, it's expensive to reconstruct a document
from the index and there's no way to get the unstemmed
version out that I know of . I could ask, if there are a small
number of tokens (inferred from your "highly redundant")
you're bothering to stem, but that's an aside....

I wonder if you could reduce your index size by storing
an encoded version of your "redundant" event IDs. Crudely,
you could store an int for each event rather than the
text, but that depends upon whether you could absolutely
define *all* the events. Don't know if that'd help or not....

About PositionIncrementGap etc. When you call
doc.add("field", "here's some data") on the *same*
field in the *same* document, a call is made to
your Analyzer.getPositionIncrementGap. This is the
*additional* offset to add to the next token for calls
2-N. Here's an example:

doc.add("field", "first set of tokens");
doc.add("field", "second bunch of tokens");

Let's assume that you have an Analyzer that
returns 100 for gePositionIncrementGap rather
than the default of 1. Note that this is an
overridable method so you can do anything you
want.....

The term positions of the tokens will be something
like:
first - 0
set - 1
of - 2
tokens - 3
second - 103
bunch - 104
of - 105
tokens - 106


Proximity queries (see the query syntax) allow you to say,
in effect, "only match if the desired tokens are within X of each
other". SpanQueries are Querys that you programmatically
construct that can extend this idea (see the classes in the
JavaDocs).

The use here is that if I submit the query "first bunch"~10 the above
document won't match since "first" is more than 10 away from "bunch"
but "first set"~10 *will* match. The (possible) application in your
situation is if you did manage to use one document per event ID, but
did NOT want terms in searches to match across sub-events for
that ID, you could use this mechanism to insure that. Simply choose
an IncrementGap greater than the maximum number of terms in
an event description,  then when you want to search in the
description field, just use a proximity less than the IncrementGap.
It may not apply at all for you, but that's the idea.....


Sorry I can't be more help
Erick

On Thu, Feb 19, 2009 at 8:25 AM, Christian Brennsteiner <
christian@brennsteiner.at> wrote:

> hi erick,
>
> nr of events are 107/sec in average with 400/sec peak and 20/sec low.
> between searchable should be less than 20 minutes. we are planning to
> index IN RAM only for a duration of one day MAX. per lucene process on
> the operating system.
>
> currently we need 500 M RAM for indexing one day (just storing the
> eventids and indexing (without storing) highly redundant event
> descriptions. collecting all eventdescriptions costs us additionally
> 3G ram (which is very much :-( for us.)
>
> @PositionIncrementGap or SpanQueries or the proximity operator ...
> sorry i am bloody beginner i don't really kow what you are talking
> about.
>
> a real update would be perfect... but i think from the current design
> it is not possible to extract all unstemmed keywords from a HIT? or is
> this possible?
> update then would be:
>
> search eventid
> get hits (should be one)
> extract all keywords from hit
> add new information plus hits newly to the index
> delete the hit.
>
> is there a possibility to gather detailed information about the index
> itself, that i can give you a detailed idea how big / and in which
> condition it is?
>
> regards chris
>
>
>
>
>
> On Wed, Feb 18, 2009 at 5:38 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
> > You could always sort by EVENTID, that way at least
> > you'd have all the events for a particular ID together
> > in your results. You'd have to post-filter the results to
> > determine whether all the necessary descriptions were
> > present. But I don't think this works all that well because,
> > as you pointed out, you may have a lot of records to
> > sort through so I don't think this is a very good idea...
> >
> >
> >
> > How many events are we talking about here and what
> > kind of lag between an event and being able to search it
> > can you tolerate? I guess what I'm really asking is whether
> > it's possible to recreate your index "often enough" to
> > satisfy your users. If so, you can index multiple
> > descriptions in a single document, something like
> >
> > doc.add("EVENTDESCRIPTION", "STARTING EVENT")
> > doc.add("EVENTDESCRIPTION", "XYZ")
> > doc.add("EVENTDESCRIPTION", "ABC")
> > doc.add("EVENTID", "1")
> > IndexWriter.addDocument(doc);
> >
> >
> > You'd have to gather all the descriptions related
> > to each EVENTID before you were able to index the doc.....
> >
> > By manipulating the PositionIncrementGap you could also
> > keep searches from matching across different EVENTDESCRIPTIONs,
> > e.g. if you didn't want to match +STARTING +ABC you could use
> > SpanQueries or the proximity operator, but going into details
> > depends upon whether you can rebuild your index so we'll defer
> > that part....
> >
> > You could also think about updating the document when new events
> > were added, but since an update is really a delete/add under the
> > covers you'd have to either gather enough information from what I
> > assume is your log or store enough information with the document to
> > recreate it.
> >
> > How big is your index currently and what kind of throughput do you
> > require?
> >
> > Best
> > Erick
> >
> >
> > On Wed, Feb 18, 2009 at 10:20 AM, Christian Brennsteiner <
> > christian@brennsteiner.at> wrote:
> >
> >> dear lucene community,
> >>
> >> i am playing around with lucene right now. and have come to very bad
> >> problem.
> >>
> >> given environment:
> >>
> >> a signal source gives signals with eventids ans eventdescriptions
> >>
> >> for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT"
> >>
> >> those events can be running very long (e.g. one month) during this
> >> period we will receive for example
> >>
> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING XYZ"
> >> 10 minutes later
> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING YZA"
> >> 10 minutes later
> >> EVENTID=1 and EVENTDESCRIPTION="PASSED MILESTONE1"
> >> 10 minutes later
> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING ZAB"
> >>
> >> after e.g. 1 week we receive
> >> EVENTID=1 and EVENTDESCRIPTION="STOPING EVENT"
> >>
> >> what i want:
> >> i want to be able to search e.g. which eventids are connected to "XYZ"
> >> AND "ZAB" AND have already passed "MILESTONE1"
> >>
> >> so my current try is to index all events by full indexing (without
> >> storing) eventdescriptions AND stemming e.g. EXECUTING
> >>
> >> then searching for "+XYZ +ZAB +MILESTONE1"
> >> --> result no document since those are all seperated documents
> >> when i search
> >>  "XYZ ZAB MILESTONE1"
> >> i am getting 3 times EVENTID 3
> >> --> this is bad since when i get 1000000 of such events how do i rank
> them?
> >>
> >> CONCLUSION:
> >> my biggest problem is that my lucene document given to the index
> >> currently is not in a final state BUT i have to index and search it
> >> also while it is in progress.
> >> as a result of this the ranking as i do it now has no real value since
> >> the ranking is just based on a "line of a whole event"
> >>
> >> QUESTION:
> >> is there a solution within lucene to combine search results? e.g. merge
> >> them OR
> >> is there a better workaround how i would do such updates to the index
> >> without storing the original docmuent inside the index (since this
> >> consumes so many space)? e.g. extracting the keywords that were stored
> >> for the item?
> >>
> >> any hints appreciated.
> >>
> >> regards chris
> >>
> >>
> >> ----------
> >> Christian Brennsteiner
> >> Salzburg / Austria / Europe
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
>
>
> --
> ----------
> Christian Brennsteiner
> Salzburg / Austria / Europe
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message