lucene-java-user mailing list archives

From Christian Brennsteiner <christ...@brennsteiner.at>
Subject Re: stream of events never to know when it ends? how to index such things & search
Date Fri, 20 Feb 2009 07:46:09 GMT
hi erick,

ram and fsdir:
we will hold each of the past 30 days in ram. every 1 or 2 days we will
start a separate process which holds 1-2 days of data. i think that
FSDir might be too slow, but i have never tested that .... my goal is
to search 30 days of indexes at about 300-700 MB / day (30 x 700 MB =
21 GB max) within one second.

from my point of view it should be easily possible to retrieve all
unstemmed tokens from a document, at least at the time you are adding
it? or am i wrong? can i prestem them? the stemmed version might use
much less space when i attach it to the current day's index. the days
in the past don't need this since they can rebuild themselves from
almost complete data (there are small problems with events spanning
several days... but those are rare), since 99% of events complete
within 1 hour.

a dictionary encoding (maybe of the top 3000 terms?) might be worth
doing. i think zipping is not an option since the payloads are far too
small.
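
A minimal sketch of the dictionary idea, in plain Java with no Lucene calls. The class name TermDictionary and its methods are hypothetical, invented for illustration. It assumes term counts have already been collected somewhere; it keeps only the most frequent terms and maps each one to a small int code, returning -1 for terms outside the dictionary so that they can still be kept as text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): maps the most frequent terms to int codes.
class TermDictionary {
    private final Map<String, Integer> codeByTerm = new HashMap<>();
    private final List<String> termByCode = new ArrayList<>();

    // Build a dictionary from term counts, keeping only the maxTerms most frequent terms.
    static TermDictionary fromCounts(Map<String, Integer> counts, int maxTerms) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the codes are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        TermDictionary dict = new TermDictionary();
        for (int i = 0; i < entries.size() && i < maxTerms; i++) {
            String term = entries.get(i).getKey();
            dict.codeByTerm.put(term, dict.termByCode.size());
            dict.termByCode.add(term);
        }
        return dict;
    }

    // Returns the code for a term, or -1 if the term is not in the dictionary.
    int encode(String term) {
        Integer code = codeByTerm.get(term);
        return code == null ? -1 : code;
    }

    // Returns the term for a code previously returned by encode.
    String decode(int code) {
        return termByCode.get(code);
    }
}
```

Whether this saves anything in practice depends on how the codes are then stored, which this sketch does not address.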


thanks for everything
chris

On Thu, Feb 19, 2009 at 4:04 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> My indexes have been much more static than yours, so I'll
> defer indexing event logging recommendations to others. But as I
> remember, the issue of indexing log files has been discussed
> on the list before; a search for "logfiles" or "log files" in the
> searchable archive might be useful.
>
> Your problem is additionally complicated by the fact (I assume)
> that you have two different indexes to worry about, the current
> day in RAM and past days in FSDir? Or are you only really
> worried about events for one day?
>
> But you're right, it's expensive to reconstruct a document
> from the index and there's no way to get the unstemmed
> version out that I know of. I could ask why, if there are a small
> number of tokens (inferred from your "highly redundant"),
> you're bothering to stem, but that's an aside....
>
> I wonder if you could reduce your index size by storing
> an encoded version of your "redundant" event IDs. Crudely,
> you could store an int for each event rather than the
> text, but that depends upon whether you could absolutely
> define *all* the events. Don't know if that'd help or not....
>
> About PositionIncrementGap etc. When you call
> doc.add("field", "here's some data") more than once on the
> *same* field in the *same* document, a call is made to
> your Analyzer.getPositionIncrementGap. This is the
> *additional* offset to add to the first token of each of calls
> 2-N. Here's an example:
>
> doc.add("field", "first set of tokens");
> doc.add("field", "second bunch of tokens");
>
> Let's assume that you have an Analyzer that
> returns 100 for getPositionIncrementGap rather
> than the default of 0. Note that this is an
> overridable method so you can do anything you
> want.....
>
> The term positions of the tokens will be something
> like:
> first - 0
> set - 1
> of - 2
> tokens - 3
> second - 104
> bunch - 105
> of - 106
> tokens - 107
>
>
> Proximity queries (see the query syntax) allow you to say,
> in effect, "only match if the desired tokens are within X of each
> other". SpanQueries are queries that you construct
> programmatically and that extend this idea (see the classes in the
> JavaDocs).
>
> The use here is that if I submit the query "first bunch"~10 the above
> document won't match since "first" is more than 10 away from "bunch",
> but "first set"~10 *will* match. The (possible) application in your
> situation is if you did manage to use one document per event ID, but
> did NOT want terms in searches to match across sub-events for
> that ID, you could use this mechanism to ensure that. Simply choose
> an IncrementGap greater than the maximum number of terms in
> an event description, then when you want to search in the
> description field, just use a proximity less than the IncrementGap.
> It may not apply at all for you, but that's the idea.....
>
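
The gap mechanism described above can be imitated in plain Java without Lucene. This is only a sketch under stated assumptions: the class PositionGapSketch and its methods are hypothetical, tokens are split on whitespace, every token advances the position by one, and the slop check handles just two terms in order, which is a simplification of what a real proximity query does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the analysis step (not Lucene code).
class PositionGapSketch {
    // One token together with the position it was assigned.
    static final class Posting {
        final String term;
        final int position;

        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Assigns positions across several values of the same field,
    // adding `gap` extra positions before every value after the first.
    static List<Posting> positions(int gap, String... values) {
        List<Posting> out = new ArrayList<>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap;
            }
            for (String term : values[v].trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // skip the empty token produced by a blank value
                }
                out.add(new Posting(term, next));
                next++;
            }
        }
        return out;
    }

    // Simplified two-term, in-order proximity check:
    // true if `second` follows `first` with at most `slop` positions between them.
    static boolean withinSlop(List<Posting> postings, String first, String second, int slop) {
        for (Posting a : postings) {
            if (!a.term.equals(first)) {
                continue;
            }
            for (Posting b : postings) {
                if (b.term.equals(second)
                        && b.position > a.position
                        && b.position - a.position - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With a gap of 100 and the two example values, withinSlop(postings, "first", "bunch", 10) is false while withinSlop(postings, "first", "set", 10) is true, which is the behaviour the paragraph above relies on.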
>
> Sorry I can't be more help
> Erick
>
> On Thu, Feb 19, 2009 at 8:25 AM, Christian Brennsteiner <
> christian@brennsteiner.at> wrote:
>
>> hi erick,
>>
>> the nr of events is 107/sec on average, with a 400/sec peak and a
>> 20/sec low. the lag between an event and it being searchable should
>> be less than 20 minutes. we are planning to index IN RAM only, for a
>> duration of one day MAX per lucene process on the operating system.
>>
>> currently we need 500 M RAM for indexing one day (just storing the
>> eventids and indexing (without storing) the highly redundant event
>> descriptions). collecting all event descriptions costs us an
>> additional 3 G of ram (which is very much :-( for us).
>>
>> @PositionIncrementGap or SpanQueries or the proximity operator ...
>> sorry, i am a bloody beginner and i don't really know what you are
>> talking about.
>>
>> a real update would be perfect... but i think from the current design
>> it is not possible to extract all unstemmed keywords from a HIT? or is
>> this possible?
>> update then would be:
>>
>> search for the eventid
>> get the hits (should be one)
>> extract all keywords from the hit
>> add the new information plus the hit's keywords to the index as a new document
>> delete the old hit.
>>
>> is there a possibility to gather detailed information about the index
>> itself, so that i can give you a detailed idea of how big it is and
>> what condition it is in?
>>
>> regards chris
>>
>>
>>
>>
>>
>> On Wed, Feb 18, 2009 at 5:38 PM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>> > You could always sort by EVENTID, that way at least
>> > you'd have all the events for a particular ID together
>> > in your results. You'd have to post-filter the results to
>> > determine whether all the necessary descriptions were
>> > present. But, as you pointed out, you may have a lot of
>> > records to sort through, so I don't think this is a very
>> > good idea...
>> >
>> >
>> >
>> > How many events are we talking about here and what
>> > kind of lag between an event and being able to search it
>> > can you tolerate? I guess what I'm really asking is whether
>> > it's possible to recreate your index "often enough" to
>> > satisfy your users. If so, you can index multiple
>> > descriptions in a single document, something like
>> >
>> > doc.add("EVENTDESCRIPTION", "STARTING EVENT");
>> > doc.add("EVENTDESCRIPTION", "XYZ");
>> > doc.add("EVENTDESCRIPTION", "ABC");
>> > doc.add("EVENTID", "1");
>> > IndexWriter.addDocument(doc);
>> >
>> >
>> > You'd have to gather all the descriptions related
>> > to each EVENTID before you were able to index the doc.....
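
The gathering step can be sketched in plain Java, without Lucene, to show why one combined document per EVENTID makes an all-terms-required search work. The class EventGroupingSketch and its methods are hypothetical, and the sketch assumes whitespace tokenization with no stemming.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for "one document per EVENTID" (not Lucene code).
class EventGroupingSketch {
    // eventId -> all upper-cased tokens seen in any description of that event
    private final Map<String, Set<String>> tokensByEvent = new LinkedHashMap<>();

    // Adds one incoming signal to the combined record for its event id.
    void add(String eventId, String description) {
        Set<String> tokens = tokensByEvent.computeIfAbsent(eventId, k -> new HashSet<>());
        for (String token : description.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toUpperCase(Locale.ROOT));
            }
        }
    }

    // Returns the event ids whose combined descriptions contain every required term,
    // in the order the ids were first seen (the analogue of "+A +B +C").
    List<String> matchAll(String... requiredTerms) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : tokensByEvent.entrySet()) {
            boolean hasAll = true;
            for (String term : requiredTerms) {
                if (!entry.getValue().contains(term.toUpperCase(Locale.ROOT))) {
                    hasAll = false;
                    break;
                }
            }
            if (hasAll) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

Because each event id appears once, a query requiring XYZ, ZAB and MILESTONE1 returns that id a single time instead of one hit per signal line.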
>> >
>> > By manipulating the PositionIncrementGap you could also
>> > keep searches from matching across different EVENTDESCRIPTIONs,
>> > e.g. if you didn't want to match +STARTING +ABC you could use
>> > SpanQueries or the proximity operator, but going into details
>> > depends upon whether you can rebuild your index so we'll defer
>> > that part....
>> >
>> > You could also think about updating the document when new events
>> > were added, but since an update is really a delete/add under the
>> > covers you'd have to either gather enough information from what I
>> > assume is your log or store enough information with the document to
>> > recreate it.
>> >
>> > How big is your index currently and what kind of throughput do you
>> > require?
>> >
>> > Best
>> > Erick
>> >
>> >
>> > On Wed, Feb 18, 2009 at 10:20 AM, Christian Brennsteiner <
>> > christian@brennsteiner.at> wrote:
>> >
>> >> dear lucene community,
>> >>
>> >> i am playing around with lucene right now and have run into a very bad
>> >> problem.
>> >>
>> >> given environment:
>> >>
>> >> a signal source gives signals with eventids and eventdescriptions
>> >>
>> >> for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT"
>> >>
>> >> those events can be running very long (e.g. one month) during this
>> >> period we will receive for example
>> >>
>> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING XYZ"
>> >> 10 minutes later
>> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING YZA"
>> >> 10 minutes later
>> >> EVENTID=1 and EVENTDESCRIPTION="PASSED MILESTONE1"
>> >> 10 minutes later
>> >> EVENTID=1 and EVENTDESCRIPTION="EXECUTING ZAB"
>> >>
>> >> after e.g. 1 week we receive
>> >> EVENTID=1 and EVENTDESCRIPTION="STOPING EVENT"
>> >>
>> >> what i want:
>> >> i want to be able to search e.g. which eventids are connected to "XYZ"
>> >> AND "ZAB" AND have already passed "MILESTONE1"
>> >>
>> >> so my current try is to index all events by full indexing (without
>> >> storing) eventdescriptions AND stemming e.g. EXECUTING
>> >>
>> >> then searching for "+XYZ +ZAB +MILESTONE1"
>> >> --> result: no document, since those are all separate documents
>> >> when i search
>> >>  "XYZ ZAB MILESTONE1"
>> >> i am getting 3 times EVENTID 3
>> >> --> this is bad since when i get 1000000 of such events how do i
>> >> rank them?
>> >>
>> >> CONCLUSION:
>> >> my biggest problem is that my lucene document given to the index
>> >> currently is not in a final state BUT i have to index and search it
>> >> also while it is in progress.
>> >> as a result of this the ranking as i do it now has no real value since
>> >> the ranking is just based on a "line of a whole event"
>> >>
>> >> QUESTION:
>> >> is there a solution within lucene to combine search results? e.g. merge
>> >> them OR
>> >> is there a better workaround for doing such updates to the index
>> >> without storing the original document inside the index (since this
>> >> consumes so much space)? e.g. extracting the keywords that were stored
>> >> for the item?
>> >>
>> >> any hints appreciated.
>> >>
>> >> regards chris
>> >>
>> >>
>> >> ----------
>> >> Christian Brennsteiner
>> >> Salzburg / Austria / Europe
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>
>> >>
>> >
>>
>


