lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From abdul aleem <janaabdulal...@yahoo.com>
Subject Re: Indexing clarification , please advice
Date Wed, 13 Dec 2006 14:25:00 GMT
Many thanks Erick,

Your points are valid, i was thinking entire Log file
as a lucene document, im wrong trying  to chop the log
file might be the way to go

my bad expressions , yes you got that right
timestamp must be added as a "FIELD" that is what i
meant

really appreciate your detailed reply,


regards,
Abdul 

--- Erick Erickson <erickerickson@gmail.com> wrote:

> Let me take a crack at it. See below...
> 
> On 12/13/06, abdul aleem <janaabdulaleem@yahoo.com>
> wrote:
> >
> > Hello All,
> >
> > Apolgies if it is a naive question
> >
> > a) Indexing large file ( more than 4MB )
> >    Do i need to read the entire file as string
> using
> >    java.io and create a Document object ?
> 
> 
> Essentially yes. IF you must index the whole
> document as a single Lucene
> document. my reply later for why you may not want to
> do this.
> 
> The following are equivalent.
> Document doc = new Document();
> doc.add("textfield", "data1 data2 data3");
> doc.add("textfield", "data4 data5 data6");
> 
> 
> and
> Document doc = new Document();
> doc.add("textfield", "data1 data2 data3 data4 data5
> data6");
> 
> So you could read chunks of your file and index them
> into the *same* field
> before writing the document to the index, or you
> could read the file as a
> single chunk and index it all at once.
> 
> HOWEVER: note that the default in Lucene is to index
> only the first 10,000
> (?) tokens for a field in a single document. See
> IndexWriter.SetMaxFieldLength.
> 
> 
> 
>    The file contains timestamp, if i need to index
> on
> >    timestamp is parsing the entire file manually
> > (tokenizing) and store the timestamp as document
> > object is the only way ? or is there any
> alternatives
> > ?
> 
> 
> Perhaps I don't understand the problem. You can
> store the timestamp as a
> *field* in a document. Is that what you mean by
> storing it as a "document
> object"? But there's no way I know of to have Lucene
> automatically do
> something with arbitrary text in the input stream.
> You could write a custom
> analyzer (see Lucene In Action for the Synonym
> Analyzer for a model). That
> Analyzer would be responsible for recognizing
> timestamps in the input stream
> and doing something special with them.
> 
> 
> b) I need to search the contents of a log file which
> > is changing rapidly, from initial testing i see if
> any
> > changes in the file is not reflected unless it is
> > *Indexed* again
> >
> >    Do we need to index the files always before
> search
> > if the content of the file is dynamically changed
> 
> 
> Yes. That is, you can't search data you haven't
> indexed.
> 
> 
>   ( log file has a pattern and always logs in a
> > similar fashion, each time i need to index takes
> lot
> > of time as the file is large (approach a )  is
> there
> > any work arounds for this ? )
> 
> 
> I think you need to re-think your approach. A
> document in Lucene is whatever
> you want to think of it as. For instance, you could
> index your log file such
> that each Lucene "document" was all the data added
> to the log file over some
> specified time interval. So, say you have a log that
> starts at midnight.
> Each Lucene "document" could be all the data added
> to the index for each
> minute. So you'd have a document for the data added
> to the log between 12:00
> and 12:00:59. Another "document" for all data
> between 12:01:00 and 12:01:59
> etc. That way, you don't have to re-index the entire
> log, just everything
> since the last interval you already indexed.
> 
> I'm not necessarily recommending this approach, but
> using it to illustrate
> that you don't need to think of a Lucene Document as
> your entire log file.
> You may be much better off slicing the data up
> somehow and having a
> one-to-many relationship between your log and the
> Lucene documents......
> 
> Perhaps you could index every message as an
> individual document (which would
> deal with your timestamp issue), or.....
> 
> Hope this helps
> Erick
> 
>   I would greatly appreciate if any inputs on the
> > above,
> >
> > Many thanks,
> > Abdul
> >
> >
> >
> >
> >
> >
> >
> >
>
____________________________________________________________________________________
> > Need a quick answer? Get one in minutes from
> people who know.
> > Ask your question on www.Answers.yahoo.com
> >
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail:
> java-user-help@lucene.apache.org
> >
> >
> 



 
____________________________________________________________________________________
Do you Yahoo!?
Everyone is raving about the all-new Yahoo! Mail beta.
http://new.mail.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message