lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robichaud, Jean-Philippe" <Jean-Philippe.Robich...@nuance.com>
Subject RE: "Catalog" backend for document stored fields?
Date Tue, 14 Nov 2006 20:46:59 GMT
[sorry for the long delay for my answer, we are having some issues with our
mail server...]

Thanks for your comment.  Yes it would make sense if the log files were not
so big.  In fact, I'm only indexing a subset of the log information.
Because I store the information in Lucene, it is easier and faster to
retrieve the information.  

For example, generating a report by reading the logs themselves takes ~18
hours.  Converting to xml and indexing the logs takes ~12 hours and each
report takes <20 minutes to generate.  Since we often generate (different)
reports, the Lucene approach is way faster. [Note that I wrote both classes
to convert to xml and to index the logs.  The log2xml step is quite useful
because the xml really has all the log information and compressing the xml
with xmlppm reduces the size of the logs by 97%.  This way I can archive the
log.xml without wasting much space.]

As another comparison point: the logs takes 100Gig per week while the
indices 35Gig per month!  This design is already more optimal than the first
approach.  But I'm trying to make it better.

I really do think that this dictionary/catalog approach could benefit others
Lucene users.  I'm not against the idea of doing it myself.  I just need
some pointers and guidelines for all the gurus out there!

Thanks for all you help!

Jp
-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com] 
Sent: Tuesday, October 24, 2006 1:50 AM
To: java-user@lucene.apache.org
Subject: Re: "Catalog" backend for document stored fields?

> I'm indexing logs from a transaction-based application.
> ...
> millions documents per month, the size of the indices is ~35 gigs per
month
> (that's the lower bound).  I have no choice but to 'store' each field
values
> (as well as indexing/tokenizing them) because I'll need to retrieve them
in
> order to create various reports.  Also, I have a backlog of ~2 years of
logs
> to index!
> ...
> 1-       is there someone out there that already wrote an extension to
> Lucene so that 'stored' string for each document/field is in fact stored
in
> a centralized repository? Meaning, only an 'index' is actually stored in
the
> document and the real data is put somewhere else.

Do you gain anything from storing the document fields within Lucene?  In
case not, especially if log files are kept somewhere, you cuold make all
'content' fields unstored (reduce index size), and add a stored non-indexed
ID field. It can also be a POINTER field - e.g. <log file name + start
offset + length>.  At search time, for found documents you can retrieve
this ID/POINTER field and then fetch the document from the (original) log
file. Makes sense?


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message