lucene-dev mailing list archives

From Gregor Heinrich <gre...@arbylon.net>
Subject lucene.index.*: extending Lucene to store topic model data ?
Date Fri, 19 Nov 2010 07:49:35 GMT
Dear list -- a question on the potential storage of data originating from "topic 
models" like LSA (latent semantic analysis) and LDA (latent Dirichlet allocation). 
Packages like Mahout or SemanticVectors allow extraction of latent topics from an 
existing Lucene corpus, but they do not integrate storage of the resulting latent 
concepts into Lucene's efficient backend. Storing that data within Lucene's segments 
could therefore be a benefit for them.

My question: In the IndexWriter backend, is there any reasonable way you can think 
of to store extra information after segments have been created but before a 
commit()? (This way any IndexSearcher/Reader always sees a consistent index.) 
Further, after the optimize() step, another modification of the extra information 
in the index should be possible.
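
To make the hook point concrete, here is a minimal sketch of what I picture 
(runLdaAndWriteSidecarFiles() is a made-up placeholder for the extension I am 
asking about, not an existing API):

    // ... an IndexWriter "writer" has been fed documents ...
    writer.prepareCommit();   // new segments exist on disk, but readers cannot see them yet
    // hypothetical hook: run LDA over the pending segments and write extra files
    // into the same Directory, so the next commit exposes segments and topic data together
    runLdaAndWriteSidecarFiles(writer.getDirectory());
    writer.commit();          // from here on, any reader sees a consistent index plus topic files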

Example scenario: An IndexWriter.preCommit() hook starts the LDA algorithm from the 
information in the index and stores topic-related data alongside the segments 
currently active for indexing, but in extra files. The extra files contain 
document-specific topic float vectors as well as segment-global float vectors. 
During commit(), the extra files are merged with the segments (which again involves 
some math processing). At the end of the indexing process, the LDA algorithm is 
rerun, improving the topic model globally and thus modifying the extra files once 
more.
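
For the per-segment extra files I picture something as simple as a fixed-width 
layout of the document-topic matrix, written through Lucene's own Directory 
abstraction. A sketch (the ".ltm" extension and the class/method names are invented 
for illustration; IndexOutput has no writeFloat(), hence floatToIntBits()):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexOutput;

    /** Sketch: stores one segment's document-topic matrix (maxDoc x numTopics floats). */
    public class TopicSidecarWriter {
        public static void writeDocTopics(Directory dir, String segmentName, float[][] docTopics)
                throws IOException {
            // ".ltm" ("latent topic model") is an invented extension, just for illustration
            IndexOutput out = dir.createOutput(segmentName + ".ltm");
            try {
                for (float[] topics : docTopics) {
                    for (float t : topics) {
                        out.writeInt(Float.floatToIntBits(t));  // IndexOutput has no writeFloat()
                    }
                }
            } finally {
                out.close();
            }
        }
    }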

What might be a good point of departure? Adding a modified TermVector-like storage 
approach and hooking it into extended Segment* classes?
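
On the read side the counterpart would be equally simple; this is roughly where I 
imagine hooking into extended Segment*/reader classes (again, the file name and 
layout are just the invented ones from above):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;

    /** Sketch: reads one document's topic vector back from the sidecar file written above. */
    public class TopicSidecarReader {
        public static float[] readDocTopics(Directory dir, String segmentName,
                                            int docId, int numTopics) throws IOException {
            IndexInput in = dir.openInput(segmentName + ".ltm");
            try {
                in.seek((long) docId * numTopics * 4L);  // fixed-width layout: numTopics floats per doc
                float[] topics = new float[numTopics];
                for (int i = 0; i < numTopics; i++) {
                    topics[i] = Float.intBitsToFloat(in.readInt());
                }
                return topics;
            } finally {
                in.close();
            }
        }
    }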

Best regards

gregor




