lucene-dev mailing list archives

From Gregor Heinrich <gre...@arbylon.net>
Subject Re: lucene.index.*: extending Lucene to store topic model data ?
Date Fri, 19 Nov 2010 08:41:27 GMT
Hi Uwe -- thanks for this great hint. Is it considered stable enough to handle
corpora of 100 MB or more of raw text?

ps -- sorry for staying cryptic about the actual application. I tried to 
abstract its relation to Lucene... Basically it's about automatically 
associating queries and documents with groups of related terms (topics) and thus 
improving recall. I wrote an introductory note about this stuff that may give an 
overview and cites much of the original literature: 
http://www.arbylon.net/publications/text-est2.pdf .
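
A minimal sketch of the idea above, assuming each query and document has
already been mapped to a vector of topic weights by some topic model. It
compares a query with documents in topic space rather than by exact term
overlap, which is where the recall gain would come from. The class and method
names (TopicMatchSketch, cosine, bestMatch) are hypothetical and not part of
Lucene, and cosine similarity is only one possible choice of measure.

```java
/** Hypothetical sketch: compare queries and documents by topic-weight vectors. */
final class TopicMatchSketch {

    /** Cosine similarity between two topic-weight vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("topic vectors differ in length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // An all-zero vector carries no topic information; treat it as no match.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the index of the document closest to the query in topic space, or -1 if there are none. */
    static int bestMatch(float[] queryTopics, float[][] docTopics) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int doc = 0; doc < docTopics.length; doc++) {
            double score = cosine(queryTopics, docTopics[doc]);
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }
}
```

In practice such a topic score would probably be combined with the usual
term-based score rather than replace it.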

All the best

gregor


On 11/19/10 9:07 AM, Uwe Schindler wrote:
> Hi Gregor,
>
> I do not come from your area, so I don't understand everything you are
> writing about, but from what you write, it looks like you are interested in
> the new flexible indexing coming with Lucene 4.0 (aka Lucene trunk).
> Currently, flexible indexing only allows modifying the term dictionary and
> posting lists (the 4-dim Enum API in Lucene), but in the future we will also
> allow modifying the index format of stored fields/term vectors. We already
> have the first patches that allow per-field/document statistics for BM25
> scoring.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Gregor Heinrich [mailto:gregor@arbylon.net]
>> Sent: Friday, November 19, 2010 8:50 AM
>> To: dev@lucene.apache.org
>> Subject: lucene.index.*: extending Lucene to store topic model data ?
>>
>> Dear list -- a question on potential storage of data originating from
>> "topic models" like LSA (latent semantic analysis) and LDA (latent
>> Dirichlet allocation). Packages like Mahout or SemanticVectors allow
>> extraction of latent topics from an existing Lucene corpus. They don't
>> have the storage of the actual latent concepts integrated into Lucene's
>> efficient backend. So storing those data within Lucene's segments may be
>> a benefit for them.
>>
>> My question: In the IndexWriter backend, is there any reasonable way you
>> can think of to store extra information after segments have been created
>> but before a commit() ? (This way any IndexSearcher/Reader always sees a
>> consistent index.) Further, after the optimize() step, another
>> modification of the extra information in index should be possible.
>>
>> Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from
>> the information in the index and stores topic related data with the
>> segments currently active for indexing, but in extra files. The extra
>> files contain document-specific topic float vectors as well as
>> segment-global float vectors. During commit(), the extra files are merged
>> with the segments (which involves some math processing again). At the end
>> of the indexing process, the LDA algorithm is rerun, improving the topic
>> model globally, thus again modifying the extra files.
>>
>> What may be a point of departure? Adding a modified TermVector-like
>> storage approach and hooking it to extended Segment* classes?
>>
>> Best regards
>>
>> gregor
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
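
The "extra files" in the example scenario of the original message
(document-specific topic float vectors stored next to a segment) could be laid
out roughly as below. This is only a sketch under stated assumptions: the
layout (document count, topic count, then one row of floats per document) and
the name TopicSidecarSketch are invented for illustration, and the code uses
plain java.io streams instead of any Lucene Directory or Segment* class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sidecar format: [numDocs][numTopics] followed by numDocs * numTopics floats. */
final class TopicSidecarSketch {

    /** Serializes one topic vector per document into a byte array. */
    static byte[] write(float[][] docTopics) {
        int numTopics = docTopics.length == 0 ? 0 : docTopics[0].length;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(docTopics.length);
            out.writeInt(numTopics);
            for (float[] row : docTopics) {
                if (row.length != numTopics) {
                    throw new IllegalArgumentException("all documents need the same topic count");
                }
                for (float weight : row) {
                    out.writeFloat(weight);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back the per-document topic vectors written by write(). */
    static float[][] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int numDocs = in.readInt();
            int numTopics = in.readInt();
            float[][] docTopics = new float[numDocs][numTopics];
            for (int doc = 0; doc < numDocs; doc++) {
                for (int topic = 0; topic < numTopics; topic++) {
                    docTopics[doc][topic] = in.readFloat();
                }
            }
            return docTopics;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A fixed row width lets a reader seek straight to the vector of a given
document number; merging such files during a segment merge and storing the
segment-global vectors are left out of this sketch.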
