jackrabbit-dev mailing list archives

From KÖLL Claus <C.KO...@TIROL.GV.AT>
Subject AW: Workspace.copy() Question ...
Date Tue, 18 Nov 2008 12:09:30 GMT
hi guys,

for my understanding ...

> Probably not in the Lucene index files itself. Text extraction could be used without using
> the Lucene index, for example to display the text content of a PDF file. The text extraction
> module could store the DataIdentifier together with the extracted text ('payload'). The advantage
> to store this 'payload' near the actual binary is that the data is deleted when the binary
> is garbage collected. So maybe it actually is better to store the 'payload' (extracted text,
> virus scanner flag, thumbnail) near the binary, so it is automatically garbage collected when
> the binary is garbage collected. We would need to define an API and the behavior for this
> 'payload storage'. It probably doesn't need to be transactional, but it needs to be consistent
> (a checksum). Some kind of binary properties file maybe, with put(String key, InputStream
> payload), and InputStream get(String key).
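
For reference, the put/get idea from the quoted paragraph could be sketched roughly as below. This is only a sketch under assumptions: PayloadStore and InMemoryPayloadStore are hypothetical names and not part of the Jackrabbit API, the storage is a plain in-memory map standing in for files kept near the binary, and CRC32 is just one possible choice for the consistency checksum.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.CRC32;

// Hypothetical interface, mirroring the put/get signatures from the quoted text.
interface PayloadStore {
    void put(String key, InputStream payload) throws IOException;
    InputStream get(String key) throws IOException; // returns null if the key is unknown
}

// In-memory stand-in for a store that would live next to the binary.
class InMemoryPayloadStore implements PayloadStore {
    private static final class Entry {
        final byte[] data;
        final long checksum;
        Entry(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    private static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    @Override
    public void put(String key, InputStream payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = payload.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        byte[] data = out.toByteArray();
        // Store the checksum with the payload so corruption can be detected on read.
        entries.put(key, new Entry(data, crc(data)));
    }

    @Override
    public InputStream get(String key) throws IOException {
        Entry entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (crc(entry.data) != entry.checksum) {
            throw new IOException("Checksum mismatch for key " + key);
        }
        return new ByteArrayInputStream(entry.data);
    }

    // Assumed hook: called when the garbage collector removes the binary itself.
    void remove(String key) {
        entries.remove(key);
    }
}
```

A key could for example be built from the DataIdentifier plus a suffix per module (such as "/text" or "/thumbnail"), so that removing the binary can also remove everything stored near it.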

... thomas, you are talking about a text extraction module, but i cannot quite follow you.
As far as i understand, you want to change the architecture towards modules
such as a text extractor, virus scanner or thumbnail builder, so that they can store
the dataIdentifier together with their result in the datastore?
And if the gc deletes an entry in the datastore based on its dataIdentifier, the
"near" information will also be deleted automatically?

as you wrote, the main problem is that we do not know whether we have already processed a binary.
it would be fine if we internally created a dataIdentifier for a binary stream and passed it
to modules such as the text extractor, so they can look up a result that
was already processed and stored.
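
To make the idea concrete, here is a rough sketch of such a lookup. It assumes the identifier is a content hash of the binary (SHA-1 here), so two streams with the same bytes get the same identifier. ProcessedResultCache, identifierOf and extractOnce are hypothetical names for illustration only, not existing Jackrabbit classes or methods.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: remembers results per content identifier.
class ProcessedResultCache {
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final AtomicInteger extractions = new AtomicInteger();

    // Assumption: the identifier is the hex-encoded SHA-1 digest of the content.
    static String identifierOf(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Runs the extractor only if no result is stored for this content yet.
    String extractOnce(byte[] content, Function<byte[], String> extractor) {
        String id = identifierOf(content);
        return results.computeIfAbsent(id, key -> {
            extractions.incrementAndGet();
            return extractor.apply(content);
        });
    }

    // Number of times an extractor was actually invoked.
    int extractionCount() {
        return extractions.get();
    }
}
```

With this, a second copy of the same binary (for example after Workspace.copy()) would find the stored result under the same identifier instead of being processed again.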

is that also what you have in mind? ;-)
