jackrabbit-dev mailing list archives

From "Thomas Müller" <thomas.muel...@day.com>
Subject Re: Workspace.copy() Question ...
Date Mon, 17 Nov 2008 09:02:25 GMT
Hi,

> Would you store the DataIdentifier in the index,
> and so in all modules?

Probably not in the Lucene index files themselves. Text extraction could
be used without the Lucene index, for example to display the text
content of a PDF file. The text extraction module could store the
DataIdentifier together with the extracted text (the 'payload'). But it
may actually be better to store the 'payload' (extracted text, virus
scanner flag, thumbnail) near the binary itself, so that it is
automatically deleted when the binary is garbage collected. We would
need to define an API and the behavior for this 'payload storage'. It
probably doesn't need to be transactional, but it needs to be
consistent (for example, verified by a checksum). Some kind of binary
properties file maybe, with put(String key, InputStream payload) and
InputStream get(String key).
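
A minimal sketch of what such a 'payload storage' could look like, assuming
one store per binary and an in-memory map instead of a file next to the
binary. Only put(String key, InputStream payload) and InputStream get(String
key) come from the description above; the names PayloadStore and
InMemoryPayloadStore, the CRC32 checksum and the null return for a missing key
are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical payload storage API; only put/get are taken from the proposal. */
interface PayloadStore {
    void put(String key, InputStream payload) throws IOException;

    /** Returns the payload, or null if nothing is stored under the key (assumption). */
    InputStream get(String key) throws IOException;
}

/** In-memory stand-in; a real store would keep the data next to the binary. */
class InMemoryPayloadStore implements PayloadStore {
    private static final class Entry {
        final byte[] data;
        final long checksum;

        Entry(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    private static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    @Override
    public void put(String key, InputStream payload) throws IOException {
        byte[] data = payload.readAllBytes(); // requires Java 9 or later
        entries.put(key, new Entry(data, crc(data)));
    }

    @Override
    public InputStream get(String key) throws IOException {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        // Not transactional, but a damaged payload is detected on read.
        if (crc(e.data) != e.checksum) {
            throw new IOException("Checksum mismatch for payload '" + key + "'");
        }
        return new ByteArrayInputStream(e.data);
    }
}

class PayloadStoreDemo {
    public static void main(String[] args) throws IOException {
        PayloadStore store = new InMemoryPayloadStore();
        byte[] text = "extracted text".getBytes(StandardCharsets.UTF_8);
        store.put("extractedText", new ByteArrayInputStream(text));
        System.out.println(new String(store.get("extractedText").readAllBytes(),
                StandardCharsets.UTF_8));
    }
}
```

Keeping one such store per binary means the extracted text, the virus scanner
flag and the thumbnail would all disappear together with the binary.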

> But what will you do if you try to copy
> a node internally? The datastore should know that it must not read the binary,
> to prevent an extra read and write to the datastore.

Exactly. The DataStore should also check whether the InputStream is a
DataStoreInputStream; if it is, it may not need to copy the binary:

if (in instanceof DataStoreInputStream) {
    DataIdentifier di = ((DataStoreInputStream) in).getDataIdentifier();
    if (exists(di)) {
        // already exists, no need to copy
        return;
    }
}
... create a new entry as done now ...

The 'instanceof' is not nice, but this way we don't have to change
the API, and a DataStoreInputStream can also be passed anywhere the
JCR API accepts an InputStream (for example Node.setProperty(String
name, InputStream in)).
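
To make the idea concrete, here is a self-contained toy version of that check.
It is only a sketch: DataStoreInputStream is the class proposed above, not an
existing one, and ToyDataStore, the String identifier, the SHA-256 digest and
the bytesCopied counter are all assumptions made so the example can run on its
own. It does not reproduce the real DataStore interface.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stream that remembers which stored binary it reads from. */
class DataStoreInputStream extends FilterInputStream {
    private final String dataIdentifier;

    DataStoreInputStream(InputStream in, String dataIdentifier) {
        super(in);
        this.dataIdentifier = dataIdentifier;
    }

    String getDataIdentifier() {
        return dataIdentifier;
    }
}

/** Toy content-addressed store; the identifier is a hex SHA-256 digest (assumption). */
class ToyDataStore {
    private final Map<String, byte[]> records = new HashMap<>();
    int bytesCopied = 0; // counts bytes actually read and stored

    boolean exists(String id) {
        return records.containsKey(id);
    }

    String addRecord(InputStream in) throws IOException {
        if (in instanceof DataStoreInputStream) {
            String id = ((DataStoreInputStream) in).getDataIdentifier();
            if (exists(id)) {
                return id; // already stored, the stream is never read
            }
        }
        // Unknown origin or unknown identifier: create a new entry as usual.
        byte[] data = in.readAllBytes(); // requires Java 9 or later
        bytesCopied += data.length;
        String id = digest(data);
        records.put(id, data);
        return id;
    }

    InputStream getStream(String id) {
        byte[] data = records.get(id);
        if (data == null) {
            throw new IllegalArgumentException("Unknown identifier: " + id);
        }
        return new DataStoreInputStream(new ByteArrayInputStream(data), id);
    }

    private static String digest(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

class ToyDataStoreDemo {
    public static void main(String[] args) throws IOException {
        ToyDataStore store = new ToyDataStore();
        String id = store.addRecord(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        // An internal copy passes the store's own stream back in.
        String copyId = store.addRecord(store.getStream(id));
        System.out.println(id.equals(copyId) + " " + store.bytesCopied);
    }
}
```

In the demo the second addRecord call returns the same identifier without
reading the stream, which is the extra read and write an internal copy should
avoid.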

> Can you explain this a little bit more? I don't know what
> virus scanning and thumbnails have to do with that problem.

So far we were talking about text extraction: text extraction should
process each distinct binary only once. But the virus scanner should
also process each distinct binary only once, and a thumbnail only
needs to be created once for each distinct image. Currently there is no
way for the virus scanner to detect that it has already processed a
binary. However, if we return a DataStoreInputStream, the virus scanner
module could check whether it has already scanned the binary:

class VirusScanner {
    public void scan(InputStream in) throws VirusFoundException {
        if (in instanceof DataStoreInputStream) {
            DataIdentifier di = ((DataStoreInputStream) in).getDataIdentifier();
            if (hasScanned(di)) {
                // already scanned, no need to scan again
                return;
            } else {
                doScan(in);
                addScanned(di);
            }
        } else {
            // origin unknown, so always scan
            doScan(in);
        }
    }
}

The code for creating thumbnails would look similar.
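
Since the scanner and the thumbnail code share the same shape, the pattern
could be factored out. The following is only a sketch under assumptions:
OncePerBinary is a hypothetical helper, the identifier is a plain String, the
results are cached in memory rather than next to the binary, and
DataStoreInputStream is again the proposed class, reduced to the minimum
needed to run.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stream that remembers which stored binary it reads from. */
class DataStoreInputStream extends FilterInputStream {
    private final String dataIdentifier;

    DataStoreInputStream(InputStream in, String dataIdentifier) {
        super(in);
        this.dataIdentifier = dataIdentifier;
    }

    String getDataIdentifier() {
        return dataIdentifier;
    }
}

/** Runs an expensive step once per distinct binary and remembers the result. */
class OncePerBinary<R> {
    private final Map<String, R> results = new HashMap<>();
    private final Function<InputStream, R> processor;

    OncePerBinary(Function<InputStream, R> processor) {
        this.processor = processor;
    }

    R process(InputStream in) {
        if (in instanceof DataStoreInputStream) {
            String id = ((DataStoreInputStream) in).getDataIdentifier();
            if (results.containsKey(id)) {
                return results.get(id); // already processed this binary
            }
            R result = processor.apply(in);
            results.put(id, result);
            return result;
        }
        // Origin unknown: nothing to key the result on, so always process.
        return processor.apply(in);
    }
}

class OncePerBinaryDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        OncePerBinary<String> thumbnails = new OncePerBinary<>(in -> {
            calls[0]++;
            return "thumbnail";
        });
        byte[] image = {1, 2, 3};
        thumbnails.process(new DataStoreInputStream(new ByteArrayInputStream(image), "id-1"));
        thumbnails.process(new DataStoreInputStream(new ByteArrayInputStream(image), "id-1"));
        System.out.println("thumbnail created " + calls[0] + " time(s)");
    }
}
```

A plain InputStream still works with this helper, it just gets processed on
every call, the same as the else branch of the scanner.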

Regards,
Thomas
