lucene-solr-dev mailing list archives

From "Shalin Shekhar Mangar" <shalinman...@gmail.com>
Subject Re: DataImportHandler and Blobs
Date Wed, 12 Nov 2008 17:56:43 GMT
On Wed, Nov 12, 2008 at 10:44 PM, Grant Ingersoll <gsingers@apache.org> wrote:

> Am I understanding the DIH correctly in that it doesn't work with Blobs
> and/or binary things? I'm basing this off of JdbcDataSource.getARow(), which
> seems to be the place that populates the Map that is then passed to the
> Transformer.


Actually, that switch statement in JdbcDataSource is redundant now. In our
initial patches, the "field" in data-config had a type attribute, and we used
to attempt type conversion from the SQL type to the field's given type. We
found that this was error-prone and switched to calling ResultSet#getObject
for all columns. The old behavior is still available as a configurable
option, "convertType", in JdbcDataSource.

The default is to use ResultSet#getObject which should handle BLOBs and
CLOBs well.
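
As a rough illustration of that default, here is a minimal sketch of reading
every column of the current row through ResultSet#getObject, with no per-type
switch. It is not the actual JdbcDataSource code; the class name
RowReaderSketch and the method readRow are made up for this example, and only
standard java.sql calls are used.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or the DataImportHandler.
class RowReaderSketch {

    // Copies the current row of the ResultSet into a column-label -> value
    // map. Every column goes through getObject, so the JDBC driver decides
    // the Java type (for binary columns this is typically a java.sql.Blob
    // or a byte[], depending on the driver).
    static Map<String, Object> readRow(ResultSet rs) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        Map<String, Object> row = new LinkedHashMap<>();
        // JDBC column indexes are 1-based.
        for (int i = 1; i <= meta.getColumnCount(); i++) {
            row.put(meta.getColumnLabel(i), rs.getObject(i));
        }
        return row;
    }
}
```

A map built this way carries whatever object the driver returns for a binary
column, so any later stage has to be prepared for driver-specific types.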


> One of the things that I think might be interesting is, as I'm integrating
> Tika, the notion of a Transformer that takes a blob and feeds it to Tika for
> parsing.  In this way, people that store documents in databases (or download
> PDFs, etc.) can use the DIH to bring in more kinds of content.
>
> Thoughts?
>

I think the best way would be a TikaEntityProcessor which knows how to
handle documents. I guess a typical use case would be
FileListEntityProcessor->TikaEntityProcessor as parent-child entities.

Also see SOLR-833, which adds a FieldReaderDataSource that lets you pass any
field's content to an entity for processing. So you can have a
[SqlEntityProcessor, JdbcDataSource] pair producing a blob and a
[FieldReaderDataSource, TikaEntityProcessor] pair consuming it.
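
The hand-off between those two stages amounts to exposing the blob column as a
stream that a document parser can read. The following is only a sketch of that
general idea under stated assumptions, not DIH or Tika code: the class
BlobStreamSketch and the method asStream are hypothetical names, and it
assumes the column value arrives as either a java.sql.Blob or a byte[], which
varies by JDBC driver.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;

// Hypothetical helper, not part of Solr, the DataImportHandler, or Tika.
class BlobStreamSketch {

    // Turns a binary column value (as returned by ResultSet#getObject) into
    // an InputStream that a downstream parser could consume.
    static InputStream asStream(Object columnValue) throws SQLException {
        if (columnValue instanceof Blob) {
            // Standard JDBC call: streams the blob's bytes.
            return ((Blob) columnValue).getBinaryStream();
        }
        if (columnValue instanceof byte[]) {
            // Some drivers hand back the raw bytes instead of a Blob.
            return new ByteArrayInputStream((byte[]) columnValue);
        }
        throw new IllegalArgumentException(
                "Not a binary column value: " + columnValue);
    }
}
```

Any other value type, including null, is rejected with an
IllegalArgumentException in this sketch; a real consumer would need to decide
how to treat missing documents.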

I think such an integration will be very interesting. Let me know if you
need a hand; I'm willing to contribute in whatever way possible.

-- 
Regards,
Shalin Shekhar Mangar.
