lucene-dev mailing list archives

From <>
Subject RE: Solrj/Tika question about content types
Date Wed, 13 Feb 2013 20:04:18 GMT
Wow, Hoss, this post was so long ago I barely remember writing it. ;-)

The problem we were having is not that SolrJ fails to set the content type - it's that SolrCell
no longer discovers it as it did when we used multipart posts and ran with Solr 3.6.  We still
aren't sure which change broke the Tika content-type-discovery functionality, or whether that
change is in Tika or in Solr, but we did set the content type on the content stream from the
source, where possible, and that helped enormously.

The specific test case we had was an SJIS text file, which in Solr 3.6 is properly discovered
to be SJIS, while in Solr 4.1 it is only discovered to be SJIS if we set a content type other
than application/octet-stream.
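
The workaround described above (set a content type on the stream whenever the source knows
it) might be sketched roughly as follows. The ContentTypeHint class and its forSource method
are hypothetical names for illustration, not part of SolrJ or Tika; the helper only builds the
string that would then be handed to the content stream (for example through
ContentStreamBase.setContentType), and it falls back to application/octet-stream when nothing
is known.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of SolrJ or Tika. */
final class ContentTypeHint {
    /** Generic type used when the source knows nothing about the document. */
    static final String UNKNOWN = "application/octet-stream";

    /**
     * Builds a Content-Type value from what the source knows.
     * Either argument may be null or blank when that piece is unknown.
     */
    static String forSource(String mimeType, String charset) {
        if (mimeType == null || mimeType.trim().isEmpty()) {
            return UNKNOWN;
        }
        String type = mimeType.trim().toLowerCase(Locale.ROOT);
        if (charset == null || charset.trim().isEmpty()) {
            return type;
        }
        return type + "; charset=" + charset.trim();
    }

    public static void main(String[] args) {
        System.out.println(forSource("text/plain", "Shift_JIS"));
        System.out.println(forSource("text/plain", null));
        System.out.println(forSource(null, null));
    }
}
```

The charset is only appended when the source actually supplies one, so the sketch never
guesses an encoding on the document's behalf.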


-----Original Message-----
From: ext Chris Hostetter
Sent: Wednesday, February 13, 2013 2:53 PM
Subject: RE: Solrj/Tika question about content types

: questions still apply: since Tika apparently cares deeply about
: content-type now, what content-type can I supply through SolrJ to tell
: it 'please discover the document type on your own'?  And how do I do
: that through SolrJ?

SolrJ sets the Content-Type header based on what is returned by the "getContentType()" of the
ContentStream -- the default behavior is "application/octet-stream" if getContentType() returns

: (1) Does the getContentType() method actually even get used on Solrj?  
: When I looked at wire logging, it seemed that Solrj just posts a generic
: "application/xml; charset=UTF-8" content type, and does not transmit
: anything else.  It uses standard POST, not multipart/form POST, also.

Even in the case of a single ContentStream (so no multi-part) it still uses ContentStream.getContentType()
... can you provide a test case (or quick and dirty sample code) that demonstrates what you
are seeing with "application/xml; charset=UTF-8" getting sent over the wire even though you
explicitly provide a different content-type in the ContentStream?
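
One quick and dirty way to check what actually goes over the wire, as an alternative to wire
logging, is to post to a throwaway local HTTP server that records the Content-Type header it
receives. The sketch below uses only the JDK; the ContentTypeCapture class and roundTrip method
are hypothetical names, and HttpURLConnection stands in for the real client. To test SolrJ
itself, one would presumably point the SolrJ client at the capture server's port instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical test aid for illustration; not part of SolrJ. */
final class ContentTypeCapture {
    /** POSTs body with the given Content-Type and returns the header the server saw. */
    static String roundTrip(String contentType, byte[] body) {
        AtomicReference<String> seen = new AtomicReference<>();
        try {
            // Port 0 asks the OS for any free port on the loopback interface.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                seen.set(exchange.getRequestHeaders().getFirst("Content-Type"));
                InputStream in = exchange.getRequestBody();
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) { /* drain the request body */ }
                exchange.sendResponseHeaders(200, -1); // 200 with no response body
                exchange.close();
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                // The path is illustrative; the capture server accepts any path.
                HttpURLConnection conn = (HttpURLConnection) URI
                        .create("http://127.0.0.1:" + port + "/update/extract")
                        .toURL().openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type", contentType);
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body);
                }
                conn.getResponseCode(); // wait for the server to answer
                conn.disconnect();
                return seen.get();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(roundTrip("text/plain; charset=Shift_JIS", body));
    }
}
```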

