lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <>
Subject [jira] [Updated] (SOLR-6475) SOLR-5517 broke the ExtractingRequestHandler / Tika content-type detection.
Date Tue, 14 Oct 2014 22:30:34 GMT


Steve Rowe updated SOLR-6475:
    Labels: Content-Type Tika difficulty-medium impact-medium  (was: Content-Type Tika)

> SOLR-5517 broke the ExtractingRequestHandler / Tika content-type detection.
> ---------------------------------------------------------------------------
>                 Key: SOLR-6475
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 4.7
>            Reporter: Dominik Geelen
>              Labels: Content-Type, Tika, difficulty-medium, impact-medium
> Hi,
> As discussed with "hoss" on IRC, I'm creating this issue about a problem we recently ran into:
> Our company uses Solr to index user-generated files (PDFs, etc.) for full-text search via the ExtractingRequestHandler / Tika.
> Since we upgraded to Solr 4.9, the indexing process throws the following exception: "Must specify a Content-Type header with POST requests" (in solr/servlet/, line 684 in the 4.9 source).
> This behavior was introduced with SOLR-5517, but as the Solr wiki states, Tika needs the Content-Type to be empty or absent to trigger auto-detection of the content / MIME type.
> Both behaviors are basically correct on their own but block each other, so "hoss" suggested that Tika should be fixed to also trigger auto-detection on the content type "application/octet-stream", and I strongly agree with this proposal.
> *Test case:*
> Just use the example from the ExtractingRequestHandler wiki page:
> {noformat}
> curl "http://localhost:8983/solr/update/extract?" --data-binary @tutorial.html [-H 'Content-type:text/html']
> {noformat}
> but omit the Content-Type header. Alternatively, you could use the "SimplePostTool (post.jar)" mentioned in the wiki, though I guess this is broken now, too.
> *Proposed solution:*
> Fix the Tika content-type guessing so that it also triggers auto-detection on the content type "application/octet-stream".
> Thanks,
> Dominik
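
The proposed rule in the issue above can be illustrated with a minimal Python sketch. This is not Tika's or Solr's actual code; the function name `should_autodetect` and the `GENERIC_TYPES` set are hypothetical and exist only to show the decision being asked for: treat a missing, empty, or "application/octet-stream" Content-Type as "no usable type information" and fall back to auto-detection.

```python
from typing import Optional

# Hypothetical: media types that say nothing about the real format.
GENERIC_TYPES = {"application/octet-stream"}


def should_autodetect(content_type: Optional[str]) -> bool:
    """Return True when the declared Content-Type gives no usable hint.

    Assumption for illustration: a missing header, an empty value, or a
    generic type such as application/octet-stream should all trigger
    content-based detection instead of being trusted as-is.
    """
    if content_type is None:
        return True
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "" or media_type in GENERIC_TYPES
```

Under this rule a client could satisfy the "Must specify a Content-Type header" check by sending "application/octet-stream" while still getting auto-detection, whereas an explicit type such as "text/html" would continue to be used as given.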

This message was sent by Atlassian JIRA
