lucene-solr-user mailing list archives

From Gary Taylor
Subject Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Date Tue, 25 Jan 2011 15:32:36 GMT
Thanks Erlend.

I've not used SVN before, but I have managed to download and build the latest trunk.

Now I'm getting an error when trying to access the admin page (via
Jetty) because I specify HTMLStripStandardTokenizerFactory in my
schema.xml, but this class appears to be no longer supplied as part of
the build, so I get an exception because it can't be found. I've checked
CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags,
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone
have an example fieldType stanza (for schema.xml) for stripping out HTML?
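
Going by the change note above, such a stanza would presumably look
something like the sketch below, with the char filter declared ahead of
the tokenizer inside the analyzer. The type name text_html and the choice
of tokenizer and lowercase filter are assumptions for illustration only,
not taken from a known working configuration.

```xml
<!-- Sketch only: "text_html" is a hypothetical type name, and the
     tokenizer and filter choices are assumptions. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before tokenization,
         so the HTML markup is stripped first. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Any tokenizer can follow; StandardTokenizerFactory stands in
         for the deprecated HTMLStripStandardTokenizerFactory. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field would then reference it with type="text_html" in its field
definition.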

Thanks and kind regards,

On 25/01/2011 14:17, Erlend Garåsen wrote:
> On 25.01.11 11.30, Erlend Garåsen wrote:
>> Tika version 0.8 is not included in the latest release/trunk from SVN.
> Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
> And to clarify, by "content" I mean the main content of a Word file. 
> Title and other kinds of metadata are successfully extracted by the 
> old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
> order to fetch the main content as well. So try the newest Solr 
> version from trunk.
> Erlend
