jmeter-dev mailing list archives

From Milamber <milam...@apache.org>
Subject Re: Add Apache Tika in JMeter to extract text from various file type
Date Mon, 05 Nov 2012 14:00:13 GMT


On 05/11/2012 11:26, sebb wrote:
> On 3 November 2012 19:23, Milamber <milamber@apache.org> wrote:
>> Hello,
>>
>> I am currently working on adding Apache Tika 1.2 [1] to JMeter to improve
>> functional tests.
>>
>> With Tika, you can extract the text from various documents, such as MS
>> Office (Word, Excel, PowerPoint 97-2003 and 2007-2010 (OpenXML)),
>> OpenOffice (Writer, Calc, Impress), HTML, gz, jar/zip files (list of
>> contents), and some "multimedia" files such as mp3, mp4, flv, etc.
>>
>> In JMeter, Tika can be used by the View Results Tree to view the text data
>> of these files, by the Regular Expression Extractor to capture text from
>> these files, and by the Response Assertion to assert on the data.
>>
>> The drawback is that Apache Tika requires a big jar (25 MB) or a lot of jar
>> files (see below). With all jars in the binary package, the new size (for
>> the tgz) is 45 MB (JMeter 2.8 tgz: 23 MB).
>>
>> The question: do you agree to add Tika (and the new capability to "extract
>> text from Document") to JMeter, given the new binary size?
>>
>> Secondary question: which is the better way? 1/ Add only tika-app.jar (which
>> includes all dependencies) [2], or 2/ add several jar files (tika-core,
>> tika-parsers, etc. + dependencies) [3]
> I'm concerned that using Tika would double the size of JMeter.
> Although the extra features would be useful, I suspect that most test
> cases won't need the extra functionality.
>
> Would it be possible to make the Tika jars optional?
> i.e. add the functionality, but if the jars are not present it is disabled.

Yes, that seems possible via a dynamic class check / loading.
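
As a rough illustration of such a check (only a sketch, not the actual patch):
the helper name OptionalFeature is hypothetical, and probing for
org.apache.tika.Tika (the facade class in tika-core) is just one possible
choice of marker class.

```java
public final class OptionalFeature {
    private OptionalFeature() {}

    /** Returns true if the named class can be located by this class's loader. */
    public static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, OptionalFeature.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class (or one of its dependencies) is missing: treat the feature as disabled
            return false;
        }
    }
}
```

A caller could then enable the "Document" features only when
OptionalFeature.isClassPresent("org.apache.tika.Tika") returns true.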


>
> If we accept that developers must download Tika, then it should be
> easy enough to structure the add-on so that JMeter can fail gracefully
> if the jars are missing.
> But ideally developers would not need to download all the jars either.

Currently, compiling the "tika" elements requires only these jars:
tika-core.jar
tika-parsers.jar

For the binary release, we would need to add these jars (full list):
apache-mime4j-core.jar
apache-mime4j-dom.jar
asm.jar
aspectjrt.jar
boilerpipe.jar
commons-compress.jar
dom4j.jar
fontbox.jar
geronimo-stax-api_1.0_spec.jar
gson.jar
isoparser.jar
jempbox.jar
juniversalchardet.jar
log4j.jar
metadata-extractor.jar
netcdf.jar
pdfbox.jar
poi-ooxml-schemas.jar
poi-ooxml.jar
poi-scratchpad.jar
poi.jar
rome.jar
slf4j-api.jar
slf4j-log4j12.jar
tagsoup.jar
tika-core.jar
tika-parsers.jar
tika-xmp.jar
vorbis-java-core.jar
vorbis-java-tika.jar
xmlbeans.jar
xmpcore.jar
xz.jar

Or only tika-app.jar (25 MB).


So, we can add the "tika" functionality with dynamic class loading, and
add some warning messages telling users to download tika-app.jar if
they want the Tika behavior.

For the View Results Tree, when "Document" is chosen in the combo list: a
message in the Response data indicating that tika-app.jar is missing (with
some indication of where to download it).

For the Regular Expression Extractor and the Response Assertion, if
tika-app.jar is missing, a warning dialog shows the message when the
"Response as a Document" radio button is selected.

And in all cases, a warning message in jmeter.log.
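
The fallback described above could look roughly like this. It is only a
sketch under assumptions: the class DocumentTextFallback, the Extractor
interface, and the message wording are all hypothetical, and
java.util.logging stands in for whatever logger actually writes jmeter.log.

```java
import java.util.logging.Logger;

public final class DocumentTextFallback {
    private static final Logger LOG =
            Logger.getLogger(DocumentTextFallback.class.getName());

    // Hypothetical wording; the real message would also say where to download the jar.
    public static final String MISSING_MSG =
            "Document text extraction is unavailable: tika-app.jar was not found in the classpath.";

    /** Hypothetical hook standing in for the Tika-backed extractor. */
    public interface Extractor {
        String extract(byte[] data) throws Exception;
    }

    /** Returns the extracted text, or a readable message when Tika is missing or fails. */
    public static String textOrWarning(boolean tikaPresent, Extractor extractor, byte[] data) {
        if (!tikaPresent) {
            LOG.warning(MISSING_MSG);   // always leave a trace in the log
            return MISSING_MSG;         // shown to the user in place of the document text
        }
        try {
            return extractor.extract(data);
        } catch (Exception e) {
            String msg = "Document text extraction failed: " + e.getMessage();
            LOG.warning(msg);
            return msg;
        }
    }
}
```

The point of returning the message instead of throwing is that the GUI
element keeps working and the test plan still runs without the jar.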




>

