jackrabbit-users mailing list archives

From go canal <goca...@yahoo.com>
Subject Re: full text search for CJK languages
Date Tue, 11 Aug 2009 06:50:44 GMT
Hi Thomas,
Thank you very much.

I have added the analyzer, and Excel files are OK now, but I still have problems with my PDF
file. It seems that PDFBox is not able to handle some condition in it, so this is not a
Jackrabbit problem. Here is the error message:

13:45:40,453  WARN PdfTextExtractor:91 - Failed to extract PDF text content
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob$1.call(TextExtractorJob.java:93)
at EDU.oswego.cs.dl.util.concurrent.FutureResult$1.run(Unknown Source)
at org.apache.jackrabbit.core.query.lucene.TextExtractorJob.run(TextExtractorJob.java:172)
at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Thread.java:619)

I have tested some other PDF files and they seem OK: full-text CJK search works for them. So I
suspect that it may be a PDFBox limitation.

The PDF file giving me problems was generated with Distiller 9.0, PDF version 1.5. There is
nothing special about it, or at least nothing I am aware of.


From: Thomas Müller <thomas.mueller@day.com>
To: users@jackrabbit.apache.org
Sent: Monday, August 10, 2009 4:36:25 PM
Subject: Re: full text search for CJK languages


I'm not sure, but I think you need to use

class org.apache.lucene.analysis.cjk.CJKAnalyzer

See http://wiki.apache.org/jackrabbit/Search - parameter analyzer

Can you please verify this is correct? I will then update the documentation.
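
For reference, the analyzer parameter mentioned above would typically sit inside the
workspace's SearchIndex element. The following is a minimal sketch, not a verified
configuration: it assumes the stock Lucene-based SearchIndex class, a typical index path, and
that the Lucene analyzers jar containing CJKAnalyzer is on Jackrabbit's classpath. Any other
existing parameters (such as textFilterClasses) would stay as they are.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Assumed typical index location; keep whatever your workspace already uses -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Replaces the default analyzer with the CJK one suggested in this thread -->
  <param name="analyzer" value="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</SearchIndex>
```

An analyzer change generally applies only to content indexed after the change, so existing
content probably has to be re-indexed before CJK queries match it (for example by letting
Jackrabbit rebuild the workspace index on startup).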


On Sun, Aug 9, 2009 at 4:38 PM, go canal<gocanal@yahoo.com> wrote:
> Just tested:
>  The default configuration supports full-text CJK search for Text, Word and PPT files,
>  but cannot search PDF/Excel files.
>  rgds,
> canal
> ________________________________
> From: go canal <gocanal@yahoo.com>
> To: users@jackrabbit.apache.org
> Sent: Sunday, August 9, 2009 10:20:28 PM
> Subject: full text search for CJK languages
> Hi,
> I could not find detailed info on supporting full text search for 2-byte languages like
> CJK (Chinese, Japanese and Korean).
> 1) Does anybody know if such a library is available? and
> 2) How do I configure this in Jackrabbit? Should I replace all the extractors in the current
>    <SearchIndex .....
>      <param name="textFilterClasses"
>        value="org.apache.jackrabbit.extractor.PlainTextExtractor,
>         org.apache.jackrabbit.extractor.MsWordTextExtractor,
>         org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>         org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>         org.apache.jackrabbit.extractor.PdfTextExtractor,
>         org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>         org.apache.jackrabbit.extractor.RTFTextExtractor,
>         org.apache.jackrabbit.extractor.HTMLTextExtractor,
>         org.apache.jackrabbit.extractor.XMLTextExtractor" />
>    </SearchIndex>
> rgds,
> canal
