manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1
Date Mon, 04 Apr 2011 15:01:34 GMT

After downloading the following jars and replacing the older 1.3.1 versions
with them, I no longer have the character encoding problem:
pdfbox-1.5.0.jar
fontbox-1.5.0.jar
jempbox-1.5.0.jar

Erlend
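
One way to confirm that the replacement took effect is to list any older
copies of these three libraries that remain on the classpath. The following
is a minimal Python sketch, not part of the original message. The function
names (parse_jar, outdated_jars, scan_directory) are hypothetical, and the
sketch assumes jar files are named <library>-<version>.jar, as in the lists
in this thread. Which directory to scan depends on your Solr installation.

```python
import re
from pathlib import Path

# Libraries named in the message above, with the version that resolved the problem.
REQUIRED = {"pdfbox": (1, 5, 0), "fontbox": (1, 5, 0), "jempbox": (1, 5, 0)}

# Assumption: simple names of the form <library>-<dotted version>.jar.
JAR_PATTERN = re.compile(r"^(?P<name>[a-z]+)-(?P<version>\d+(?:\.\d+)*)\.jar$")


def parse_jar(filename):
    """Split 'pdfbox-1.3.1.jar' into ('pdfbox', (1, 3, 1)); return None if no match."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version


def outdated_jars(filenames, required=REQUIRED):
    """Return the file names whose library is in `required` but is an older version."""
    stale = []
    for filename in filenames:
        parsed = parse_jar(filename)
        if parsed is None:
            continue  # Not one of the simple <library>-<version>.jar names.
        name, version = parsed
        if name in required and version < required[name]:
            stale.append(filename)
    return sorted(stale)


def scan_directory(lib_dir):
    """Check every .jar file in a directory supplied by the caller."""
    return outdated_jars(path.name for path in Path(lib_dir).glob("*.jar"))
```

For example, outdated_jars(["pdfbox-1.3.1.jar", "fontbox-1.5.0.jar"]) returns
["pdfbox-1.3.1.jar"], because only the pdfbox jar is older than 1.5.0.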

On 31.03.11 14.35, Karl Wright wrote:
> It might be worth cross-posting this to the Tika user or dev list.
> Jukka Zitting is one of the principal Tika developers and he's also a
> committer for MCF, but I'm not sure he'll notice it go by otherwise.
>
> In case you're wondering how to update the MCF FAQ, it's in the Wiki
> so all you need to do is sign up and you'll be able to update it.
> https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ
>
> Karl
>
> On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>
>> Oh, there's more, unfortunately. Some of the Tika dependencies need to be
>> updated further. I couldn't parse the date from PDF documents correctly. I'm
>> not quite sure which of the extraction libraries is causing this problem
>> (probably pdfbox). Anyway, I can now extract content from the following
>> document formats without any problems:
>> - HTML
>> - RTF
>> - DOC
>> - DOCX
>> - ODT
>> - XLSX
>> - XLS
>> - SXW
>> - PDF
>>
>> I'm using the following jars:
>> apache-solr-cell-1.4.2-dev.jar
>> geronimo-stax-api_1.0_spec-1.0.1.jar
>> poi-scratchpad-3.7.jar
>> asm-3.1.jar
>> icu4j-4_6.jar
>> rome-0.9.jar
>> bcmail-jdk15-1.45.jar
>> jempbox-1.3.1.jar
>> tagsoup-1.2.jar
>> bcprov-jdk15-1.45.jar
>> metadata-extractor-2.4.0-beta-1.jar
>> tika-core-0.8.jar
>> boilerpipe-1.1.0.jar
>> netcdf-4.2.jar
>> tika-parsers-0.8.jar
>> commons-compress-1.1.jar
>> pdfbox-1.3.1.jar
>> commons-logging-1.1.1.jar
>> poi-3.7.jar
>> xercesImpl-2.8.1.jar
>> dom4j-1.6.1.jar
>> poi-ooxml-3.7.jar
>> xml-apis-1.0.b2.jar
>> fontbox-1.3.1.jar
>> poi-ooxml-schemas-3.7.jar
>> xmlbeans-2.3.0.jar
>>
>> But I still have some problems with PDF documents[1]. I'm not sure whether
>> it is a pdfbox bug, but Norwegian characters like æ, ø and å cannot be
>> displayed correctly after Solr has indexed the document. The characters are
>> replaced by a question mark.
>>
>> [1] http://ridder.uio.no/dokument.pdf
>>
>> Erlend
>>
>> On 30.03.11 18.09, Karl Wright wrote:
>>>
>>> Certainly it makes sense to start with the FAQ, especially for places
>>> where you are tripping over known bugs.  We can always do a site page
>>> later.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>>
>>>> On 30.03.11 18.00, Karl Wright wrote:
>>>>>
>>>>> It would be great if this information went at least into the FAQ, and
>>>>> even better if we added a page to the site documentation.  I'm
>>>>> thinking maybe a whole page titled "Integrating with Solr", which
>>>>> would walk you through the process and the pitfalls.  What do you
>>>>> think?
>>>>
>>>> Yes, I think so.
>>>>
>>>> The next version of Solr will probably be released soon, and then it will be
>>>> much easier to integrate Solr. Maybe it is sufficient to add the information
>>>> into the FAQ since the problem mentioned only affects 1.4.1?
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>
>>
>>
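
The question-mark symptom described in the quoted 31 March message (æ, ø and
å each shown as "?") is what text typically looks like after it has passed
through a character set that cannot represent those characters. The following
Python sketch only reproduces and detects that symptom. It does not claim to
show what pdfbox 1.3.1 actually did; the thread only establishes that moving
to the 1.5.0 jars made the problem go away. The function names are
hypothetical.

```python
def simulate_lossy_encoding(text, charset="ascii"):
    """Round-trip text through a charset, replacing unmappable characters with '?'."""
    return text.encode(charset, errors="replace").decode(charset)


def count_replaced(original, indexed):
    """Count positions where a non-ASCII character in `original` became '?'."""
    return sum(
        1
        for before, after in zip(original, indexed)
        if after == "?" and ord(before) > 127
    )


# "æ, ø and å", written with escapes so the example does not depend on file encoding.
sample = "\u00e6, \u00f8 and \u00e5"
garbled = simulate_lossy_encoding(sample)
```

Here garbled is "?, ? and ?" and count_replaced(sample, garbled) is 3.
Round-tripping the same sample through "latin-1" leaves it unchanged, since
that character set can represent all three letters. A comparison like this
between the known text of a test document and what Solr returns can help
confirm whether a jar upgrade has fixed the indexing.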


