manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1
Date: Mon, 04 Apr 2011 15:26:14 GMT

Good to know that it can be made to work. ;-)

We should probably look at Lucene/Solr 3.1, which was just released,
and is the next Solr version after 1.4.1, and see whether anything
special is needed there.

Karl


On Mon, Apr 4, 2011 at 11:01 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> After I downloaded and replaced the following jars, I no longer have a
> character encoding problem:
> pdfbox-1.5.0.jar
> fontbox-1.5.0.jar
> jempbox-1.5.0.jar
>
> Erlend
>
> On 31.03.11 14.35, Karl Wright wrote:
>>
>> It might be worth cross-posting this to the Tika user or dev list.
>> Jukka Zitting is one of the principal Tika developers and he's also a
>> committer for MCF, but I'm not sure he'll notice it go by otherwise.
>>
>> In case you're wondering how to update the MCF FAQ, it's in the Wiki
>> so all you need to do is sign up and you'll be able to update it.
>> https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ
>>
>> Karl
>>
>> On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>
>>> Oh, there's more, unfortunately. Some of the Tika dependencies need to be
>>> updated further. I couldn't parse the date from PDF documents correctly.
>>> I'm not quite sure which of the extraction libraries is causing this
>>> problem (probably pdfbox). Anyway, I can now extract content from the
>>> following document formats without any problems:
>>> - HTML
>>> - RTF
>>> - DOC
>>> - DOCX
>>> - ODT
>>> - XLSX
>>> - XLS
>>> - SXW
>>> - PDF
>>>
>>> I'm using the following jars:
>>> apache-solr-cell-1.4.2-dev.jar
>>> geronimo-stax-api_1.0_spec-1.0.1.jar
>>> poi-scratchpad-3.7.jar
>>> asm-3.1.jar
>>> icu4j-4_6.jar
>>> rome-0.9.jar
>>> bcmail-jdk15-1.45.jar
>>> jempbox-1.3.1.jar
>>> tagsoup-1.2.jar
>>> bcprov-jdk15-1.45.jar
>>> metadata-extractor-2.4.0-beta-1.jar
>>> tika-core-0.8.jar
>>> boilerpipe-1.1.0.jar
>>> netcdf-4.2.jar
>>> tika-parsers-0.8.jar
>>> commons-compress-1.1.jar
>>> pdfbox-1.3.1.jar
>>> commons-logging-1.1.1.jar
>>> poi-3.7.jar
>>> xercesImpl-2.8.1.jar
>>> dom4j-1.6.1.jar
>>> poi-ooxml-3.7.jar
>>> xml-apis-1.0.b2.jar
>>> fontbox-1.3.1.jar
>>> poi-ooxml-schemas-3.7.jar
>>> xmlbeans-2.3.0.jar
>>>
>>> But I still have some problems with PDF documents[1]. I'm not sure whether
>>> it is a pdfbox bug, but Norwegian characters like æ, ø and å cannot be
>>> displayed correctly after Solr has indexed the document. The characters are
>>> replaced by a question mark.
>>>
>>> [1] http://ridder.uio.no/dokument.pdf
>>>
>>> Erlend
>>>
>>> On 30.03.11 18.09, Karl Wright wrote:
>>>>
>>>> Certainly it makes sense to start with the FAQ, especially for places
>>>> where you are tripping over known bugs.  We can always do a site page
>>>> later.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>>>
>>>>> On 30.03.11 18.00, Karl Wright wrote:
>>>>>>
>>>>>> It would be great if this information went at least into the FAQ, and
>>>>>> even better if we added a page to the site documentation.  I'm
>>>>>> thinking maybe a whole page titled "Integrating with Solr", which
>>>>>> would walk you through the process and the pitfalls.  What do you
>>>>>> think?
>>>>>
>>>>> Yes, I think so.
>>>>>
>>>>> The next version of Solr will probably be released soon, and then it will
>>>>> be much easier to integrate Solr. Maybe it is sufficient to add the
>>>>> information to the FAQ, since the problem mentioned only affects 1.4.1?
>>>>>
>>>>> Erlend
>>>>>
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>>
>>>
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
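
The thread reports that swapping pdfbox, fontbox and jempbox from 1.3.1 to 1.5.0 made the character encoding problem go away. When jars are replaced by hand like this, it is easy to leave one of them at the old version. The following is a minimal sketch of a check for that, assuming the three jars are meant to share one version (an assumption based on the version numbers quoted in the thread, not something the thread states). The names PDFBOX_FAMILY, parse_jar, family_versions and is_consistent are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: these three jars are upgraded together and should share a version.
PDFBOX_FAMILY = ("pdfbox", "fontbox", "jempbox")

# Matches "<name>-<version>.jar", where the version starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<name>[A-Za-z][\w.-]*?)-(?P<version>\d[\w.-]*)\.jar$")


def parse_jar(filename):
    """Split 'pdfbox-1.5.0.jar' into ('pdfbox', '1.5.0'); return None if it does not match."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


def family_versions(filenames, family=PDFBOX_FAMILY):
    """Return {jar name: version} for the jars in `filenames` that belong to `family`."""
    versions = {}
    for filename in filenames:
        parsed = parse_jar(filename)
        if parsed is not None and parsed[0] in family:
            versions[parsed[0]] = parsed[1]
    return versions


def is_consistent(versions, family=PDFBOX_FAMILY):
    """True when every family member is present and all of them share one version."""
    return set(versions) == set(family) and len(set(versions.values())) == 1


if __name__ == "__main__":
    # A half-finished upgrade: pdfbox replaced, the other two left at 1.3.1.
    jars = ["pdfbox-1.5.0.jar", "fontbox-1.3.1.jar", "jempbox-1.3.1.jar", "tika-core-0.8.jar"]
    found = family_versions(jars)
    print(found, "consistent" if is_consistent(found) else "MISMATCH")
```

In practice the list of file names would come from the directory holding the Solr extraction libraries, for example via os.listdir. The check only reports whether the three versions agree. It does not say which version is correct; the thread itself is the only evidence here that 1.5.0 resolved the problem.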
