manifoldcf-dev mailing list archives

From Salih Sen <sa...@dilisim.com>
Subject Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection
Date Fri, 09 Jan 2015 11:12:39 GMT
Hi Karl,

It turns out we hit this bug because I left a null output connection in
the job settings before adding the Solr repository.

In any case, it's good to know it'll be fixed in a newer version :)

Thanks.

On Fri, Jan 9, 2015 at 8:25 AM, Karl Wright <daddywri@gmail.com> wrote:
> There is a fix committed, and a patch available that you can use with 1.7.
>
> Thanks,
> Karl
>
> On Thu, Jan 8, 2015 at 1:27 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I was able to reproduce this using an RSS connection as input.  Any
>> bifurcation of the pipeline seems to cause only one metadata field to be
>> transmitted to the outputs, for reasons as yet unclear.
>>
>> CONNECTORS-1138.
>>
>> Karl
>>
>>
>>
>>
>> On Thu, Jan 8, 2015 at 11:36 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Actually, I take some of this back.  Any SharePoint metadata that is
>>> associated with a parent object rather than a child is represented in
>>> RepositoryDocument as a Reader[] array.  So you should see
>>> RepositoryDocumentFactory iterating through all such fields and making a
>>> TempFileCharacterInput for each member of each field.  If you are seeing
>>> only one iteration of the getFields() iterator, it means that the
>>> RepositoryDocument object's fields member is not being managed properly.
>>> But I'm looking at that RepositoryDocument code, and addField() looks like
>>> it does the right thing for all variations of data types.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 8, 2015 at 11:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Salih,
>>>>
>>>> The code you point at is designed to make copies of fields that are
>>>> represented by Reader objects.  Most SharePoint fields are represented by
>>>> String objects, so this code does not apply to them.
>>>>
>>>> The place you want to look is:
>>>>
>>>> >>>>>>
>>>>     // Copy metadata fields (including minting new Readers where needed)
>>>>     Iterator<String> iter = original.getFields();
>>>>     if (iter.hasNext())
>>>>     {
>>>>       String fieldName = iter.next();
>>>>       Object[] objects = original.getField(fieldName);
>>>>       if (objects instanceof Reader[])
>>>>       {
>>>>         CharacterInput[] rts = metadataReaders.get(fieldName);
>>>>         Reader[] newReaders = new Reader[rts.length];
>>>>         for (int i = 0; i < rts.length; i++)
>>>>         {
>>>>           rts[i].doneWithStream();
>>>>           newReaders[i] = rts[i].getStream();
>>>>         }
>>>>         rd.addField(fieldName,newReaders);
>>>>       }
>>>>       else if (objects instanceof Date[])
>>>>       {
>>>>         rd.addField(fieldName,(Date[])objects);
>>>>       }
>>>>       else if (objects instanceof String[])
>>>>       {
>>>>         rd.addField(fieldName,(String[])objects);
>>>>       }
>>>>       else
>>>>         throw new RuntimeException("Unknown kind of metadata: "+objects.getClass().getName());
>>>>     }
>>>>
>>>> <<<<<<
>>>>
>>>> This code should copy all fields to the new RepositoryDocument object
>>>> (rd), and do the necessary special manipulation for Reader fields.
>>>>
>>>> If you'd be willing to send me a screenshot of your job (from your view
>>>> job page), I can try to recreate your pipeline here and see what's going
>>>> on.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Jan 8, 2015 at 11:13 AM, Salih Sen <salih@dilisim.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We've noticed that the metadata of some documents isn't indexed in Solr.
>>>>>
>>>>> I tried tracking down the issue in the source code and noticed that
>>>>> RepositoryDocument has around 25 fields until it reaches the
>>>>> RepositoryDocumentFactory.
>>>>>
>>>>> The document returned from
>>>>>
>>>>> factory.createDocument()
>>>>>
>>>>> has only a single field in IncrementalIngester.java line 3089.
>>>>>
>>>>>
>>>>>
>>>>> I couldn't follow the logic behind if (iter.hasNext()) in the code below:
>>>>> although the document has twenty-something fields, it "iterates" over
>>>>> only the first one. Is this the expected behaviour?
>>>>>
>>>>> Similar code also exists in the createDocument() method, so I feel I
>>>>> might be looking in the wrong place, but as far as I can see this part
>>>>> creates the difference between the document that comes from the
>>>>> SharePoint repository and the one posted to Solr.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> RepositoryDocumentFactory.java
>>>>> ---------------------------------------------
>>>>>
>>>>> public RepositoryDocumentFactory(RepositoryDocument document)
>>>>>   throws ManifoldCFException, IOException
>>>>> {
>>>>>   this.original = document;
>>>>>
>>>>>   try
>>>>>   {
>>>>>     this.binaryTracker = new TempFileInput(document.getBinaryStream());
>>>>>     // Copy all reader streams
>>>>>     Iterator<String> iter = document.getFields();
>>>>>     if (iter.hasNext())
>>>>>     {
>>>>>       String fieldName = iter.next();
>>>>>       Object[] objects = document.getField(fieldName);
>>>>>       if (objects instanceof Reader[])
>>>>>       {
>>>>>         CharacterInput[] newValues = new CharacterInput[objects.length];
>>>>>         metadataReaders.put(fieldName,newValues);
>>>>>         // Populate newValues
>>>>>         for (int i = 0; i < newValues.length; i++)
>>>>>         {
>>>>>           newValues[i] = new TempFileCharacterInput((Reader)objects[i]);
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>>   catch (Throwable e)
>>>>>   {
>>>>>     // Clean up everything we've done so far.
>>>>>     if (this.binaryTracker != null)
>>>>>       this.binaryTracker.discard();
>>>>>     for (String key : metadataReaders.keySet())
>>>>>     {
>>>>>       CharacterInput[] rt = metadataReaders.get(key);
>>>>>       for (CharacterInput r : rt)
>>>>>       {
>>>>>         if (r != null)
>>>>>           r.discard();
>>>>>       }
>>>>>     }
>>>>>     if (e instanceof IOException)
>>>>>       throw (IOException)e;
>>>>>     else if (e instanceof RuntimeException)
>>>>>       throw (RuntimeException)e;
>>>>>     else if (e instanceof Error)
>>>>>       throw (Error)e;
>>>>>     else
>>>>>       throw new RuntimeException("Unknown exception type: "+e.getClass().getName()+": "+e.getMessage(),e);
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Salih Şen
>>>>>
>>>>> Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret
>>>>> Ltd.
>>>>> Sti.
>>>>>
>>>>> email: salih@dilisim.com
>>>>>
>>>>> Tel: 0 222 330 20 21
>>>>>
>>>>> GSM: 0 507 296 15 51
>>>>>
>>>>
>>>>
>>>
>>



-- 
Salih Şen

Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd. Sti.

email: salih@dilisim.com

Tel: 0 222 330 20 21

GSM: 0 507 296 15 51
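
The symptom discussed in this thread (about 25 fields going in, one coming out) is what one would expect whenever an iterator is guarded by `if (iter.hasNext())` rather than `while (iter.hasNext())`. The following is a minimal sketch of that difference using only `java.util` collections. The class and method names (`FieldCopyDemo`, `copyWithIf`, `copyWithWhile`) are hypothetical and are not part of ManifoldCF, and the thread does not state that this is the change made for CONNECTORS-1138.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not ManifoldCF code.
public class FieldCopyDemo {

  // Guarding the iterator with `if` runs the body at most once,
  // so at most one field is ever copied.
  public static Map<String, String[]> copyWithIf(Map<String, String[]> source) {
    Map<String, String[]> target = new LinkedHashMap<>();
    Iterator<String> iter = source.keySet().iterator();
    if (iter.hasNext()) {
      String fieldName = iter.next();
      target.put(fieldName, source.get(fieldName));
    }
    return target;
  }

  // Guarding the iterator with `while` visits every field name.
  public static Map<String, String[]> copyWithWhile(Map<String, String[]> source) {
    Map<String, String[]> target = new LinkedHashMap<>();
    Iterator<String> iter = source.keySet().iterator();
    while (iter.hasNext()) {
      String fieldName = iter.next();
      target.put(fieldName, source.get(fieldName));
    }
    return target;
  }

  public static void main(String[] args) {
    // Made-up sample metadata, standing in for a document's fields.
    Map<String, String[]> fields = new LinkedHashMap<>();
    fields.put("title", new String[] {"Report"});
    fields.put("author", new String[] {"someone"});
    fields.put("modified", new String[] {"2015-01-08"});

    System.out.println("if:    " + copyWithIf(fields).size() + " field(s) copied");
    System.out.println("while: " + copyWithWhile(fields).size() + " field(s) copied");
  }
}
```

With the three sample fields above, the `if` variant copies one field and the `while` variant copies all three.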
