manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Metadata fields get lost in 1.7.2 with Sharepoint 2013 repository and Solr output connection
Date Thu, 08 Jan 2015 18:27:55 GMT
I was able to reproduce this using an RSS connection as input.  Any
bifurcation of the pipeline seems to cause only one metadata field to be
transmitted to the outputs, for reasons as yet unclear.

CONNECTORS-1138.

Karl




On Thu, Jan 8, 2015 at 11:36 AM, Karl Wright <daddywri@gmail.com> wrote:

> Actually, I take some of this back.  Any SharePoint metadata that is
> associated with a parent object rather than a child is represented in
> RepositoryDocument as a Reader[] array.  So you should see
> RepositoryDocumentFactory iterating through all such fields and making a
> TempFileCharacterInput for each member of each field.  If you are seeing
> only one iteration of the getFields() iterator, it means that the
> RepositoryDocument object fields member is not properly being managed.  But
> I'm looking at the RepositoryDocument code, and addField() looks like it does
> the right thing for all variations of data types.
>
> Karl
>
>
> On Thu, Jan 8, 2015 at 11:24 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Salih,
>>
>> The code you point at is designed to make copies of fields that are
>> represented by Reader objects.  Most SharePoint fields are represented by
>> String objects, so this code does not apply to them.
>>
>> The place you want to look is:
>>
>> >>>>>>
>>     // Copy metadata fields (including minting new Readers where needed)
>>     Iterator<String> iter = original.getFields();
>>     if (iter.hasNext())
>>     {
>>       String fieldName = iter.next();
>>       Object[] objects = original.getField(fieldName);
>>       if (objects instanceof Reader[])
>>       {
>>         CharacterInput[] rts = metadataReaders.get(fieldName);
>>         Reader[] newReaders = new Reader[rts.length];
>>         for (int i = 0; i < rts.length; i++)
>>         {
>>           rts[i].doneWithStream();
>>           newReaders[i] = rts[i].getStream();
>>         }
>>         rd.addField(fieldName,newReaders);
>>       }
>>       else if (objects instanceof Date[])
>>       {
>>         rd.addField(fieldName,(Date[])objects);
>>       }
>>       else if (objects instanceof String[])
>>       {
>>         rd.addField(fieldName,(String[])objects);
>>       }
>>       else
>>         throw new RuntimeException("Unknown kind of metadata: "+objects.getClass().getName());
>>     }
>>
>> <<<<<<
>>
>> This code should copy all fields to the new RepositoryDocument object
>> (rd), and do the necessary special manipulation for Reader fields.
>>
>> If you'd be willing to send me a screen shot of your job (from your view
>> job page), I can try to recreate your pipeline here and see what's going on.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Thu, Jan 8, 2015 at 11:13 AM, Salih Sen <salih@dilisim.com> wrote:
>>
>>> Hi,
>>>
>>> We've noticed that the metadata of some documents isn't indexed in Solr.
>>>
>>> I tried tracking down the issue in the source code and noticed that
>>> RepositoryDocument has around 25 fields until it reaches the
>>> RepositoryDocumentFactory. The document returned from
>>> factory.createDocument() has only a single field in
>>> IncrementalIngester.java line 3089.
>>>
>>> I couldn't follow the logic behind if (iter.hasNext()) in the code below:
>>> although the document has twenty-something fields, it "iterates" over only
>>> the first one. Is this the expected behaviour?
>>>
>>> Similar code also exists in the createDocument() method, so I feel I might
>>> be looking in the wrong place, but as far as I can see this part creates
>>> the difference between the document that comes from the SharePoint
>>> repository and the one posted to Solr.
>>>
>>> Thanks.
>>>
>>>
>>> RepositoryDocumentFactory.java
>>> ---------------------------------------------
>>>
>>> public RepositoryDocumentFactory(RepositoryDocument document)
>>>   throws ManifoldCFException, IOException
>>> {
>>>   this.original = document;
>>>
>>>   try
>>>   {
>>>     this.binaryTracker = new TempFileInput(document.getBinaryStream());
>>>     // Copy all reader streams
>>>     Iterator<String> iter = document.getFields();
>>>     if (iter.hasNext())
>>>     {
>>>       String fieldName = iter.next();
>>>       Object[] objects = document.getField(fieldName);
>>>       if (objects instanceof Reader[])
>>>       {
>>>         CharacterInput[] newValues = new CharacterInput[objects.length];
>>>         metadataReaders.put(fieldName,newValues);
>>>         // Populate newValues
>>>         for (int i = 0; i < newValues.length; i++)
>>>         {
>>>           newValues[i] = new TempFileCharacterInput((Reader)objects[i]);
>>>         }
>>>       }
>>>     }
>>>   }
>>>   catch (Throwable e)
>>>   {
>>>     // Clean up everything we've done so far.
>>>     if (this.binaryTracker != null)
>>>       this.binaryTracker.discard();
>>>     for (String key : metadataReaders.keySet())
>>>     {
>>>       CharacterInput[] rt = metadataReaders.get(key);
>>>       for (CharacterInput r : rt)
>>>       {
>>>         if (r != null)
>>>           r.discard();
>>>       }
>>>     }
>>>     if (e instanceof IOException)
>>>       throw (IOException)e;
>>>     else if (e instanceof RuntimeException)
>>>       throw (RuntimeException)e;
>>>     else if (e instanceof Error)
>>>       throw (Error)e;
>>>     else
>>>       throw new RuntimeException("Unknown exception type: "+e.getClass().getName()+": "+e.getMessage(),e);
>>>   }
>>> }
>>>
>>>
>>>
>>> --
>>>
>>> Salih Şen
>>>
>>> Dilişim Bilgi Bilgisayar ve İletişim Teknolojileri Sanayi ve Ticaret Ltd.
>>> Sti.
>>>
>>> email: salih@dilisim.com
>>>
>>> Tel: 0 222 330 20 21
>>>
>>> GSM: 0 507 296 15 51
>>>
>>
>>
>
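
The question in the thread turns on the difference between guarding an
iterator with if and looping over it with while. The following is a minimal,
self-contained sketch of that difference only. It does not use ManifoldCF
classes: FieldCopyDemo, sampleFields(), copyWithIf() and copyWithWhile() are
hypothetical names, and a plain Map stands in for RepositoryDocument. Whether
this is the cause tracked in CONNECTORS-1138 is not confirmed in this thread.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldCopyDemo {
    // Hypothetical stand-in for a document's metadata: field name -> values.
    static Map<String, String[]> sampleFields() {
        Map<String, String[]> fields = new LinkedHashMap<>();
        fields.put("title", new String[] {"Report"});
        fields.put("author", new String[] {"Salih"});
        fields.put("modified", new String[] {"2015-01-08"});
        return fields;
    }

    // Same shape as the quoted code: "if" executes the body at most once,
    // so only the first field name returned by the iterator is copied.
    static Map<String, String[]> copyWithIf(Map<String, String[]> source) {
        Map<String, String[]> copy = new LinkedHashMap<>();
        Iterator<String> iter = source.keySet().iterator();
        if (iter.hasNext()) {
            String fieldName = iter.next();
            copy.put(fieldName, source.get(fieldName));
        }
        return copy;
    }

    // "while" repeats the body until the iterator is exhausted,
    // so every field is copied.
    static Map<String, String[]> copyWithWhile(Map<String, String[]> source) {
        Map<String, String[]> copy = new LinkedHashMap<>();
        Iterator<String> iter = source.keySet().iterator();
        while (iter.hasNext()) {
            String fieldName = iter.next();
            copy.put(fieldName, source.get(fieldName));
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String[]> fields = sampleFields();
        System.out.println("if:    " + copyWithIf(fields).keySet());    // [title]
        System.out.println("while: " + copyWithWhile(fields).keySet()); // [title, author, modified]
    }
}
```

With three source fields, the if variant yields a copy holding one field and
the while variant yields all three, which matches the symptom described above
of a document arriving at the output with a single metadata field.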
