manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Determining document model passed to search engine
Date Mon, 11 Feb 2013 21:28:56 GMT
The Elasticsearch output connector always base64-encodes the document
content.  I gather that is standard for Elasticsearch.
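
For anyone inspecting the indexed documents afterwards, here is a minimal
sketch of reading the content back. It assumes the "file" field comes back
as a plain base64 string, as the field listing quoted below suggests. The
helper name and the sample document are hypothetical and are not part of
ManifoldCF or Elasticsearch.

```python
import base64


def decode_file_field(doc):
    """Return the raw bytes stored in a document's base64 'file' field.

    Assumes 'doc' is a dict shaped like the fields listed in this thread,
    with 'file' holding a base64-encoded string.
    """
    return base64.b64decode(doc["file"])


# Hypothetical document, built locally for illustration only.
sample = {
    "type": "attachment",
    "header-Content-Type": "text/html; charset=UTF-8",
    "file": base64.b64encode(b"<html><body>Hello</body></html>").decode("ascii"),
}

# The charset is taken from the sample's header-Content-Type value by hand.
html = decode_file_field(sample).decode("utf-8")
```

In practice the charset would have to be read from whatever content type
metadata the repository connector supplied, if any.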

Karl

On Mon, Feb 11, 2013 at 4:00 PM, Tony Edgin <tedgin.iplant@gmail.com> wrote:
> Thanks again.
>
> I just ran an example set up to understand better what you said.
>
> As you said, the web page URL gets mapped to the _id field.
> The metadata that is sent to Elastic Search is as follows:
>
>       header-Content-Type: "text/html; charset=UTF-8"
>       header-Content-Length: "3278"
>       header-Keep-Alive: "timeout=5, max=100"
>       header-Server: "Apache/2.2"
>       header-Connection: "Keep-Alive"
>       type: "attachment"
>       file: ...
>
> The file field looks to be base64 encoded.  Is this always the case, or is
> this unique to web repo + elastic search?
>
> This must be the web page. I'm guessing the header-Content-Type field,
> rather than the type field, holds the document type.
>
> On Mon, Feb 11, 2013 at 1:17 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> What emerges from the web connector is the following:
>>
>> -       metadata, which you define on the web connector’s “Metadata” tab,
>> that are named however you want;
>> -       forced acls, which get added to the document based on what you
>> select on the “Security” tab;
>> -       the document’s content type;
>> -       the document’s url;
>> -       the document itself.
>>
>> What the elastic search connector does is:
>> -       Map the document’s url to ElasticSearch’s document id field
>> (which I guess shows up in Elastic Search as the ‘uri’ field)
>> -       Output all the metadata directly to ElasticSearch using the name
>> provided by the repository connector
>> -       Set the file value to “” (which seems wrong, since that could be
>> helpful if available - let me know if you think a fix for this would
>> be useful)
>> -       NONE of the rest of the document fields (content type, acls, etc)
>> are communicated to Elastic Search at all right now, except for the
>> document itself.
>>
>> Karl
>>
>>
>> On Mon, Feb 11, 2013 at 2:55 PM, Tony Edgin <tedgin.iplant@gmail.com> wrote:
>> > Thanks for the speedy response!
>> >
>> > I eventually want to index the contents of our local website with
>> > Elastic Search.
>> >
>> > I would use the Web repository connector with the no authority connector
>> > and the Elasticsearch output connector.  Would you mind letting me know
>> > the names and meanings of the metadata that gets passed to Elastic Search?
>> >
>> > Thanks again.
>> >
>> >
>> > On Mon, Feb 11, 2013 at 12:45 PM, Karl Wright <daddywri@gmail.com> wrote:
>> >>
>> >> So let me get this clear - you are looking to find out what the
>> >> names/meanings are of the metadata that gets passed to the output
>> >> connector, for a given repository connection?
>> >>
>> >> If this is what you are looking for, I'm afraid that while at one
>> >> point the end-user documentation described this pretty accurately, it
>> >> is now significantly out of date.  While it's not terribly hard to
>> >> compile this information from source code etc., the work definitely
>> >> needs to be repeated by somebody.
>> >>
>> >> If you want to ask this question about a specific connector, I can
>> >> certainly try to answer it, though.  If you want to contribute either
>> >> the information or a documentation patch, this would be great too.
>> >>
>> >> Karl
>> >>
>> >> On Mon, Feb 11, 2013 at 2:38 PM, Tony Edgin <tedgin.iplant@gmail.com> wrote:
>> >> > I'm sure this is documented somewhere, and I apologize in advance for
>> >> > not being able to find it.
>> >> >
>> >> > How do I determine the model or schema of the document passed to the
>> >> > search engine by a given job?
>> >> >
>> >> > For instance, I'm running a job that crawls a directory on my local
>> >> > file system and passes it to Elastic Search.  Interrogating Elastic
>> >> > Search, I can determine that the document has three fields, "file",
>> >> > "type" and "uri", all strings.  How would I have known that in advance?
>> >> >
>> >> > Thanks for any help.
>> >
>> >
>
>
