metamodel-dev mailing list archives

From Kasper Sørensen <i.am.kasper.soren...@gmail.com>
Subject Re: [DISCUSS] use folder name as schema name for file based DataContexts
Date Tue, 20 Aug 2013 08:30:09 GMT
Agreed on all, except: why should dots in column names be treated any
differently from dots in schema and table names?

2013/8/16 Hans Drexler <Hans.Drexler@humaninference.com>:
> I believe that *every* convention will probably have its drawbacks. Using a
> factory can help on the one hand, but it can also cause great confusion if
> things get mixed, and it makes things more complex. If we clearly document
> the choice made, I can live with that.
>
> My main point is that we should try to write and document the software in
> such a way that MetaModel users will not get confused. I like the quotes idea,
> since it allows the user to express explicitly what is intended. But then,
> let's extend it to something like this:
>
> "schema_name"."table_name"."column_name"
>
> Where schema_name and table_name can contain dots ("."). (I guess column names cannot...)
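
A minimal sketch of how such a quoted path could be split, assuming double quotes as the only quote character and no escaping rules. The class `QuotedPathSplitter` and its `split` method are hypothetical names for illustration and are not part of MetaModel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not MetaModel API: splits a path on dots,
// treating dots inside double quotes as part of the name.
final class QuotedPathSplitter {

    static List<String> split(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '"') {
                // Quotes only delimit a name; they are not kept in the token.
                inQuotes = !inQuotes;
            } else if (c == '.' && !inQuotes) {
                // An unquoted dot ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unbalanced quote in path: " + path);
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // prints [documents.archive, people.csv, name]
        System.out.println(split("\"documents.archive\".\"people.csv\".\"name\""));
        // prints [documents, people, csv, name]
        System.out.println(split("documents.people.csv.name"));
    }
}
```

With quoting, the caller decides where the schema, table and column boundaries are, so no guessing is needed. Without quotes the same input still falls back to one token per dot.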
>
> I hope you don't mind me rambling about this...
>
> kind regards,
>
> Hans
>
> -----Original Message-----
> From: Kasper Sørensen [mailto:i.am.kasper.sorensen@gmail.com]
> Sent: Wednesday, August 14, 2013 2:59 PM
> To: dev@metamodel.incubator.apache.org
> Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts
>
> With those different preferences, we could even consider making something
> like a "TableNameFactory" which converts filenames into table names. But I
> guess the crucial point is which default convention to use.
>
> Underscoring makes the column or table paths a bit cleaner to look at, but it
> also makes the representation less direct. A user could start wondering
> whether characters other than dots are also replaced by underscores, etc.
>
> It should be noted that MM's parser does support dots in both table and
> schema names, so this is probably mostly a question of aesthetics.
>
> The ambiguity that you point out is also interesting. So far I haven't seen
> it appear in real life, but technically you could have two schema/table
> pairs that generate the same, ambiguous table path. For instance:
>
> Schema: foo.bar
> Table: baz
>
> and
>
> Schema: foo
> Table: bar.baz
>
> The parser would currently favor the second schema ("foo"), since it
> incrementally tries for schema/table/column matches with every dot-separated
> token. An improvement to the parser would be to allow quote characters, so
> that you could express your table path like this:
>
> "foo.bar".baz
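
The ambiguity described above can be illustrated with a small sketch. It does not reproduce MetaModel's actual parser; it assumes a hypothetical in-memory catalog (`TablePathAmbiguity`) and a resolution order that tries the shortest schema name first, which is consistent with the "foo" schema being favored.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not MetaModel code.
final class TablePathAmbiguity {

    // Catalog of schema name -> table names.
    private final Map<String, Set<String>> catalog = new LinkedHashMap<>();

    void addTable(String schema, String table) {
        catalog.computeIfAbsent(schema, k -> new LinkedHashSet<>()).add(table);
    }

    // Tries each dot as the schema/table boundary, shortest schema first,
    // and returns the first match as "schema | table", or null if none.
    String resolve(String tablePath) {
        int dot = tablePath.indexOf('.');
        while (dot != -1) {
            String schema = tablePath.substring(0, dot);
            String table = tablePath.substring(dot + 1);
            Set<String> tables = catalog.get(schema);
            if (tables != null && tables.contains(table)) {
                return schema + " | " + table;
            }
            dot = tablePath.indexOf('.', dot + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        TablePathAmbiguity both = new TablePathAmbiguity();
        both.addTable("foo.bar", "baz");
        both.addTable("foo", "bar.baz");
        // Both pairs produce the path foo.bar.baz; the shorter schema wins.
        System.out.println(both.resolve("foo.bar.baz")); // prints foo | bar.baz
    }
}
```

In this sketch the table "baz" in schema "foo.bar" cannot be reached through an unquoted path once schema "foo" also contains "bar.baz", which is the case the quoting proposal addresses.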
>
> Also, I want to note that some databases do support dots in
> schema/table/column names, so this ambiguity can (although rarely) also occur
> in an RDBMS or other data sources. It would also be quite common to have some
> separator (not necessarily a dot) in NoSQL database column names, to indicate
> a nested field. In HBase, for instance, they are referred to using a colon,
> like this: "columnFamily:column".
>
> All in all, I am mostly in favor of preserving the dots from the filenames,
> but I am also very curious what other people think!
>
> 2013/8/14 Hans Drexler <Hans.Drexler@humaninference.com>:
>> Hi,
>>
>> First, I agree with bumping this issue. At a customer site, this caused a
>> lot of time to be spent figuring out what was going on. I am not sure I
>> like the extension as part of the table name, because:
>> - I would never create a table in a relational database with a dot in
>> the name
>> - It creates an ambiguity. If you have a "full" path name to a column, like
>> "documents.people.csv.name", then it is not clear whether the schema name is
>> "documents.people" and the table name is "csv", or the schema name is
>> "documents" and the table name is "people.csv". It seems natural to me that
>> schema names contain dots, but not table names.
>>
>> Alternatives:
>> - Leave the extension out of the name (probably not acceptable, because then
>> you can no longer have two "tables" differing only in extension), although I
>> must say that personally I think this would be the best solution.
>>
>> - Use a conventional name, like:
>> Schema name: Folder name
>> Table name: The filename, including extension (all dots replaced by underscores).
>> Resulting in e.g. a column path like this:
>> documents.people_csv.name
>>
>> At the customer site, the file I needed to use actually had a name following
>> this pattern: "bar/FOO.PEOPLE.IN.FILE". Using the convention, this would become:
>> bar.FOO_PEOPLE_IN_FILE
>>
>> IMHO this is preferable to "bar.foo.people.in.file"
>>
>> The problem is of course that it would now be impossible to have
>> another file "bar/FOO_PEOPLE_IN_FILE" :-(
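
The underscore convention and its collision can be sketched as follows. The class and method names (`UnderscoreNaming`, `tableName`, `tablePath`) are hypothetical and only illustrate the proposal; they are not MetaModel API.

```java
// Hypothetical illustration of the proposed underscore convention.
final class UnderscoreNaming {

    // Table name: the filename with every dot replaced by an underscore.
    static String tableName(String fileName) {
        return fileName.replace('.', '_');
    }

    // Table path: folder name as schema, then the underscored table name.
    static String tablePath(String folderName, String fileName) {
        return folderName + "." + tableName(fileName);
    }

    public static void main(String[] args) {
        String a = tablePath("bar", "FOO.PEOPLE.IN.FILE");
        String b = tablePath("bar", "FOO_PEOPLE_IN_FILE");
        System.out.println(a);           // prints bar.FOO_PEOPLE_IN_FILE
        System.out.println(a.equals(b)); // prints true: two files, one table path
    }
}
```

Because the mapping is not reversible, two distinct files in the same folder can end up with the same table path, which is the drawback noted above.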
>>
>> I am happy to hear other people's thoughts.
>>
>>
>> Hans
>>
>>
>> -----Original Message-----
>> From: Kasper Sørensen [mailto:i.am.kasper.sorensen@gmail.com]
>> Sent: Wednesday, August 14, 2013 10:18 AM
>> To: dev@metamodel.incubator.apache.org
>> Subject: Re: [DISCUSS] use folder name as schema name for file based
>> DataContexts
>>
>> Rats, made a mistake in that diff. The Gist has been updated [1] and now
>> contains the ResourceUtils class which was missing before.
>> [1] https://gist.github.com/kaspersorensen/6210970
>>
>> 2013/8/12 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>>> Here's a proposed patch (implemented for CSV and fixedwidth files
>>> which are the modules that implemented the old schema naming pattern):
>>> https://gist.github.com/kaspersorensen/6210970
>>>
>>> 2013/8/10 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>>>> https://issues.apache.org/jira/browse/METAMODEL-4
>>>>
>>>> 2013/8/10 Henry Saputra <henry.saputra@gmail.com>:
>>>>> What is the JIRA for this one?
>>>>>
>>>>>
>>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg <
>>>>> Manuel.vandenBerg@humaninference.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> (shouldn't I just vote on the Jira for this?)
>>>>>>
>>>>>> manuel
>>>>>>
>>>>>> > -----Original Message-----
>>>>>> > From: Kasper Sørensen [mailto:i.am.kasper.sorensen@gmail.com]
>>>>>> > Sent: Friday, August 09, 2013 9:03
>>>>>> > To: dev@metamodel.incubator.apache.org
>>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file
>>>>>> > based DataContexts
>>>>>> >
>>>>>> > Allow me to bump this issue (it's my impression that more people
>>>>>> > have joined in a bit late, after this topic was posted).
>>>>>> >
>>>>>> > I think this is one of the more important issues that I would
>>>>>> > want to fix before we make our first release at Apache.
>>>>>> >
>>>>>> > 2013/7/24 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>>>>>> > > Right now we have this slightly odd naming convention for
>>>>>> > > schema and table names when building metadata for e.g. a CSV
>>>>>> > > file or a fixed width value file.
>>>>>> > >
>>>>>> > > Schema name: The filename, including file extension.
>>>>>> > > Table name: The filename without extension.
>>>>>> > > Resulting in e.g. a column path like this:
>>>>>> > > people.csv.people.name
>>>>>> > >
>>>>>> > > I suggest we change it to this convention:
>>>>>> > >
>>>>>> > > Schema name: Folder name
>>>>>> > > Table name: The filename, including file extension.
>>>>>> > > Resulting in e.g. a column path like this:
>>>>>> > > documents.people.csv.name
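
The two conventions can be compared side by side with a short sketch. The class `FileNamingConventions` and its methods are hypothetical and written only to reproduce the two example column paths; the actual patch is the one linked in the Gist.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the old and the proposed naming convention.
final class FileNamingConventions {

    // Old: schema = filename with extension, table = filename without it.
    static String oldColumnPath(Path file, String column) {
        String fileName = file.getFileName().toString();
        int lastDot = fileName.lastIndexOf('.');
        String table = lastDot > 0 ? fileName.substring(0, lastDot) : fileName;
        return fileName + "." + table + "." + column;
    }

    // Proposed: schema = parent folder name, table = filename with extension.
    static String newColumnPath(Path file, String column) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("File has no named parent folder: " + file);
        }
        String schema = parent.getFileName().toString();
        String table = file.getFileName().toString();
        return schema + "." + table + "." + column;
    }

    public static void main(String[] args) {
        Path file = Paths.get("documents", "people.csv");
        System.out.println(oldColumnPath(file, "name")); // prints people.csv.people.name
        System.out.println(newColumnPath(file, "name")); // prints documents.people.csv.name
    }
}
```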
>>>>>> > >
>>>>>> > > Why do I think this would be an improvement?
>>>>>> > >
>>>>>> > > 1) Because it would make sense to the user to see the file
>>>>>> > > system's hierarchy reflected in the schema model.
>>>>>> > > 2) Because it allows us to make these DataContexts operate
>>>>>> > > not on a single file, but on a directory of files. I have
>>>>>> > > seen quite a number of times by now that users of MetaModel,
>>>>>> > > or users of e.g. DataCleaner, which uses MetaModel quite
>>>>>> > > heavily, want to do this sort of thing.
>>>>>> > > 3) Removing the file extension is kind of broken and a
>>>>>> > > strange convention in the first place.
>>>>>> > >
>>>>>> > > While this doesn't really break backwards compatibility in
>>>>>> > > terms of Java code, it would break configuration files and
>>>>>> > > other parts of applications that use MetaModel. But I do
>>>>>> > > believe that can be communicated and handled by carefully
>>>>>> > > explaining the new convention on the migration page (that
>>>>>> > > I recently started writing [1]).
>>>>>> > >
>>>>>> > > What do you think?
>>>>>> > >
>>>>>> > > [1]
>>>>>> > > http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
>>>>>>
