metamodel-dev mailing list archives

From Kasper Sørensen <i.am.kasper.soren...@gmail.com>
Subject Re: [DISCUSS] use folder name as schema name for file based DataContexts
Date Wed, 14 Aug 2013 12:58:43 GMT
With those different preferences, we could even consider making
something like a "TableNameFactory" which converts filenames into
table names. But I guess the crucial point is which default convention
to use.
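
To make the idea a bit more concrete, here is a minimal sketch of what
such a factory could look like. Note that "TableNameFactory" and the
three strategies below are made-up names for illustration only; none of
this is an existing MetaModel API.

```java
// Hypothetical sketch: "TableNameFactory" is only the name floated in this
// thread, not an existing MetaModel interface.
interface TableNameFactory {
    String toTableName(String filename);
}

final class TableNameFactories {

    // Keeps the filename as-is, e.g. "people.csv" -> "people.csv".
    static final TableNameFactory PRESERVE_DOTS = filename -> filename;

    // Replaces every dot with an underscore, e.g. "people.csv" -> "people_csv".
    static final TableNameFactory UNDERSCORE_DOTS =
            filename -> filename.replace('.', '_');

    // Drops the last extension, e.g. "people.csv" -> "people".
    // Names without a dot (or with only a leading dot) are returned unchanged.
    static final TableNameFactory STRIP_EXTENSION = filename -> {
        int lastDot = filename.lastIndexOf('.');
        return lastDot > 0 ? filename.substring(0, lastDot) : filename;
    };

    private TableNameFactories() {
    }
}
```

With something like this, the default would just be one of the
constants, and the underscore variant makes the collision risk easy to
see: "FOO.PEOPLE.IN.FILE" and "FOO_PEOPLE_IN_FILE" both come out as
"FOO_PEOPLE_IN_FILE".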

Underscoring makes it a bit cleaner to look at the column or table
paths, but it also makes the representation less direct. A user could
start wondering if there are other characters than dots that will be
replaced by underscores etc.

It should be noted that MM's parser does support dots in both table
and schema names, so this is probably mostly a question of aesthetics.

The ambiguity that you point out is also interesting. So far I haven't
seen it appear in real life, but technically you could have two
schema/table pairs that produce the same, ambiguous table path. For
instance:

Schema: foo.bar
Table: baz

and

Schema: foo
Table: bar.baz

The parser would currently favor the second schema ("foo") since it
incrementally tries for schema/table/column matches with every
dot-separated token. An improvement to the parser would be to allow
quote characters, so that you could then express your table path like
this:

"foo.bar".baz

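As a rough illustration of both the current behaviour and the quoting
idea, here is a small standalone sketch. It is not MetaModel's actual
parser: the class and method names are invented, it only resolves the
schema part, and it assumes the shortest matching token prefix wins,
which is how I described the behaviour above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not MetaModel's real parser.
final class TablePathSketch {

    // Splits a path on dots, except for dots inside double quotes.
    // The quote characters themselves are dropped from the tokens.
    static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : path.toCharArray()) {
            if (c == '"') {
                quoted = !quoted;
            } else if (c == '.' && !quoted) {
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    // Returns {schema, table}, or null if no known schema matches.
    // Prefixes are tried token by token, so the shortest match wins.
    static String[] resolve(String path, Set<String> schemaNames) {
        List<String> tokens = tokenize(path);
        for (int i = 1; i < tokens.size(); i++) {
            String schema = String.join(".", tokens.subList(0, i));
            if (schemaNames.contains(schema)) {
                String table = String.join(".", tokens.subList(i, tokens.size()));
                return new String[] { schema, table };
            }
        }
        return null;
    }

    private TablePathSketch() {
    }
}
```

With the schemas "foo" and "foo.bar" both present, the unquoted path
foo.bar.baz resolves to schema "foo" and table "bar.baz", while the
quoted form above resolves to schema "foo.bar" and table "baz".
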
Also I want to note that some databases do support dots in
schema/table/column names, so this ambiguity can (although rarely)
also occur in an RDBMS or other data sources. It is also quite
common for NoSQL database column names to contain some separator
(not necessarily a dot) that indicates a nested field. In HBase, for
instance, columns are referred to using a colon, like this:
"columnFamily:column".

All in all, I am leaning towards preserving the dots from the
filenames, but am also very curious what other people think!

2013/8/14 Hans Drexler <Hans.Drexler@humaninference.com>:
> Hi,
>
> First I agree with bumping this issue. When at the customer, this thing caused a lot
> of time spent in figuring out what was going on. I am not sure if I like the extension as
> part of the table name, because:
> - I would never create a table in a relational database with a dot in the name
> - It creates an ambiguity. If you have a "full" path name to a column, like
> "documents.people.csv.name", then it is not clear if the schema name is "documents.people"
> and the table name is "csv", or that the schema name is "documents" and the table name is
> "people.csv". It seems natural to me that schema names contain dots, but not table names.
>
> Alternatives:
> - Leave the extension out of the name (probably not acceptable, because then you can
> no longer have two "tables" differing only in extension). Although I must say that personally
> I think this would be the best solution.
>
> - Use a conventional name, like:
> Schema name: Folder name
> Table name: The filename, including extension (all dots replaced by underscores).
> Resulting in e.g. a column path like this:
> documents.people_csv.name
>
> At the customer site, the file I needed to use was actually named following this pattern:
> "bar/FOO.PEOPLE.IN.FILE". Using the convention, this would become:
> bar.FOO_PEOPLE_IN_FILE
>
> IMHO this is preferable to "bar.foo.people.in.file"
>
> The problem is of course that it would now be impossible to have another file
> "bar/FOO_PEOPLE_IN_FILE" :-(
>
> I am happy to hear other people's thoughts.
>
>
> Hans
>
>
> -----Original Message-----
> From: Kasper Sørensen [mailto:i.am.kasper.sorensen@gmail.com]
> Sent: Wednesday, August 14, 2013 10:18 AM
> To: dev@metamodel.incubator.apache.org
> Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts
>
> Rats, made a mistake in that diff. The Gist has been updated [1] and now contains the
> ResourceUtils class which was missing before.
> [1] https://gist.github.com/kaspersorensen/6210970
>
> 2013/8/12 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>> Here's a proposed patch (implemented for CSV and fixedwidth files
>> which are the modules that implemented the old schema naming pattern):
>> https://gist.github.com/kaspersorensen/6210970
>>
>> 2013/8/10 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>>> https://issues.apache.org/jira/browse/METAMODEL-4
>>>
>>> 2013/8/10 Henry Saputra <henry.saputra@gmail.com>:
>>>> What is the JIRA for this one?
>>>>
>>>>
>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg <
>>>> Manuel.vandenBerg@humaninference.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> (shouldn't I just vote on the Jira for this?)
>>>>>
>>>>> manuel
>>>>>
>>>>> > -----Original Message-----
>>>>> > From: Kasper Sørensen [mailto:i.am.kasper.sorensen@gmail.com]
>>>>> > Sent: Friday, August 09, 2013 9:03
>>>>> > To: dev@metamodel.incubator.apache.org
>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file
>>>>> > based DataContexts
>>>>> >
>>>>> > Allow me to bump this issue (it's my impression that more people
>>>>> > have joined in a bit late, after this topic was posted).
>>>>> >
>>>>> > I think this is one of the more important issues that I would
>>>>> > want to fix before we make our first release at Apache.
>>>>> >
>>>>> > 2013/7/24 Kasper Sørensen <i.am.kasper.sorensen@gmail.com>:
>>>>> > > Right now we have this slightly odd naming convention for
>>>>> > > schema and table names when building metadata for e.g. a CSV
>>>>> > > file or a fixed width value file.
>>>>> > >
>>>>> > > Schema name: The filename, including file extension.
>>>>> > > Table name: The filename without extension.
>>>>> > > Resulting in e.g. a column path like this:
>>>>> > > people.csv.people.name
>>>>> > >
>>>>> > > I suggest we change it to this convention:
>>>>> > >
>>>>> > > Schema name: Folder name
>>>>> > > Table name: The filename, including file extension.
>>>>> > > Resulting in e.g. a column path like this:
>>>>> > > documents.people.csv.name
>>>>> > >
>>>>> > > Why do I think this would be an improvement?
>>>>> > >
>>>>> > > 1) Because this would first of all make a kind of sense to the
>>>>> > > user to see the file system's hierarchy reflected in the schema
>>>>> > > model.
>>>>> > > 2) Because it allows us to make these DataContexts operate not
>>>>> > > on a single file, but on a directory of files. I have seen this
>>>>> > > quite a number of times by now that users of MetaModel, or users
>>>>> > > of e.g. DataCleaner, which uses MetaModel quite heavily, want to
>>>>> > > do this sort of stuff.
>>>>> > > 3) The removing of the file extension stuff is kind of broken
>>>>> > > and a strange convention in the first place.
>>>>> > >
>>>>> > > While this doesn't really break backwards compatibility in
>>>>> > > terms of Java code, it would break configuration files and
>>>>> > > other stuff of applications that use MetaModel. But I do
>>>>> > > believe that can be communicated and handled through carefully
>>>>> > > explaining the new convention on the migration page (that I
>>>>> > > recently started writing [1]).
>>>>> > >
>>>>> > > What do you think?
>>>>> > >
>>>>> > > [1]
>>>>> > > http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
>>>>>
