commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br.INVALID>
Subject Re: [LANG] Add alphabet conversion API
Date Tue, 13 Sep 2016 22:00:46 GMT
+1
Bruno

 
      From: Benedikt Ritter <britter@apache.org>
 To: Commons Developers List <dev@commons.apache.org> 
 Sent: Wednesday, 14 September 2016 2:06 AM
 Subject: Re: [LANG] Add alphabet conversion API
   
Does this really belong in [LANG]? We also have Commons Text [1] in the
sandbox, which seems to be a better home for this functionality.

Benedikt

[1] http://commons.apache.org/sandbox/commons-text/

Rob Tompkins <chtompki@gmail.com> wrote on Tue, 13 Sep 2016 at
15:48:

>
> > On Sep 13, 2016, at 4:39 AM, Eyal Allweil <eyal_allweil@yahoo.com.INVALID>
> wrote:
> >
> > I've created a JIRA issue,
> https://issues.apache.org/jira/browse/LANG-1266, and a pull request for
> this: https://github.com/apache/commons-lang/pull/188
> > Regards, Eyal
> >
> >
> >
> >
> >    On Wednesday, September 7, 2016 5:27 PM, Eyal Allweil <
> eyal_allweil@yahoo.com> wrote:
> >
> >
> > Hi Simo,
> > I'm not sure I understood how BitSets would be used in this case. For
> > example, a version that takes chars might look like this:
> > AlphabetConverter ac = new AlphabetConverter(['a','b','c','d'],
> > ['a','e','f','g'],['a']) // 'a' is not encoded
>
> Hello Eyal,
>
> The first thing that springs to mind here is: are we naming this class
> appropriately? I’ll preface my naming argument by saying that I come from
> a mathematical background (combinatorics on words). Traditionally in the
> literature such a “mapping”
>
>        f: {Kleene Closure A} -> {Kleene Closure B}
>
> with the property f(StringConcatenate(x,y)) = StringConcatenate(f(x),f(y))
> for x,y strings from {Kleene Closure A}, is called a “Morphism” [1, pg.
> 8][2]. Clearly that name is quite terse when one comes from an application
> development mindset, so I’m not sure that going with the theoretical name
> is appropriate here. That said, I minimally wanted to bring it up so that
> we can have open discourse about naming.
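
The morphism property quoted above, f(xy) = f(x)f(y), can be illustrated with a small sketch. The class name MorphismSketch and the method apply are hypothetical names chosen for this illustration; they are not part of the proposed API. The sketch assumes a morphism is given by the image of each single letter, which is enough to determine it on every word.

```java
import java.util.Map;

// Hypothetical illustration of a word morphism; not code from the pull request.
final class MorphismSketch {

    // Maps each letter of the word to its image and concatenates the images.
    // Because the result is built letter by letter, apply(x + y) always equals
    // apply(x) + apply(y), which is the morphism property.
    static String apply(Map<Character, String> images, String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char letter = word.charAt(i);
            String image = images.get(letter);
            if (image == null) {
                throw new IllegalArgumentException("letter not in alphabet: " + letter);
            }
            sb.append(image);
        }
        return sb.toString();
    }
}
```

Letting a letter map to a longer string (for example c -> "hi") still satisfies the property, although such a map is not necessarily decodable.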
>
> After looking at the code some, the following thoughts come to mind (note:
> I’m not tied to any of them, these are just things that ran through my
> head):
> - There are some stylistic differences that stand out (e.g. "methodName
>   (signature)" as opposed to "methodName(signature)").
> - More javadoc?
> - Do we need the "doNotEncodeMap"?
> - The ".equals" method could use a null check.
> - Do we want to accommodate non-invertible or non-decodable encodings (e.g.
>   new AlphabetConverter(['a','b','c','d'],['a','e','f','e'],['a']))?
> - Do we want to accommodate alphabets over concatenated chars (e.g. new
>   AlphabetConverter(['ab','c','d','e'],['a','k','hi','z'],[]))?
>
> Personally, I like the idea of generalizing the input/output alphabets,
> but that would seem to require a superclass holding the general
> implementation, with an invertible AlphabetConverter as a subclass.
>
> All that said, I’m not particularly tied to any of the ideas, and aside
> from the stylistic changes and the .equals bit, the changes seem quite
> reasonable. I would love to hear other folks’ thoughts on the proposed
> functionality.
>
> Cheers,
> -Rob
>
> Biblio.
> [1] Jean-Paul Allouche and Jeffrey Shallit. Automatic sequences. Cambridge
> University Press, Cambridge, 2003. Theory, applications, and
> generalizations.
>
> [2] https://en.wikipedia.org/wiki/Free_monoid#Morphisms
>
> >
> > and the mapping would become a -> a, b -> e, c -> f, d -> g
> > so encode("abc") would return "aef".
> > Ints can be used instead of chars to support Unicode code points that
> > don't fit in a single char (which was our case, but if that seems
> > overkill, the chars implementation is much more direct).
> > How did you mean the BitSet to be used?
> > Regards, Eyal
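
A minimal sketch of the char-based mapping from Eyal's example (a -> a, b -> e, c -> f, d -> g). Everything here is an assumption made for illustration: the class name SimpleAlphabetConverter, the two-array constructor, and the positional pairing of original[i] with encoding[i] are hypothetical and are not the code in the pull request. The doNotEncode argument is left out; a character that should stay unchanged is simply paired with itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the implementation from the pull request.
final class SimpleAlphabetConverter {
    private final Map<Character, Character> encodeMap = new HashMap<>();
    private final Map<Character, Character> decodeMap = new HashMap<>();

    // Assumption: original[i] is encoded as encoding[i].
    SimpleAlphabetConverter(char[] original, char[] encoding) {
        if (original.length != encoding.length) {
            throw new IllegalArgumentException("alphabets must have the same size");
        }
        for (int i = 0; i < original.length; i++) {
            // Reject duplicates on either side so the mapping stays invertible.
            if (encodeMap.put(original[i], encoding[i]) != null) {
                throw new IllegalArgumentException("duplicate source char: " + original[i]);
            }
            if (decodeMap.put(encoding[i], original[i]) != null) {
                throw new IllegalArgumentException("duplicate target char: " + encoding[i]);
            }
        }
    }

    String encode(String original) {
        return translate(original, encodeMap);
    }

    String decode(String encoded) {
        return translate(encoded, decodeMap);
    }

    // Replaces every char using the given map; chars outside the alphabet are rejected.
    private static String translate(String input, Map<Character, Character> map) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            Character mapped = map.get(input.charAt(i));
            if (mapped == null) {
                throw new IllegalArgumentException("char not in alphabet: " + input.charAt(i));
            }
            sb.append(mapped.charValue());
        }
        return sb.toString();
    }
}
```

With the alphabets from the example, encode("abc") returns "aef" and decode("aef") returns "abc". A non-invertible target alphabet such as a, e, f, e is rejected in the constructor.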
> >
> >
> >
> >    On Thursday, September 1, 2016 12:26 PM, Simone Tripodi <
> simonetripodi@apache.org> wrote:
> >
> >
> > Hi, I personally think it would be a very "nice to have" feature. I had to
> > face similar issues in the past and, had that feature been available, it
> > would have saved me development time.
> > I just have a small request/suggestion: since int and char can be cast to
> > each other, I would use BitSets rather than Sets.
> > Good luck! -Simo
> >
> > http://people.apache.org/~simonetripodi/
> > http://twitter.com/simonetripodi
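
One possible reading of the BitSet suggestion, sketched under assumptions: a java.util.BitSet indexed by code point can stand in for a Set<Integer> when only membership matters (for example, for the doNotEncode characters). The class name CodePointSet and its methods are hypothetical and are not part of the proposal. Note that a BitSet records membership only, so it cannot by itself express which character maps to which.

```java
import java.util.BitSet;

// Hypothetical illustration only: a set of code points backed by a BitSet.
final class CodePointSet {
    private final BitSet bits = new BitSet();

    // Adds every code point of the given string to the set.
    CodePointSet(String members) {
        members.codePoints().forEach(cp -> bits.set(cp));
    }

    // True if the code point was added; works for supplementary code points too.
    boolean contains(int codePoint) {
        return bits.get(codePoint);
    }

    // Number of distinct code points in the set.
    int size() {
        return bits.cardinality();
    }
}
```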
> > On Thu, Sep 1, 2016 at 10:53 AM, Eyal Allweil <eyal_allweil@yahoo.com.invalid>
> wrote:
> >
> > Hi guys,
> > Would you be interested in adding a utility class that creates alphabet
> converters, perhaps using a helper method available from StringUtils? It
> doesn't have to stay the way it is now, but the API for the class -
> AlphabetConverter - is currently:
> > /**
> >  * The input is integers representing code points, but we can make it
> >  * accept chars as well
> >  *
> >  * doNotEncode represents chars we want to leave in the original state
> >  * (not to encode them using the chars in encoding)
> >  */
> > public AlphabetConverter(Set<Integer> original, Set<Integer> encoding,
> >         Set<Integer> doNotEncode);
> > public String encode (String original);
> >
> > public String decode (String encoded);
> > In StringUtils, we could add
> >
> > public AlphabetConverter getAlphabetConverter (Set<Integer> original,
> >         Set<Integer> encoding, Set<Integer> doNotEncode);
> > I used it to convert from Unicode to Latin letters, without using any
> > chars I wanted as delimiters, and preserving the English alphabet as is for
> > readability. If you'd like to add it, I'll clean up the code and prepare it
> > for a pull request so you can review it.
> >
> > It makes sense to me to add a method that returns the HashMaps used
> internally for the mappings so they can be serialized (and deserialized)
> for preserving the mapping.
> > Regards, Eyal Allweil (PayPal)
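
The idea of exposing the internal mapping so it can be serialized and restored could look roughly like the sketch below. It assumes the mapping is held as a HashMap<Integer, Integer> from original code point to encoded code point and uses standard Java object serialization; the class name MappingSerialization and both method names are hypothetical, and the actual class may store or expose its mappings differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical illustration only: round-trips a code point mapping through bytes.
final class MappingSerialization {

    // Serializes the mapping with standard Java object serialization.
    static byte[] toBytes(HashMap<Integer, Integer> mapping) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(mapping);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restores a mapping previously written by toBytes.
    @SuppressWarnings("unchecked")
    static HashMap<Integer, Integer> fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (HashMap<Integer, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```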
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>

   
 