beam-user mailing list archives

From Kenneth Knowles <...@google.com>
Subject Re: Writing Out List<String>
Date Fri, 20 May 2016 18:46:39 GMT
Hi Jesse,

I'm having trouble following exactly where the trouble is arising, but let
me expand my main recommendation to be an edit of your code snippet (please
forgive any typos or type errors).

Original:
----------
orderedList
  .apply(TextIO.Write
    .withCoder(ListCoder.of(StringDelegateCoder.of(String.class)))
    .to("output/result"));


My main recommendation
---------------------
import static org.apache.beam.sdk.values.TypeDescriptors.strings;

orderedList
  .apply(MapElements.via(x -> x.toString()).withOutputType(strings()))
  .apply(TextIO.Write.to("output/result"));
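If the goal is one list entry per line, as described in the quoted message
below, the function passed to MapElements could build that string explicitly
instead of relying on List#toString(). The following is only a sketch in plain
Java: ListLines, escape, and toLines are hypothetical names, not part of Beam,
and the escaping scheme is an assumption about the desired output format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not a Beam class): turns one List<String> element into
// the exact text to write, with one list entry per line.
final class ListLines {
  private ListLines() {}

  // Escape backslashes and line breaks so each entry stays on a single line.
  // This particular escaping scheme is an assumption, not a Beam convention.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r");
  }

  // Join the escaped entries with newlines; no trailing newline is added.
  static String toLines(List<String> entries) {
    return entries.stream().map(ListLines::escape).collect(Collectors.joining("\n"));
  }

  public static void main(String[] args) {
    System.out.println(toLines(Arrays.asList("first", "second", "multi\nline")));
  }
}
```

In the pipeline, the lambda x -> x.toString() would then become something like
x -> ListLines.toLines(x), leaving the rest of the snippet unchanged.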


Another approach, which I do not recommend
--------------------------------------------------------------
orderedList
  .apply(TextIO.Write
    .withCoder(StringDelegateCoder.of(List.class))
    .to("output/result"));

I don't recommend it because StringDelegateCoder is really intended for
things like URI, which have a canonical string representation for 1-1
conversions, not for human-readable output.

If neither of these works for you, perhaps you could paste a larger snippet
of your pipeline.

Kenn

On Thu, May 19, 2016 at 9:32 PM, Jesse Anderson <jesse@smokinghand.com>
wrote:

> I'm writing out a PCollection<List<String>>. My goal is to write out each
> element in the list as a new line.
>
> The StringUtf8Coder also writes out a VarInt for the size of the bytes.
> The StringDelegateCoder with the ListCoder doesn't actually write out
> text.
>
> I think List<String> support should be added to TextIO.Write. Or maybe a
> new coder needs to be added that outputs text, with support for Lists, KVs,
> Sets, etc.
>
> On Thu, May 19, 2016 at 9:23 PM Kenneth Knowles <klk@google.com> wrote:
>
>> Hi Jesse,
>>
>> StringDelegateCoder does just what you have said: it encodes using
>> #toString() and decodes assuming a single-arg constructor.
>>
>> But by analogy with what you have written, and if I understand your goals
>> correctly, what you want here is TextIO.Write.withCoder(StringDelegateCoder.of(List.class))
>> since you want to base it on List#toString() not String#toString().
>>
>> That said, probably the best way to write a reliable and/or readable
>> format with TextIO.Write is to intentionally produce just the string you
>> want for your output format - including escaping newlines, etc - and then
>> use StringUtf8Coder.
>>
>> Kenn
>>
>> On Thu, May 19, 2016 at 9:00 PM, Jesse Anderson <jesse@smokinghand.com>
>> wrote:
>>
>>> I'm trying to write out a List<String> with TextIO.Write. The only
>>> supported type is String. I ended up writing an anonymous coder.
>>>
>>> I want to check if there is a coder that I couldn't find that would
>>> just take an object and write out the .toString() of it.
>>>
>>> I tried this:
>>>
>>> orderedList.apply(TextIO.Write.withCoder(ListCoder.of(StringDelegateCoder.of(String.class))).to("output/result"));
>>>
>>> But a VarInt is encoded along with everything. I'm looking for a coder
>>> that only writes out the UTF8.
>>>
>>> This functionality would be similar to Hadoop TextOutputFormat. It just
>>> runs a .toString before writing it out.
>>>
>>> In the anonymous coder I wrote, I hit a weird issue. This code just
>>> writes out a bunch of "\n". Yes, value is populated with data.
>>>           dataOutputStream.writeUTF(value);
>>>           dataOutputStream.writeUTF("\n");
>>>
>>> This code works:
>>>           byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
>>>           dataOutputStream.write(bytes);
>>>           dataOutputStream.writeUTF("\n");
>>>
>>> I took this from the string coder. What's odd is that DataOutputStream's
>>> writeUTF should work too. Is there a reason why it doesn't?
>>>
>>> Thanks,
>>>
>>> jesse
>>>
>>
>>
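
On the writeUTF question in the quoted message above: DataOutputStream.writeUTF
does not emit plain text. It writes a two-byte length prefix followed by the
string in modified UTF-8, so its output differs from write(getBytes(UTF_8)).
That may account for the unexpected file contents, although the thread does not
confirm the cause. A small sketch comparing the two (WriteUtfDemo and its
method names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class comparing the bytes produced by the two approaches.
final class WriteUtfDemo {
  private WriteUtfDemo() {}

  // writeUTF: two-byte big-endian length, then the modified UTF-8 bytes.
  static byte[] viaWriteUTF(String value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  // write(byte[]): only the UTF-8 bytes, with no prefix.
  static byte[] viaRawBytes(String value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.write(value.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(viaWriteUTF("hi\n"))); // [0, 3, 104, 105, 10]
    System.out.println(Arrays.toString(viaRawBytes("hi\n"))); // [104, 105, 10]
  }
}
```

Note that writeUTF("\n") in the "working" snippet also emits the bytes 0, 1, 10
rather than a bare newline, so writing "\n".getBytes(StandardCharsets.UTF_8)
would be the plain-text equivalent.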
