drill-dev mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: Data types
Date Fri, 27 Jan 2017 16:53:06 GMT
I’m actually one of the contributors to the forthcoming O’Reilly book on Drill (along with Ted and Ellen), and this is exactly the functionality I’m planning to write a chapter about.  (Not the buffers, but how to get Drill to ingest other file formats.)



 
> On Jan 27, 2017, at 11:50, Paul Rogers <progers@mapr.com> wrote:
> 
> Hi Charles,
> 
> Congrats! Unfortunately, no, there is no documentation. Drill seems to be of the “code
speaks for itself” persuasion. I try to document the bits I’ve had to learn on my Github
Wiki, but (until now) I’ve not looked at this particular area.
> 
> IMHO, now that the plugins basically work, the API could use a good scrubbing to make
it simpler, easier to document, and easier to use. As it is, you have to be an expert on Drill
internals to understand all the little knick-knacks that have to be in your code to make various
Drill subsystems happy.
> 
> That said, perhaps you can use your own Git Wiki to document what you’ve learned so
that we capture that for the next plugin developer.
> 
> Thanks,
> 
> - Paul
> 
>> On Jan 27, 2017, at 8:42 AM, Charles Givre <cgivre@gmail.com> wrote:
>> 
>> Hi Paul,
>> VICTORY!!  I just set the buffer size to 4096 and it worked perfectly without truncating
my data! 
>> Is this documented anywhere?  I’ve been trying to wrap my head around the mechanics of how Drill reads data and how the format plugins work, and I really haven’t found much.  I’ve hacked together a few other plugins like this (which work), but if I could find some docs, that would be great.
>> Thanks,
>> — Charles
>> 
>> 
>> 
>>> On Jan 27, 2017, at 02:11, Paul Rogers <progers@mapr.com> wrote:
>>> 
>>> Looks like I gave you advice that was a bit off. The function you want is one of the following:
>>> 
>>>          this.buffer = fragmentContext.getManagedBuffer();
>>> 
>>> The above allocates a 256 byte buffer. You can initially allocate a larger one:
>>> 
>>>          this.buffer = fragmentContext.getManagedBuffer(4096);
>>> 
>>> Or, to reallocate:
>>> 
>>>         buffer = fragmentContext.replace(buffer, 8192);
>>> 
>>> Again, I’ve not used these methods myself, but they seem like they might do the trick.
>>> 
>>> - Paul
>>> 
>>>> On Jan 26, 2017, at 9:51 PM, Charles Givre <cgivre@gmail.com> wrote:
>>>> 
>>>> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I tried your changes and now I’m getting this error:
>>>> 
>>>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>>>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>>>> 
>>>> Fragment 0:0
>>>> 
>>>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on charless-mbp-2.fios-router.home:31010]
(state=,code=0)
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jan 26, 2017, at 23:08, Paul Rogers <progers@mapr.com> wrote:
>>>>> 
>>>>> Hi Charles,
>>>>> 
>>>>> Very cool plugin!
>>>>> 
>>>>> My knowledge in this area is a bit sketchy… That said, the problem appears to be that the code does not extend the DrillBuf to ensure it has sufficient capacity. Try calling the reallocIfNeeded method, something like this:
>>>>> 
>>>>>   this.buffer = this.buffer.reallocIfNeeded(stringLength);
>>>>>   this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>> 
>>>>> Then, comment out the 256 length hack and see if it works.
>>>>> 
>>>>> To avoid memory fragmentation, maybe change your loop as:
>>>>> 
>>>>>        int maxRecords = MAX_RECORDS_PER_BATCH;
>>>>>        int maxWidth = 256;
>>>>>        while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
>>>>>           …
>>>>>           if (stringLength > maxWidth) {
>>>>>              maxWidth = stringLength;
>>>>>              maxRecords = 16 * 1024 * 1024 / maxWidth;
>>>>>           }
>>>>>        }
>>>>> 
>>>>> The above is not perfect (the last record added might be much larger than the others, causing the corresponding vector to grow beyond 16 MB), but the occasional large vector should be OK.
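The adaptive loop above can be exercised on its own. Below is a self-contained sketch of the same idea using the thread’s numbers (a 16 MB per-vector budget and the usual 64K row cap); the class and method names here are invented for illustration, and it reads from a string array rather than a real file reader:

```java
import java.nio.charset.StandardCharsets;

public class AdaptiveBatchDemo {
    // Scans the given lines following the loop sketched above and returns
    // {widest value seen in bytes, resulting row budget for the batch}.
    static int[] run(String[] lines) {
        int maxRecords = 64 * 1024;   // the usual 64K row cap
        int maxWidth = 256;           // starting width guess
        int recordCount = 0;
        for (String line : lines) {
            if (recordCount >= maxRecords) {
                break;                // batch is full
            }
            int stringLength = line.getBytes(StandardCharsets.UTF_8).length;
            // A wider value shrinks the row budget so that a single
            // value vector stays under the 16 MB target.
            if (stringLength > maxWidth) {
                maxWidth = stringLength;
                maxRecords = 16 * 1024 * 1024 / maxWidth;
            }
            recordCount++;
        }
        return new int[] { maxWidth, maxRecords };
    }

    public static void main(String[] args) {
        int[] r = run(new String[] { "short line", "x".repeat(2048) });
        System.out.println(r[0]); // 2048: the widest line seen
        System.out.println(r[1]); // 8192: 16 MB / 2048 bytes per row
    }
}
```

As Paul notes, the budget only shrinks when a wider value appears, so one oversized record late in a batch can still push that batch’s vector past 16 MB.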
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> - Paul
>>>>> 
>>>>> On Jan 26, 2017, at 5:31 PM, Charles Givre <cgivre@gmail.com> wrote:
>>>>> 
>>>>> Hi Paul,
>>>>> Would you mind taking a look at my code?  I’m wondering if I’m doing this correctly.  Just for context, I’m working on a generic log file reader for Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered some errors when working with fields that were > 256 characters long.  It isn’t a storage plugin, but it extends the EasyFormatPlugin.
>>>>> 
>>>>> I added some code to truncate the strings to 256 chars, and it worked.  Before that, it was throwing the errors shown below:
>>>>> 
>>>>> 
>>>>> 
>>>>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>>>>> 
>>>>> Fragment 0:0
>>>>> 
>>>>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on charless-mbp-2.fios-router.home:31010]
(state=,code=0)
>>>>> 
>>>>> 
>>>>> The query that generated this was just a SELECT * FROM dfs.`file`.  Also,
how do I set the size of each row batch?
>>>>> Thank you for your help.
>>>>> — C
>>>>> 
>>>>> 
>>>>> if (m.find()) {
>>>>>     for (int i = 1; i <= m.groupCount(); i++) {
>>>>>         //TODO Add option for date fields
>>>>>         String fieldName = fieldNames.get(i - 1);
>>>>>         String fieldValue = m.group(i);
>>>>> 
>>>>>         if (fieldValue == null) {
>>>>>             fieldValue = "";
>>>>>         }
>>>>>         byte[] bytes = fieldValue.getBytes("UTF-8");
>>>>> 
>>>>>         //Added this and it worked….
>>>>>         int stringLength = bytes.length;
>>>>>         if (stringLength > 256) {
>>>>>             stringLength = 256;
>>>>>         }
>>>>> 
>>>>>         this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>         map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>>     }
>>>>> }
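In isolation, the work-around above avoids the index/length error by silently capping every value at 256 bytes and discarding the rest. A minimal, Drill-free sketch of that behavior (class and method names invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class TruncationDemo {
    // Mirrors the hack: cap the value at 256 bytes before writing it.
    static int cappedLength(String fieldValue) {
        int stringLength = fieldValue.getBytes(StandardCharsets.UTF_8).length;
        return Math.min(stringLength, 256);
    }

    public static void main(String[] args) {
        String longField = "x".repeat(430); // same width as the error message
        System.out.println(cappedLength(longField)); // 256: 174 bytes are lost
        System.out.println(cappedLength("short"));   // 5: unchanged
    }
}
```

This makes the data loss concrete: the 430-byte field from the earlier error message would come back 256 bytes long, which is why enlarging the buffer, rather than truncating, is the real fix.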
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 26, 2017, at 20:20, Paul Rogers <progers@mapr.com> wrote:
>>>>> 
>>>>> Hi Charles,
>>>>> 
>>>>> The Varchar column can hold any length of data. We’ve recently been
working on tests that have columns up to 8K in length.
>>>>> 
>>>>> The one caveat is that, when working with data larger than 256 bytes, you must be extremely careful in your reader. The out-of-the-box text reader always reads 64K rows per batch. This (due to various issues) can cause memory fragmentation and OOM errors when used with columns wider than 256 bytes.
>>>>> 
>>>>> If you are developing your own storage plugin, then adjust the size of
each row batch so that no single vector is larger than 16 MB in size. Then you can use any
size of column.
>>>>> 
>>>>> Suppose your logs contain text lines up to, say, 1K in size. This means each record batch your reader produces must be limited to 16 MB / 1 KB per row = 16K rows (rather than the usual 64K).
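Spelled out, that figure is just the per-vector budget divided by the row width. A tiny sketch of the arithmetic (names invented here):

```java
public class RowsPerBatch {
    // Row budget that keeps a single value vector under 16 MB,
    // given a maximum row width in bytes.
    static int rowsPerBatch(int rowWidthBytes) {
        int vectorBudget = 16 * 1024 * 1024; // 16 MB per value vector
        return vectorBudget / rowWidthBytes;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerBatch(1024)); // 16384 rows for 1 KB lines
        System.out.println(rowsPerBatch(8192)); // 2048 rows for 8 KB lines
    }
}
```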
>>>>> 
>>>>> Once the data is in the Varchar column, the rest of Drill should “just
work” on that data.
>>>>> 
>>>>> - Paul
>>>>> 
>>>>> On Jan 26, 2017, at 4:11 PM, Charles Givre <cgivre@gmail.com> wrote:
>>>>> 
>>>>> I’m working on a plugin to read log files and the data has some long
strings.  Is there a data type that can hold strings longer than 256 characters?
>>>>> Thanks,
>>>>> — Charles
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

