crunch-dev mailing list archives

From Micah Whitacre <mkw...@gmail.com>
Subject Re: (CRUNCH-184) Gora backend implementation: first attempt
Date Tue, 26 May 2015 22:43:29 GMT
Vincent,

Thanks for the patch.  First off, feel free to attach this directly to the
JIRA issue.[1]  We typically review patches that have been attached there.

To answer some of your questions:

>> HBaseSourceTarget implements TableSource<..., ...>, but GoraSourceTarget
>> implements Source<Pair<K, V>>, Gora DataStore is a map and not a multimap.
>> Should it be a TableSource anyway ?

Not being familiar with Gora: do consumers typically interact with the data
in a K/V manner?  While PTables can be multimaps, they don't necessarily
have to be.  Making the data available as a PTable would make sense if
consumers would typically need to do joins/grouping on a key meaningful to
Gora.  As an example, in HBase a consumer might set a batching value that
would break up a single row.  Making grouping easier allows the consumer to
recombine the row for processing.
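
To illustrate the recombination point, here is a rough sketch in plain Java.
It only stands in for what PTable.groupByKey() does in Crunch: the class
name, row keys, and cell strings are made up for illustration, and no Crunch
or HBase API is used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeySketch {

    // Collects every value seen for a key, preserving arrival order per key.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical input: row1 arrives as two fragments, as if batching had split it.
        List<Map.Entry<String, String>> fragments = Arrays.asList(
            new SimpleEntry<String, String>("row1", "colA=1"),
            new SimpleEntry<String, String>("row2", "colA=7"),
            new SimpleEntry<String, String>("row1", "colB=2"));
        // prints {row1=[colA=1, colB=2], row2=[colA=7]}
        System.out.println(groupByKey(fragments));
    }
}
```

If Gora only ever yields one value per key, the grouping step buys nothing,
which is the trade-off behind the TableSource question.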

>>  GoraSourceIT test failure

When you use the MRPipeline, the source is actually serialized and
re-instantiated on the cluster, whereas the MemPipeline uses the instance
you created in memory.  If you need values like the start and end keys,
which I see in your GoraSourceTarget, to be available when running on the
MRPipeline, then those will need to be properly configured.  Look at how
the HBase impls make scans available.  Also, when quickly glancing through
this guide, I saw references to init calls:

http://gora.apache.org/current/tutorial.html#constructing-the-job

While you don't necessarily have to call those mappers, you'll probably
want to make sure any config they are doing is handled in the
Source/Target setup.
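
As a rough sketch of that idea, using only the JDK: a plain Map stands in
for the Hadoop job Configuration, and the class name and property names
below are hypothetical, not taken from Crunch or Gora.  The driver-side
object writes its start and end keys into the configuration, and the task
side rebuilds the range from the configuration alone, never touching the
driver's in-memory instance.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyRangeConfigSketch {

    // Hypothetical property names, chosen only for this sketch.
    static final String START_KEY = "sketch.source.start.key";
    static final String END_KEY = "sketch.source.end.key";

    private final String startKey;
    private final String endKey;

    KeyRangeConfigSketch(String startKey, String endKey) {
        this.startKey = startKey;
        this.endKey = endKey;
    }

    // Driver side: copy the fields into the job configuration (a Map stands in for it).
    void configure(Map<String, String> conf) {
        conf.put(START_KEY, startKey);
        conf.put(END_KEY, endKey);
    }

    // Task side: a new instance is built from the configuration only.
    static KeyRangeConfigSketch fromConf(Map<String, String> conf) {
        return new KeyRangeConfigSketch(conf.get(START_KEY), conf.get(END_KEY));
    }

    String describe() {
        return "[" + startKey + ", " + endKey + "]";
    }

    public static void main(String[] args) {
        KeyRangeConfigSketch driverSide = new KeyRangeConfigSketch("a", "m");
        Map<String, String> conf = new HashMap<String, String>();
        driverSide.configure(conf);

        // prints [a, m]
        System.out.println(fromConf(conf).describe());
        // Without configure(), the task side sees nothing: prints [null, null]
        System.out.println(fromConf(new HashMap<String, String>()).describe());
    }
}
```

In the real code the same round trip would go through the job's Hadoop
Configuration; the HBase source is the place to check how Crunch does it.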

>> - Should there be an equivalent to HBaseTypes.puts and
>> HBaseTypes.deletes with Gora?

Once again, I'm not familiar with Gora, but the need for those is
predicated on how consumers would typically interact with the system.
Those representations make it easy for consumers to perform standard
operations on HBase without having to worry about serializing the HBase
Put into the correct byte[] for the service to know what to do.  It
doesn't look like Gora necessarily has straight Puts/Deletes like HBase:

https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Put.html


So the question is how does one represent an insertion/deletion in the Gora
input format or output format?

>> Crunch & Eclipse warning:

You can ignore that lifecycle warning if you are building everything
through Maven and not Eclipse.  I believe that is just because the sources
being generated for the tests are not being handled by Eclipse when it is
trying to control that project.

>> More generally, what about code quality? (still junior...)

I haven't gotten a chance to do a deep review of your code yet, but don't
worry about that; we can help with it.

Thanks,
Micah


[1] - https://issues.apache.org/jira/browse/CRUNCH-184

On Mon, May 25, 2015 at 3:06 AM, Vincent Fabro <
vincent.fabro.nutch@gmail.com> wrote:

> Dear all
>
> A patch for a crude Gora backend implementation is attached. I copy-pasted
> the HBase implementation and made modifications.
>
> I have questions to push it further:
>
> - HBaseSourceTarget implements TableSource<..., ...>, but
> GoraSourceTarget implements Source<Pair<K, V>>, Gora DataStore is a map
> and not a multimap. Should it be a TableSource anyway ?
>
> - I made simple examples in GoraSourceIT (will be removed, no proper tests
> yet). You can read/write to a GoraSourceTarget when using MemPipeline, but
> MRPipeline gives the following error when reading from a Gora MemStore
> (GoraSourceIT.testGoraTarget()):
> 1035 [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to
> load native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2205 [Thread-2] WARN  org.apache.hadoop.mapreduce.JobSubmitter  - Hadoop
> command-line option parsing not performed. Implement the Tool interface and
> execute your application with ToolRunner to remedy this.
> 2207 [Thread-2] WARN  org.apache.hadoop.mapreduce.JobSubmitter  - No job
> jar file set.  User classes may not be found. See Job or Job#setJar(String).
> 2925 [Thread-2] INFO
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob  -
> Running job "org.apache.crunch.io.gora.GoraSourceIT:
> GoraDataStore(org.apache.gora.memory.store.MemStore@2b3b2... ID=1 (1/1)"
> 2925 [Thread-2] INFO
> org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob  -
> Job status available at: http://localhost:8080/
> java.util.NoSuchElementException
>     at java.util.TreeMap.key(TreeMap.java:1221)
>     at java.util.TreeMap.firstKey(TreeMap.java:285)
>     at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
>     at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
>     at
> org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
>     at
> org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
>     at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>     at
> org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>     at
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>     at
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
>     at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> - Should there be an equivalent to HBaseTypes.puts and HBaseTypes.deletes
> with Gora?
>
> - When Crunch was imported to Eclipse, the following problem appeared in
> crunch-hbase/pom.xml:
> Plugin execution not covered by lifecycle configuration:
>  org.apache.maven.plugins:maven-dependency-plugin:2.8:build-classpath
>  (execution: create-mrapp-generated-classpath, phase: generate-test-
>  resources)
> What could be the reason (for the moment I let Eclipse automatically fix
> the problem) ?
>
> - More generally, what about code quality? (still junior...)
>
> I don't know if it's headed in the right place, so thanks in advance for
> your directions.
>
> Vincent
>
