arrow-dev mailing list archives

From "Hossein Falaki (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ARROW-4723) Skip _files when reading a directory containing parquet files
Date Fri, 01 Mar 2019 02:39:00 GMT
Hossein Falaki created ARROW-4723:
-------------------------------------

             Summary: Skip _files when reading a directory containing parquet files
                 Key: ARROW-4723
                 URL: https://issues.apache.org/jira/browse/ARROW-4723
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Hossein Falaki


It is common for Apache Spark and other big data platforms to write additional metadata files,
whose names begin with an underscore (_), alongside the data when saving Parquet data.

When using {{make_batch_reader}} to load a directory of Parquet files that contains such files,
we encounter the following error:
{code:java}
PetastormMetadataError Traceback (most recent call last)
/databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py in infer_or_load_unischema(dataset)
    388 try:
--> 389 return get_schema(dataset) 
    390 except PetastormMetadataError:

/databricks/python/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py in get_schema(dataset)
    342 raise PetastormMetadataError( 
--> 343 'Could not find _common_metadata file. Use materialize_dataset(..) in' 
    344 ' petastorm.etl.dataset_metadata.py to generate this file in your ETL code.'

PetastormMetadataError: Could not find _common_metadata file. Use materialize_dataset(..)
in petastorm.etl.dataset_metadata.py to generate this file in your ETL code. You can generate
it on an existing dataset using petastorm-generate-metadata.py{code}
 

This is because our Runtime stores the following two files at the end of the job:
{code:java}
dbfs:/tmp/petastorm/_committed_4686077819843716563	_committed_4686077819843716563	1965
dbfs:/tmp/petastorm/_started_4686077819843716563{code}
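
The requested behaviour could be sketched roughly as follows. This is only an illustration using the Python standard library, not the actual pyarrow or petastorm code path; the helper names {{is_hidden_or_metadata}} and {{list_data_files}} are hypothetical, and the rule "skip every name starting with _ or ." is an assumption. A real reader would probably still want to handle {{_metadata}} and {{_common_metadata}} separately rather than discard them.

```python
import os
import tempfile


def is_hidden_or_metadata(name):
    """Return True for names starting with '_' or '.', e.g. _committed_123 or _SUCCESS.

    Hypothetical helper: the prefix rule is an assumption, not pyarrow's actual logic.
    """
    return name.startswith(("_", "."))


def list_data_files(directory):
    """Return sorted paths of regular files in `directory`, skipping '_' and '.' prefixed names."""
    paths = []
    for entry in os.scandir(directory):
        if entry.is_file() and not is_hidden_or_metadata(entry.name):
            paths.append(entry.path)
    return sorted(paths)


# Demo on a throwaway directory that mimics the listing above.
# "part-00000.parquet" is a made-up data file name used only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "part-00000.parquet",
        "_committed_4686077819843716563",
        "_started_4686077819843716563",
    ):
        open(os.path.join(tmp, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_data_files(tmp)]
    print(demo_names)  # only the data file remains: ['part-00000.parquet']
```

With a filter like this applied before schema inference, the {{_committed_*}} and {{_started_*}} marker files would never be handed to the Parquet reader.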



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
