CoderFunda: pandas/pyarrow ArrowTypeError: Unable to merge: Field on handcrafted partitioned parquet

When I read a single parquet file directly:
pd.read_parquet(r'C:\Datasets\cn_data\dm\qmt\wqa_mfeatures\30m\year=2020\month=11\data.parquet')

no error is reported.

But when I read the directory:
pd.read_parquet(r'C:\Datasets\cn_data\dm\qmt\wqa_mfeatures\30m\year=2020')

an error is reported:
ArrowTypeError: Unable to merge: Field month has incompatible types: int32 vs dictionary

This happens because I built the partitioned path by hand.
Building the path by hand is a requirement for me.
I have 5000 items to transform and write with df.to_parquet(path, partition_cols=['year', 'month']), and that call does not overwrite existing files.

* If I only need to rerun 300 items, it produces new files, and I can't delete the data produced by the last run.

* As time goes on, I need to rerun the transform function on new dates (year, month), and the new data must overwrite the old data.

I just want pd.read_parquet to keep working with handcrafted paths. That would save a lot of work reimplementing something similar and refactoring pd.read_parquet calls in many projects.

08 April, 2024
