pyspark.pandas.DataFrame.to_parquet

DataFrame.to_parquet(path: str, mode: str = 'w', partition_cols: Union[str, List[str], None] = None, compression: Optional[str] = None, index_col: Union[str, List[str], None] = None, **options: Any) → None
Write the DataFrame out as a Parquet file or directory.

Parameters

path : str, required
    Path to write to.
mode : str
    Python write mode, default 'w'.

    Note: mode can also accept Spark's write modes: 'append', 'overwrite', 'ignore', 'error', 'errorifexists'.

    - 'append' (equivalent to 'a'): Append the new data to existing data.
    - 'overwrite' (equivalent to 'w'): Overwrite existing data.
    - 'ignore': Silently ignore this operation if data already exists.
    - 'error' or 'errorifexists': Throw an exception if data already exists.
partition_cols : str or list of str, optional, default None
    Names of partitioning columns.
compression : str {'none', 'uncompressed', 'snappy', 'gzip', 'lzo', 'brotli', 'lz4', 'zstd'}
    Compression codec to use when saving to file. If None is set, it uses the value specified in spark.sql.parquet.compression.codec.
index_col : str or list of str, optional, default None
    Column names to be used in Spark to represent pandas-on-Spark's index. The index name in pandas-on-Spark is ignored. By default, the index is always lost (see the round-trip sketch after the Examples).
options : dict
    All other options passed directly into Spark's data source.

Examples

>>> df = ps.DataFrame(dict(
...    date=list(pd.date_range('2012-1-1 12:00:00', periods=3, freq='M')),
...    country=['KR', 'US', 'JP'],
...    code=[1, 2, 3]), columns=['date', 'country', 'code'])
>>> df
                 date country  code
0 2012-01-31 12:00:00      KR     1
1 2012-02-29 12:00:00      US     2
2 2012-03-31 12:00:00      JP     3

>>> df.to_parquet('%s/to_parquet/foo.parquet' % path, partition_cols='date')

>>> df.to_parquet(
...     '%s/to_parquet/foo.parquet' % path,
...     mode='overwrite',
...     partition_cols=['date', 'country'])
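
The examples above cover mode and partition_cols. As a further, non-authoritative sketch, the snippet below shows compression and index_col and reads the data back with ps.read_parquet so the index survives the round trip; the path '/tmp/to_parquet/bar.parquet' and the column name 'idx' are hypothetical and only for illustration.

>>> import pyspark.pandas as ps
>>> df2 = ps.DataFrame({'country': ['KR', 'US', 'JP'], 'code': [1, 2, 3]})
>>> # index_col='idx' persists the index as a column named 'idx'; without it the index is lost on write.
>>> df2.to_parquet(
...     '/tmp/to_parquet/bar.parquet',
...     mode='overwrite',
...     compression='snappy',
...     index_col='idx')
>>> # Restore the saved index when reading the data back.
>>> restored = ps.read_parquet('/tmp/to_parquet/bar.parquet', index_col='idx')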