pyspark.pandas.DataFrame.spark.apply

spark.apply(func: Callable[[pyspark.sql.dataframe.DataFrame], pyspark.sql.dataframe.DataFrame], index_col: Union[str, List[str], None] = None) → ps.DataFrame
Applies a function that takes and returns a Spark DataFrame. It allows you to natively apply Spark functions and column APIs, using the Spark columns that internally back the Series or Index.

Note: set index_col and keep a column named accordingly in the output Spark DataFrame to avoid falling back to the default index, which incurs a performance penalty. If you omit index_col, the default index is used, which is potentially expensive in general.

Note: column labels are lost. This method is a synonym of func(psdf.to_spark(index_col)).pandas_api(index_col).
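As a hedged sketch of that equivalence (assuming a running Spark session and the psdf defined in the examples below; sdf_func is a hypothetical name introduced here for illustration), both expressions are expected to yield the same pandas-on-Spark DataFrame:

>>> sdf_func = lambda sdf: sdf.selectExpr("a + b as c", "index")  # any Spark-DataFrame-to-Spark-DataFrame function
>>> psdf.spark.apply(sdf_func, index_col="index")  # doctest: +SKIP
>>> sdf_func(psdf.to_spark(index_col="index")).pandas_api(index_col="index")  # the spelled-out synonym  # doctest: +SKIP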
Parameters
func : function
    Function that takes a Spark DataFrame and returns a Spark DataFrame; it is applied against the data.
index_col : str or list of str, optional
    Column name(s) in the input and output Spark DataFrames to use as the index, as described in the notes above.
 
Returns
DataFrame
 
Raises
ValueError
    If the output from the function is not a Spark DataFrame.
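As a hedged illustration of this failure mode (not from the original docs): sdf.count() returns an int rather than a Spark DataFrame, so a call like the following is expected to raise ValueError:

>>> psdf.spark.apply(lambda sdf: sdf.count())  # doctest: +SKIP
Traceback (most recent call last):
  ...
ValueError: ...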
 
Examples

>>> psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, columns=["a", "b"])
>>> psdf
   a  b
0  1  4
1  2  5
2  3  6

>>> psdf.spark.apply(
...     lambda sdf: sdf.selectExpr("a + b as c", "index"), index_col="index")
       c
index
0      5
1      7
2      9

The case below ends up using the default index, which should be avoided if possible.

>>> psdf.spark.apply(lambda sdf: sdf.groupby("a").count().sort("a"))
   a  count
0  1      1
1  2      1
2  3      1
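As a further hedged sketch (not in the original docs), a default index can often be avoided by keeping the index column through the aggregation and passing index_col; here b_sum is a hypothetical alias chosen for illustration:

>>> from pyspark.sql import functions as F
>>> psdf.spark.apply(
...     lambda sdf: sdf.groupby("index").agg(F.sum("b").alias("b_sum")),
...     index_col="index")  # doctest: +SKIP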