pyspark.pandas.DataFrame.apply
DataFrame.apply(func: Callable, axis: Union[int, str] = 0, args: Sequence[Any] = (), **kwds: Any) → Union[Series, DataFrame, Index]
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).

See also Transform and apply a function.

Note

When axis is 0 or 'index', func is unable to access the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func with each batch multiple times. Therefore, operations such as global aggregations are impossible. See the example below. (A sketch of a true global aggregation follows these notes.)

>>> # This case does not return the length of the whole series but of the batch
... # internally used.
... def length(s) -> int:
...     return len(s)
...
>>> df = ps.DataFrame({'A': range(1000)})
>>> df.apply(length, axis=0)
0     83
1     83
2     83
...
10    83
11    83
dtype: int32

Note

This API executes the function once to infer the type, which is potentially expensive, for instance, when the dataset is created after aggregations or sorting.

To avoid this, specify the return type as Series or a scalar value in func, for instance, as below:

>>> def square(s) -> ps.Series[np.int32]:
...     return s ** 2

pandas-on-Spark then uses the return type hint and does not try to infer the type.

When axis is 1, a DataFrame or scalar return type must be specified with type hints as below:

>>> def plus_one(x) -> ps.DataFrame[int, [float, float]]:
...     return x + 1

If the return type is specified as DataFrame, the output column names become c0, c1, c2 ... cn. These names are positionally mapped to the columns of the DataFrame returned by func.

To specify the column names, you can assign them in a pandas style as below:

>>> def plus_one(x) -> ps.DataFrame[("index", int), [("a", float), ("b", float)]]:
...     return x + 1

>>> pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> def plus_one(x) -> ps.DataFrame[
...         (pdf.index.name, pdf.index.dtype), zip(pdf.dtypes, pdf.columns)]:
...     return x + 1
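The first note above rules out global aggregations routed through apply; the usual way out is to call the native pandas-on-Spark reductions, which run over the whole column. A minimal illustrative sketch (not part of the official docstring):

>>> df = ps.DataFrame({'A': range(1000)})
>>> len(df)  # a native reduction sees all rows, unlike the batched apply above
1000
>>> df['A'].count()
1000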
Parameters

func : function
    Function to apply to each column or row.
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis along which the function is applied:

    - 0 or 'index': apply function to each column.
    - 1 or 'columns': apply function to each row.
args : tuple
    Positional arguments to pass to func in addition to the array/series.
**kwds
    Additional keyword arguments to pass as keyword arguments to func
    (see the sketch after this list).
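As referenced above, a short illustrative sketch of how args and **kwds are forwarded to func. The function scale_shift and its parameters factor and offset are hypothetical names for this sketch, not part of the API:

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> def scale_shift(s, factor, offset=0) -> ps.Series[np.int64]:
...     # receives each column as a Series (axis=0); factor arrives via
...     # args, offset via **kwds
...     return s * factor + offset
...
>>> df.apply(scale_shift, axis=0, args=(10,), offset=1)
    A   B
0  41  91
1  41  91
2  41  91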
Returns

Series or DataFrame
    Result of applying func along the given axis of the DataFrame.
See also

DataFrame.applymap
    For elementwise operations.
DataFrame.aggregate
    Only perform aggregating type operations.
DataFrame.transform
    Only perform transforming type operations.
Series.apply
    The equivalent function for Series.
Examples

>>> df = ps.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

Using a numpy universal function (in this case the same as np.sqrt(df)):

>>> def sqrt(x) -> ps.Series[float]:
...     return np.sqrt(x)
...
>>> df.apply(sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

You can omit type hints and let pandas-on-Spark infer the type.

>>> df.apply(np.sqrt, axis=0)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

When axis is 1 or 'columns', it applies the function to each row.

>>> def summation(x) -> np.int64:
...     return np.sum(x)
...
>>> df.apply(summation, axis=1)
0    13
1    13
2    13
dtype: int64

You can omit type hints and let pandas-on-Spark infer the type.

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

>>> df.apply(max, axis=1)
0    9
1    9
2    9
dtype: int64

Returning a list-like will result in a Series.

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

To specify the types when axis is 1, use the DataFrame[...] annotation. In this case, the column names are automatically generated.

>>> def identify(x) -> ps.DataFrame[('index', int), [('A', np.int64), ('B', np.int64)]]:
...     return x
...
>>> df.apply(identify, axis=1)
       A  B
index
0      4  9
1      4  9
2      4  9

You can also specify extra arguments.

>>> def plus_two(a, b, c) -> ps.DataFrame[np.int64, [np.int64, np.int64]]:
...     return a + b + c
...
>>> df.apply(plus_two, axis=1, args=(1,), c=3)
   c0  c1
0   8  13
1   8  13
2   8  13
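Tying together the named-column annotation from the notes above, a sketch of an end-to-end run. ps.from_pandas is used here to build the pandas-on-Spark frame, and the printed result is what the annotation leads us to expect (index preserved, columns named a and b), not output verified against a live cluster:

>>> pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> psdf = ps.from_pandas(pdf)
>>> def plus_one(x) -> ps.DataFrame[
...         (pdf.index.name, pdf.index.dtype), zip(pdf.dtypes, pdf.columns)]:
...     return x + 1
...
>>> psdf.apply(plus_one, axis=1)
   a  b
0  2  4
1  3  5
2  4  6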