pyspark.pandas.DataFrame

class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
pandas-on-Spark DataFrame that corresponds logically to a pandas DataFrame. It holds a Spark DataFrame internally.

Variables
- _internal – an internal immutable Frame to manage metadata.
Parameters
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame, pandas-on-Spark DataFrame, or pandas-on-Spark Series
  A dict can contain Series, arrays, constants, or list-like objects.
- index : Index or array-like
  Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
- columns : Index or array-like
  Column labels to use for the resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
- dtype : dtype, default None
  Data type to force. Only a single dtype is allowed. If None, infer.
- copy : boolean, default False
  Copy data from inputs. Only affects DataFrame / 2d ndarray input.
.. versionchanged:: 3.4.0

Since 3.4.0, data and index are handled as follows:

1. When data is a distributed dataset (Internal DataFrame / Spark DataFrame / pandas-on-Spark DataFrame / pandas-on-Spark Series), it first parallelizes the index if necessary, and then tries to combine the data and index; note that if data and index do not have the same anchor, compute.ops_on_diff_frames must be turned on.
2. When data is a local dataset (pandas DataFrame / numpy ndarray / list / etc.), it first collects the index to the driver if necessary, and then applies the pandas.DataFrame(…) creation internally.
 
Examples

Constructing DataFrame from a dictionary.

```python
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4
```

Constructing DataFrame from a pandas DataFrame.

```python
>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4
```

Notice that the inferred dtype is int64.

```python
>>> df.dtypes
col1    int64
col2    int64
dtype: object
```

To enforce a single dtype:

```python
>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
```

Constructing DataFrame from a numpy ndarray:

```python
>>> import numpy as np
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
0  1  2  3  4  5
1  6  7  8  9  0
```

Constructing DataFrame from a numpy ndarray with a pandas index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              index=pd.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0
```

Constructing DataFrame from a numpy ndarray with a pandas-on-Spark index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              index=ps.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0
```

Constructing DataFrame from a pandas DataFrame with a pandas index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=pd.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN
```

Constructing DataFrame from a pandas DataFrame with a pandas-on-Spark index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=ps.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN
```

Constructing DataFrame from a Spark DataFrame with a pandas index:

```python
>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.
```

Enable 'compute.ops_on_diff_frames' to combine a Spark DataFrame and a pandas index:

```python
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN
```

Constructing DataFrame from a Spark DataFrame with a pandas-on-Spark index:

```python
>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.
```

Enable 'compute.ops_on_diff_frames' to combine a Spark DataFrame and a pandas-on-Spark index:

```python
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN
```

Methods
- abs() – Return a Series/DataFrame with absolute numeric value of each element.
- add(other) – Get Addition of dataframe and other, element-wise (binary operator +).
- add_prefix(prefix) – Prefix labels with string prefix.
- add_suffix(suffix) – Suffix labels with string suffix.
- agg(func) – Aggregate using one or more operations over the specified axis.
- aggregate(func) – Aggregate using one or more operations over the specified axis.
- align(other[, join, axis, copy]) – Align two objects on their axes with the specified join method.
- all([axis, bool_only, skipna]) – Return whether all elements are True.
- any([axis, bool_only]) – Return whether any element is True.
- append(other[, ignore_index, …]) – Append rows of other to the end of caller, returning a new object.
- apply(func[, axis, args]) – Apply a function along an axis of the DataFrame.
- applymap(func) – Apply a function to a Dataframe elementwise.
- assign(**kwargs) – Assign new columns to a DataFrame.
- astype(dtype) – Cast a pandas-on-Spark object to a specified dtype dtype.
- at_time(time[, asof, axis]) – Select values at particular time of day (example: 9:30AM).
- backfill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- between_time(start_time, end_time[, …]) – Select values between particular times of the day (example: 9:00-9:30 AM).
- bfill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- bool() – Return the bool of a single element in the current object.
- boxplot(**kwds) – Make a box plot of the Series columns.
- clip([lower, upper]) – Trim values at input threshold(s).
- combine_first(other) – Update null elements with value in the same location in other.
- copy([deep]) – Make a copy of this object's indices and data.
- corr([method, min_periods]) – Compute pairwise correlation of columns, excluding NA/null values.
- corrwith(other[, axis, drop, method]) – Compute pairwise correlation.
- count([axis, numeric_only]) – Count non-NA cells for each column.
- cov([min_periods, ddof]) – Compute pairwise covariance of columns, excluding NA/null values.
- cummax([skipna]) – Return cumulative maximum over a DataFrame or Series axis.
- cummin([skipna]) – Return cumulative minimum over a DataFrame or Series axis.
- cumprod([skipna]) – Return cumulative product over a DataFrame or Series axis.
- cumsum([skipna]) – Return cumulative sum over a DataFrame or Series axis.
- describe([percentiles]) – Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
- diff([periods, axis]) – First discrete difference of element.
- div(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- divide(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- dot(other) – Compute the matrix multiplication between the DataFrame and others.
- drop([labels, axis, index, columns]) – Drop specified labels from columns.
- drop_duplicates([subset, keep, inplace, …]) – Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- droplevel(level[, axis]) – Return DataFrame with requested index / column level(s) removed.
- dropna([axis, how, thresh, subset, inplace]) – Remove missing values.
- duplicated([subset, keep]) – Return boolean Series denoting duplicate rows, optionally only considering certain columns.
- eq(other) – Compare if the current value is equal to the other.
- equals(other) – Compare if the current value is equal to the other.
- eval(expr[, inplace]) – Evaluate a string describing operations on DataFrame columns.
- ewm([com, span, halflife, alpha, …]) – Provide exponentially weighted window transformations.
- expanding([min_periods]) – Provide expanding transformations.
- explode(column[, ignore_index]) – Transform each element of a list-like to a row, replicating index values.
- ffill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- fillna([value, method, axis, inplace, limit]) – Fill NA/NaN values.
- filter([items, like, regex, axis]) – Subset rows or columns of dataframe according to labels in the specified index.
- first(offset) – Select first periods of time series data based on a date offset.
- first_valid_index() – Retrieves the index of the first valid value.
- floordiv(other) – Get Integer division of dataframe and other, element-wise (binary operator //).
- from_dict(data[, orient, dtype, columns]) – Construct DataFrame from dict of array-like or dicts.
- from_records(data[, index, exclude, …]) – Convert structured or recorded ndarray to DataFrame.
- ge(other) – Compare if the current value is greater than or equal to the other.
- get(key[, default]) – Get item from object for given key (DataFrame column, Panel slice, etc.).
- get_dtype_counts() – Return counts of unique dtypes in this object.
- groupby(by[, axis, as_index, dropna]) – Group DataFrame or Series using one or more columns.
- gt(other) – Compare if the current value is greater than the other.
- head([n]) – Return the first n rows.
- hist([bins]) – Draw one histogram of the DataFrame's columns.
- idxmax([axis]) – Return index of first occurrence of maximum over requested axis.
- idxmin([axis]) – Return index of first occurrence of minimum over requested axis.
- info([verbose, buf, max_cols]) – Print a concise summary of a DataFrame.
- insert(loc, column, value[, allow_duplicates]) – Insert column into DataFrame at specified location.
- interpolate([method, limit, …]) – Fill NaN values using an interpolation method.
- isin(values) – Whether each element in the DataFrame is contained in values.
- isna() – Detects missing values for items in the current Dataframe.
- isnull() – Detects missing values for items in the current Dataframe.
- items() – Iterator over (column name, Series) pairs.
- iteritems() – This is an alias of items.
- iterrows() – Iterate over DataFrame rows as (index, Series) pairs.
- itertuples([index, name]) – Iterate over DataFrame rows as namedtuples.
- join(right[, on, how, lsuffix, rsuffix]) – Join columns of another DataFrame.
- kde([bw_method, ind]) – Generate Kernel Density Estimate plot using Gaussian kernels.
- keys() – Return alias for columns.
- kurt([axis, skipna, numeric_only]) – Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
- kurtosis([axis, skipna, numeric_only]) – Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
- last(offset) – Select final periods of time series data based on a date offset.
- last_valid_index() – Return index for last non-NA/null value.
- le(other) – Compare if the current value is less than or equal to the other.
- lt(other) – Compare if the current value is less than the other.
- mad([axis]) – Return the mean absolute deviation of values.
- mask(cond[, other]) – Replace values where the condition is True.
- max([axis, skipna, numeric_only]) – Return the maximum of the values.
- mean([axis, skipna, numeric_only]) – Return the mean of the values.
- median([axis, skipna, numeric_only, accuracy]) – Return the median of the values for the requested axis.
- melt([id_vars, value_vars, var_name, value_name]) – Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
- merge(right[, how, on, left_on, right_on, …]) – Merge DataFrame objects with a database-style join.
- min([axis, skipna, numeric_only]) – Return the minimum of the values.
- mod(other) – Get Modulo of dataframe and other, element-wise (binary operator %).
- mode([axis, numeric_only, dropna]) – Get the mode(s) of each element along the selected axis.
- mul(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- multiply(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- ne(other) – Compare if the current value is not equal to the other.
- nlargest(n, columns[, keep]) – Return the first n rows ordered by columns in descending order.
- notna() – Detects non-missing values for items in the current Dataframe.
- notnull() – Detects non-missing values for items in the current Dataframe.
- nsmallest(n, columns[, keep]) – Return the first n rows ordered by columns in ascending order.
- nunique([axis, dropna, approx, rsd]) – Return number of unique elements in the object.
- pad([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- pct_change([periods]) – Percentage change between the current and a prior element.
- pipe(func, *args, **kwargs) – Apply func(self, *args, **kwargs).
- pivot([index, columns, values]) – Return reshaped DataFrame organized by given index / column values.
- pivot_table([values, index, columns, …]) – Create a spreadsheet-style pivot table as a DataFrame.
- pop(item) – Return item and drop from frame.
- pow(other) – Get Exponential power of series of dataframe and other, element-wise (binary operator **).
- prod([axis, skipna, numeric_only, min_count]) – Return the product of the values.
- product([axis, skipna, numeric_only, min_count]) – Return the product of the values.
- quantile([q, axis, numeric_only, accuracy]) – Return value at the given quantile.
- query(expr[, inplace]) – Query the columns of a DataFrame with a boolean expression.
- radd(other) – Get Addition of dataframe and other, element-wise (binary operator +).
- rank([method, ascending, numeric_only]) – Compute numerical data ranks (1 through n) along axis.
- rdiv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- reindex([labels, index, columns, axis, …]) – Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
- reindex_like(other[, copy]) – Return a DataFrame with matching indices as other object.
- rename([mapper, index, columns, axis, …]) – Alter axes labels.
- rename_axis([mapper, index, columns, axis, …]) – Set the name of the axis for the index or columns.
- replace([to_replace, value, inplace, limit, …]) – Returns a new DataFrame replacing a value with another value.
- resample(rule[, closed, label, on]) – Resample time-series data.
- reset_index([level, drop, inplace, …]) – Reset the index, or a level of it.
- rfloordiv(other) – Get Integer division of dataframe and other, element-wise (binary operator //).
- rmod(other) – Get Modulo of dataframe and other, element-wise (binary operator %).
- rmul(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- rolling(window[, min_periods]) – Provide rolling transformations.
- round([decimals]) – Round a DataFrame to a variable number of decimal places.
- rpow(other) – Get Exponential power of dataframe and other, element-wise (binary operator **).
- rsub(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- rtruediv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- sample([n, frac, replace, random_state, …]) – Return a random sample of items from an axis of object.
- select_dtypes([include, exclude]) – Return a subset of the DataFrame's columns based on the column dtypes.
- sem([axis, skipna, ddof, numeric_only]) – Return unbiased standard error of the mean over requested axis.
- set_index(keys[, drop, append, inplace]) – Set the DataFrame index (row labels) using one or more existing columns.
- shift([periods, fill_value]) – Shift DataFrame by desired number of periods.
- skew([axis, skipna, numeric_only]) – Return unbiased skew normalized by N-1.
- sort_index([axis, level, ascending, …]) – Sort object by labels (along an axis).
- sort_values(by[, ascending, inplace, …]) – Sort by the values along either axis.
- squeeze([axis]) – Squeeze 1 dimensional axis objects into scalars.
- stack() – Stack the prescribed level(s) from columns to index.
- std([axis, skipna, ddof, numeric_only]) – Return sample standard deviation.
- sub(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- subtract(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- sum([axis, skipna, numeric_only, min_count]) – Return the sum of the values.
- swapaxes(i, j[, copy]) – Interchange axes and swap values axes appropriately.
- swaplevel([i, j, axis]) – Swap levels i and j in a MultiIndex on a particular axis.
- tail([n]) – Return the last n rows.
- take(indices[, axis]) – Return the elements in the given positional indices along an axis.
- to_clipboard([excel, sep]) – Copy object to the system clipboard.
- to_csv([path, sep, na_rep, columns, header, …]) – Write object to a comma-separated values (csv) file.
- to_delta(path[, mode, partition_cols, index_col]) – Write the DataFrame out as a Delta Lake table.
- to_dict([orient, into]) – Convert the DataFrame to a dictionary.
- to_excel(excel_writer[, sheet_name, na_rep, …]) – Write object to an Excel sheet.
- to_html([buf, columns, col_space, header, …]) – Render a DataFrame as an HTML table.
- to_json([path, compression, num_files, …]) – Convert the object to a JSON string.
- to_latex([buf, columns, col_space, header, …]) – Render an object to a LaTeX tabular environment table.
- to_markdown([buf, mode]) – Print Series or DataFrame in Markdown-friendly format.
- to_numpy() – A NumPy ndarray representing the values in this DataFrame or Series.
- to_orc(path[, mode, partition_cols, index_col]) – Write a DataFrame to the ORC format.
- to_pandas() – Return a pandas DataFrame.
- to_parquet(path[, mode, partition_cols, …]) – Write the DataFrame out as a Parquet file or directory.
- to_records([index, column_dtypes, index_dtypes]) – Convert DataFrame to a NumPy record array.
- to_spark([index_col]) – Spark related features.
- to_spark_io([path, format, mode, …]) – Write the DataFrame out to a Spark data source.
- to_string([buf, columns, col_space, header, …]) – Render a DataFrame to a console-friendly tabular output.
- to_table(name[, format, mode, …]) – Write the DataFrame into a Spark table.
- transform(func[, axis]) – Call func on self producing a Series with transformed values and that has the same length as its input.
- transpose() – Transpose index and columns.
- truediv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- truncate([before, after, axis, copy]) – Truncate a Series or DataFrame before and after some index value.
- unstack() – Pivot the (necessarily hierarchical) index labels.
- update(other[, join, overwrite]) – Modify in place using non-NA values from another DataFrame.
- var([axis, ddof, numeric_only]) – Return unbiased variance.
- where(cond[, other, axis]) – Replace values where the condition is False.
- xs(key[, axis, level]) – Return cross-section from the DataFrame.

Attributes
- T – Transpose index and columns.
- at – Access a single value for a row/column label pair.
- axes – Return a list representing the axes of the DataFrame.
- columns – The column labels of the DataFrame.
- dtypes – Return the dtypes in the DataFrame.
- empty – Returns true if the current DataFrame is empty.
- iat – Access a single value for a row/column pair by integer position.
- iloc – Purely integer-location based indexing for selection by position.
- index – The index (row labels) Column of the DataFrame.
- loc – Access a group of rows and columns by label(s) or a boolean Series.
- ndim – Return an int representing the number of array dimensions.
- shape – Return a tuple representing the dimensionality of the DataFrame.
- size – Return an int representing the number of elements in this object.
- style – Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.
- values – Return a Numpy representation of the DataFrame or the Series.
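Most of the methods listed above keep the pandas call signature, so typical pandas-style method chains carry over. A short sketch of such a chain, illustrated here with plain pandas (pandas-on-Spark follows the same semantics for these calls; the column names and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2], 'col2': [3.0, None, 4.0]})

# fillna -> assign -> drop_duplicates, mirroring entries in the method list:
# fill missing values, derive a new column, then deduplicate on col1.
out = (df.fillna(0.0)
         .assign(col3=lambda d: d['col1'] + d['col2'])
         .drop_duplicates(subset=['col1']))

print(out)
# col1 values [1, 2] remain; col3 holds the row-wise sums [4.0, 2.0]
```

With pandas-on-Spark, the same chain would be written against a `ps.DataFrame`, and each step stays lazy until an action materializes the result.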