pyspark.pandas.DataFrame

class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
pandas-on-Spark DataFrame that corresponds logically to a pandas DataFrame. It holds a Spark DataFrame internally.

Variables
- _internal – an internal immutable Frame to manage metadata.
Parameters
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame, pandas-on-Spark DataFrame, or pandas-on-Spark Series
  A dict can contain Series, arrays, constants, or list-like objects.
- index : Index or array-like
  Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
- columns : Index or array-like
  Column labels to use for the resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
- dtype : dtype, default None
  Data type to force. Only a single dtype is allowed. If None, infer.
- copy : boolean, default False
  Copy data from inputs. Only affects DataFrame / 2d ndarray input.
.. versionchanged:: 3.4.0

Since 3.4.0, data and index are handled as follows:

1. When data is a distributed dataset (Internal DataFrame / Spark DataFrame / pandas-on-Spark DataFrame / pandas-on-Spark Series), it first parallelizes the index if necessary, and then tries to combine the data and index; note that if data and index do not have the same anchor, compute.ops_on_diff_frames must be turned on.
2. When data is a local dataset (pandas DataFrame / numpy ndarray / list / etc.), it first collects the index to the driver if necessary, and then applies the pandas.DataFrame(…) creation internally.
 
Examples

Constructing DataFrame from a dictionary.

```python
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4
```

Constructing DataFrame from a pandas DataFrame.

```python
>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4
```

Notice that the inferred dtype is int64.

```python
>>> df.dtypes
col1    int64
col2    int64
dtype: object
```

To enforce a single dtype:

```python
>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
```

Constructing DataFrame from a numpy ndarray:

```python
>>> import numpy as np
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
0  1  2  3  4  5
1  6  7  8  9  0
```

Constructing DataFrame from a numpy ndarray with a pandas index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              index=pd.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0
```

Constructing DataFrame from a numpy ndarray with a pandas-on-Spark index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...              index=ps.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0
```

Constructing DataFrame from a pandas DataFrame with a pandas index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=pd.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN
```

Constructing DataFrame from a pandas DataFrame with a pandas-on-Spark index:

```python
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=ps.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN
```

Constructing DataFrame from a Spark DataFrame with a pandas index:

```python
>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.
```

Enable 'compute.ops_on_diff_frames' to combine a Spark DataFrame and a pandas index:

```python
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN
```

Constructing DataFrame from a Spark DataFrame with a pandas-on-Spark index:

```python
>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.
```

Enable 'compute.ops_on_diff_frames' to combine a Spark DataFrame and a pandas-on-Spark index:

```python
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN
```

Methods
- abs() – Return a Series/DataFrame with absolute numeric value of each element.
- add(other) – Get Addition of dataframe and other, element-wise (binary operator +).
- add_prefix(prefix) – Prefix labels with string prefix.
- add_suffix(suffix) – Suffix labels with string suffix.
- agg(func) – Aggregate using one or more operations over the specified axis.
- aggregate(func) – Aggregate using one or more operations over the specified axis.
- align(other[, join, axis, copy]) – Align two objects on their axes with the specified join method.
- all([axis, bool_only, skipna]) – Return whether all elements are True.
- any([axis, bool_only]) – Return whether any element is True.
- append(other[, ignore_index, …]) – Append rows of other to the end of caller, returning a new object.
- apply(func[, axis, args]) – Apply a function along an axis of the DataFrame.
- applymap(func) – Apply a function to a Dataframe elementwise.
- assign(**kwargs) – Assign new columns to a DataFrame.
- astype(dtype) – Cast a pandas-on-Spark object to a specified dtype dtype.
- at_time(time[, asof, axis]) – Select values at particular time of day (example: 9:30AM).
- backfill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- between_time(start_time, end_time[, …]) – Select values between particular times of the day (example: 9:00-9:30 AM).
- bfill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- bool() – Return the bool of a single element in the current object.
- boxplot(**kwds) – Make a box plot of the Series columns.
- clip([lower, upper]) – Trim values at input threshold(s).
- combine_first(other) – Update null elements with value in the same location in other.
- copy([deep]) – Make a copy of this object's indices and data.
- corr([method, min_periods]) – Compute pairwise correlation of columns, excluding NA/null values.
- corrwith(other[, axis, drop, method]) – Compute pairwise correlation.
- count([axis, numeric_only]) – Count non-NA cells for each column.
- cov([min_periods, ddof]) – Compute pairwise covariance of columns, excluding NA/null values.
- cummax([skipna]) – Return cumulative maximum over a DataFrame or Series axis.
- cummin([skipna]) – Return cumulative minimum over a DataFrame or Series axis.
- cumprod([skipna]) – Return cumulative product over a DataFrame or Series axis.
- cumsum([skipna]) – Return cumulative sum over a DataFrame or Series axis.
- describe([percentiles]) – Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
- diff([periods, axis]) – First discrete difference of element.
- div(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- divide(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- dot(other) – Compute the matrix multiplication between the DataFrame and others.
- drop([labels, axis, index, columns]) – Drop specified labels from columns.
- drop_duplicates([subset, keep, inplace, …]) – Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- droplevel(level[, axis]) – Return DataFrame with requested index / column level(s) removed.
- dropna([axis, how, thresh, subset, inplace]) – Remove missing values.
- duplicated([subset, keep]) – Return boolean Series denoting duplicate rows, optionally only considering certain columns.
- eq(other) – Compare if the current value is equal to the other.
- equals(other) – Compare if the current value is equal to the other.
- eval(expr[, inplace]) – Evaluate a string describing operations on DataFrame columns.
- ewm([com, span, halflife, alpha, …]) – Provide exponentially weighted window transformations.
- expanding([min_periods]) – Provide expanding transformations.
- explode(column[, ignore_index]) – Transform each element of a list-like to a row, replicating index values.
- ffill([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- fillna([value, method, axis, inplace, limit]) – Fill NA/NaN values.
- filter([items, like, regex, axis]) – Subset rows or columns of dataframe according to labels in the specified index.
- first(offset) – Select first periods of time series data based on a date offset.
- first_valid_index() – Retrieves the index of the first valid value.
- floordiv(other) – Get Integer division of dataframe and other, element-wise (binary operator //).
- from_dict(data[, orient, dtype, columns]) – Construct DataFrame from dict of array-like or dicts.
- from_records(data[, index, exclude, …]) – Convert structured or recorded ndarray to DataFrame.
- ge(other) – Compare if the current value is greater than or equal to the other.
- get(key[, default]) – Get item from object for given key (DataFrame column, Panel slice, etc.).
- get_dtype_counts() – Return counts of unique dtypes in this object.
- groupby(by[, axis, as_index, dropna]) – Group DataFrame or Series using one or more columns.
- gt(other) – Compare if the current value is greater than the other.
- head([n]) – Return the first n rows.
- hist([bins]) – Draw one histogram of the DataFrame's columns.
- idxmax([axis]) – Return index of first occurrence of maximum over requested axis.
- idxmin([axis]) – Return index of first occurrence of minimum over requested axis.
- info([verbose, buf, max_cols]) – Print a concise summary of a DataFrame.
- insert(loc, column, value[, allow_duplicates]) – Insert column into DataFrame at specified location.
- interpolate([method, limit, …]) – Fill NaN values using an interpolation method.
- isin(values) – Whether each element in the DataFrame is contained in values.
- isna() – Detects missing values for items in the current Dataframe.
- isnull() – Detects missing values for items in the current Dataframe.
- items() – Iterator over (column name, Series) pairs.
- iteritems() – This is an alias of items.
- iterrows() – Iterate over DataFrame rows as (index, Series) pairs.
- itertuples([index, name]) – Iterate over DataFrame rows as namedtuples.
- join(right[, on, how, lsuffix, rsuffix]) – Join columns of another DataFrame.
- kde([bw_method, ind]) – Generate Kernel Density Estimate plot using Gaussian kernels.
- keys() – Return alias for columns.
- kurt([axis, skipna, numeric_only]) – Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
- kurtosis([axis, skipna, numeric_only]) – Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
- last(offset) – Select final periods of time series data based on a date offset.
- last_valid_index() – Return index for last non-NA/null value.
- le(other) – Compare if the current value is less than or equal to the other.
- lt(other) – Compare if the current value is less than the other.
- mad([axis]) – Return the mean absolute deviation of values.
- mask(cond[, other]) – Replace values where the condition is True.
- max([axis, skipna, numeric_only]) – Return the maximum of the values.
- mean([axis, skipna, numeric_only]) – Return the mean of the values.
- median([axis, skipna, numeric_only, accuracy]) – Return the median of the values for the requested axis.
- melt([id_vars, value_vars, var_name, value_name]) – Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
- merge(right[, how, on, left_on, right_on, …]) – Merge DataFrame objects with a database-style join.
- min([axis, skipna, numeric_only]) – Return the minimum of the values.
- mod(other) – Get Modulo of dataframe and other, element-wise (binary operator %).
- mode([axis, numeric_only, dropna]) – Get the mode(s) of each element along the selected axis.
- mul(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- multiply(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- ne(other) – Compare if the current value is not equal to the other.
- nlargest(n, columns[, keep]) – Return the first n rows ordered by columns in descending order.
- notna() – Detects non-missing values for items in the current Dataframe.
- notnull() – Detects non-missing values for items in the current Dataframe.
- nsmallest(n, columns[, keep]) – Return the first n rows ordered by columns in ascending order.
- nunique([axis, dropna, approx, rsd]) – Return number of unique elements in the object.
- pad([axis, inplace, limit]) – Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- pct_change([periods]) – Percentage change between the current and a prior element.
- pipe(func, *args, **kwargs) – Apply func(self, *args, **kwargs).
- pivot([index, columns, values]) – Return reshaped DataFrame organized by given index / column values.
- pivot_table([values, index, columns, …]) – Create a spreadsheet-style pivot table as a DataFrame.
- pop(item) – Return item and drop from frame.
- pow(other) – Get Exponential power of series of dataframe and other, element-wise (binary operator **).
- prod([axis, skipna, numeric_only, min_count]) – Return the product of the values.
- product([axis, skipna, numeric_only, min_count]) – Return the product of the values.
- quantile([q, axis, numeric_only, accuracy]) – Return value at the given quantile.
- query(expr[, inplace]) – Query the columns of a DataFrame with a boolean expression.
- radd(other) – Get Addition of dataframe and other, element-wise (binary operator +).
- rank([method, ascending, numeric_only]) – Compute numerical data ranks (1 through n) along axis.
- rdiv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- reindex([labels, index, columns, axis, …]) – Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
- reindex_like(other[, copy]) – Return a DataFrame with matching indices as other object.
- rename([mapper, index, columns, axis, …]) – Alter axes labels.
- rename_axis([mapper, index, columns, axis, …]) – Set the name of the axis for the index or columns.
- replace([to_replace, value, inplace, limit, …]) – Returns a new DataFrame replacing a value with another value.
- resample(rule[, closed, label, on]) – Resample time-series data.
- reset_index([level, drop, inplace, …]) – Reset the index, or a level of it.
- rfloordiv(other) – Get Integer division of dataframe and other, element-wise (binary operator //).
- rmod(other) – Get Modulo of dataframe and other, element-wise (binary operator %).
- rmul(other) – Get Multiplication of dataframe and other, element-wise (binary operator *).
- rolling(window[, min_periods]) – Provide rolling transformations.
- round([decimals]) – Round a DataFrame to a variable number of decimal places.
- rpow(other) – Get Exponential power of dataframe and other, element-wise (binary operator **).
- rsub(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- rtruediv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- sample([n, frac, replace, random_state, …]) – Return a random sample of items from an axis of object.
- select_dtypes([include, exclude]) – Return a subset of the DataFrame's columns based on the column dtypes.
- sem([axis, skipna, ddof, numeric_only]) – Return unbiased standard error of the mean over requested axis.
- set_index(keys[, drop, append, inplace]) – Set the DataFrame index (row labels) using one or more existing columns.
- shift([periods, fill_value]) – Shift DataFrame by desired number of periods.
- skew([axis, skipna, numeric_only]) – Return unbiased skew normalized by N-1.
- sort_index([axis, level, ascending, …]) – Sort object by labels (along an axis).
- sort_values(by[, ascending, inplace, …]) – Sort by the values along either axis.
- squeeze([axis]) – Squeeze 1 dimensional axis objects into scalars.
- stack() – Stack the prescribed level(s) from columns to index.
- std([axis, skipna, ddof, numeric_only]) – Return sample standard deviation.
- sub(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- subtract(other) – Get Subtraction of dataframe and other, element-wise (binary operator -).
- sum([axis, skipna, numeric_only, min_count]) – Return the sum of the values.
- swapaxes(i, j[, copy]) – Interchange axes and swap values axes appropriately.
- swaplevel([i, j, axis]) – Swap levels i and j in a MultiIndex on a particular axis.
- tail([n]) – Return the last n rows.
- take(indices[, axis]) – Return the elements in the given positional indices along an axis.
- to_clipboard([excel, sep]) – Copy object to the system clipboard.
- to_csv([path, sep, na_rep, columns, header, …]) – Write object to a comma-separated values (csv) file.
- to_delta(path[, mode, partition_cols, index_col]) – Write the DataFrame out as a Delta Lake table.
- to_dict([orient, into]) – Convert the DataFrame to a dictionary.
- to_excel(excel_writer[, sheet_name, na_rep, …]) – Write object to an Excel sheet.
- to_html([buf, columns, col_space, header, …]) – Render a DataFrame as an HTML table.
- to_json([path, compression, num_files, …]) – Convert the object to a JSON string.
- to_latex([buf, columns, col_space, header, …]) – Render an object to a LaTeX tabular environment table.
- to_markdown([buf, mode]) – Print Series or DataFrame in Markdown-friendly format.
- to_numpy() – A NumPy ndarray representing the values in this DataFrame or Series.
- to_orc(path[, mode, partition_cols, index_col]) – Write a DataFrame to the ORC format.
- to_pandas() – Return a pandas DataFrame.
- to_parquet(path[, mode, partition_cols, …]) – Write the DataFrame out as a Parquet file or directory.
- to_records([index, column_dtypes, index_dtypes]) – Convert DataFrame to a NumPy record array.
- to_spark([index_col]) – Spark related features.
- to_spark_io([path, format, mode, …]) – Write the DataFrame out to a Spark data source.
- to_string([buf, columns, col_space, header, …]) – Render a DataFrame to a console-friendly tabular output.
- to_table(name[, format, mode, …]) – Write the DataFrame into a Spark table.
- transform(func[, axis]) – Call func on self producing a Series with transformed values and that has the same length as its input.
- transpose() – Transpose index and columns.
- truediv(other) – Get Floating division of dataframe and other, element-wise (binary operator /).
- truncate([before, after, axis, copy]) – Truncate a Series or DataFrame before and after some index value.
- unstack() – Pivot the (necessarily hierarchical) index labels.
- update(other[, join, overwrite]) – Modify in place using non-NA values from another DataFrame.
- var([axis, ddof, numeric_only]) – Return unbiased variance.
- where(cond[, other, axis]) – Replace values where the condition is False.
- xs(key[, axis, level]) – Return cross-section from the DataFrame.

Attributes
- T – Transpose index and columns.
- at – Access a single value for a row/column label pair.
- axes – Return a list representing the axes of the DataFrame.
- columns – The column labels of the DataFrame.
- dtypes – Return the dtypes in the DataFrame.
- empty – Returns true if the current DataFrame is empty.
- iat – Access a single value for a row/column pair by integer position.
- iloc – Purely integer-location based indexing for selection by position.
- index – The index (row labels) Column of the DataFrame.
- loc – Access a group of rows and columns by label(s) or a boolean Series.
- ndim – Return an int representing the number of array dimensions.
- shape – Return a tuple representing the dimensionality of the DataFrame.
- size – Return an int representing the number of elements in this object.
- style – Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.
- values – Return a Numpy representation of the DataFrame or the Series.
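Most of the methods listed above keep the pandas call signature, so typical pandas-style method chains carry over. A short sketch of such a chain, illustrated here with plain pandas (pandas-on-Spark follows the same semantics for these calls; the column names and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2], 'col2': [3.0, None, 4.0]})

# fillna -> assign -> drop_duplicates, mirroring entries in the method list:
# fill missing values, derive a new column, then deduplicate on col1.
out = (df.fillna(0.0)
         .assign(col3=lambda d: d['col1'] + d['col2'])
         .drop_duplicates(subset=['col1']))

print(out)
# col1 values [1, 2] remain; col3 holds the row-wise sums [4.0, 2.0]
```

With pandas-on-Spark, the same chain would be written against a `ps.DataFrame`, and each step stays lazy until an action materializes the result.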