ChiSqSelector

class pyspark.mllib.feature.ChiSqSelector(numTopFeatures: int = 50, selectorType: str = 'numTopFeatures', percentile: float = 0.1, fpr: float = 0.05, fdr: float = 0.05, fwe: float = 0.05)
Creates a ChiSquared feature selector. The selector supports five selection methods: numTopFeatures, percentile, fpr, fdr, and fwe.

- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number. 
- fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. 
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold. 
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. 
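
The five rules above can be sketched in plain Python, operating on a list of per-feature p-values. This is a minimal illustration and not Spark's implementation: the function `select_features`, its snake_case parameter names, the truncation used for `percentile`, and the sample p-values are all assumptions made for the example.

```python
from typing import List


def select_features(
    p_values: List[float],
    selector_type: str = "numTopFeatures",
    num_top_features: int = 50,
    percentile: float = 0.1,
    fpr: float = 0.05,
    fdr: float = 0.05,
    fwe: float = 0.05,
) -> List[int]:
    """Return the sorted indices of the features kept by the chosen rule."""
    n = len(p_values)
    # Feature indices ordered from most to least significant (smallest p-value first).
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        # Keep a fraction of all features; truncating toward zero is an assumption.
        chosen = ranked[: int(n * percentile)]
    elif selector_type == "fpr":
        chosen = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: find the largest 1-based rank k whose p-value
        # satisfies p_(k) <= fdr * k / n, then keep the k most significant features.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                k = rank
        chosen = ranked[:k]
    elif selector_type == "fwe":
        # The threshold is scaled by 1 / numFeatures.
        chosen = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError(f"Unknown selector type: {selector_type}")
    return sorted(chosen)


# Hypothetical p-values for five features, made up for illustration.
p = [0.001, 0.2, 0.012, 0.04, 0.6]
top2 = select_features(p, "numTopFeatures", num_top_features=2)
half = select_features(p, "percentile", percentile=0.5)
by_fpr = select_features(p, "fpr", fpr=0.05)
by_fdr = select_features(p, "fdr", fdr=0.05)
by_fwe = select_features(p, "fwe", fwe=0.05)
```

With these sample p-values, fpr keeps features 0, 2, and 3, fdr keeps features 0 and 2, and fwe keeps only feature 0, which shows how the three threshold-based rules become progressively stricter at the same nominal level.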
By default, the selection method is numTopFeatures, with the default number of top features set to 50.

New in version 1.4.0.

Examples

>>> # `sc` is assumed to be an active SparkContext.
>>> from pyspark.mllib.feature import ChiSqSelector
>>> from pyspark.mllib.linalg import SparseVector, DenseVector
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])

Methods

- fit(data): Returns a ChiSquared feature selector.
- setFdr(fdr): Sets the FDR, in the range [0.0, 1.0], for feature selection by FDR.
- setFpr(fpr): Sets the FPR, in the range [0.0, 1.0], for feature selection by FPR.
- setFwe(fwe): Sets the FWE, in the range [0.0, 1.0], for feature selection by FWE.
- setNumTopFeatures(numTopFeatures): Sets numTopFeatures for feature selection by number of top features.
- setPercentile(percentile): Sets the percentile, in the range [0.0, 1.0], for feature selection by percentile.
- setSelectorType(selectorType): Sets the selector type of the ChiSqSelector.

Methods Documentation

fit(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint]) → pyspark.mllib.feature.ChiSqSelectorModel

Fits the selector on the data and returns the resulting ChiSqSelectorModel.

New in version 1.4.0.

Parameters
- data : pyspark.RDD of pyspark.mllib.regression.LabeledPoint
  The labeled dataset with categorical features. Real-valued features are treated as categorical, with each distinct value forming its own category, so apply a feature discretizer before using this function.
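
Because every distinct real value becomes its own category, continuous features are usually binned first. The following is a minimal sketch of such a discretizer in plain Python; the helper `bucketize` and the split points are hypothetical and not part of the PySpark API.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucketize(values: Sequence[float], splits: Sequence[float]) -> List[float]:
    """Map each real value to the index of the bucket it falls into.

    `splits` must be sorted in ascending order. A value v lands in bucket k
    when exactly k split points are less than or equal to v.
    """
    return [float(bisect_right(splits, v)) for v in values]


# Hypothetical split points, chosen only for illustration.
splits = [2.5, 5.0, 7.5]
row = [0.0, 9.0, 8.0, 5.0, 2.4]
binned = bucketize(row, splits)  # at most len(splits) + 1 distinct categories
```

In a pipeline, one possible approach is to apply such a helper to each record before calling fit, for example by mapping over the RDD and rebuilding each LabeledPoint from its label and the binned feature values.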
 
setFdr(fdr: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FDR threshold, in the range [0.0, 1.0], for feature selection by false discovery rate. Only applicable when selectorType = “fdr”.

New in version 2.2.0.

setFpr(fpr: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FPR threshold, in the range [0.0, 1.0], for feature selection by false positive rate. Only applicable when selectorType = “fpr”.

New in version 2.1.0.

setFwe(fwe: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FWE threshold, in the range [0.0, 1.0], for feature selection by family-wise error rate. Only applicable when selectorType = “fwe”.

New in version 2.2.0.

setNumTopFeatures(numTopFeatures: int) → pyspark.mllib.feature.ChiSqSelector

Sets numTopFeatures for feature selection by number of top features. Only applicable when selectorType = “numTopFeatures”.

New in version 2.1.0.

setPercentile(percentile: float) → pyspark.mllib.feature.ChiSqSelector

Sets the percentile, in the range [0.0, 1.0], for feature selection by percentile. Only applicable when selectorType = “percentile”.

New in version 2.1.0.

setSelectorType(selectorType: str) → pyspark.mllib.feature.ChiSqSelector

Sets the selector type of the ChiSqSelector. Supported options: “numTopFeatures” (default), “percentile”, “fpr”, “fdr”, “fwe”.

New in version 2.1.0.