HashingTF
- 
class pyspark.mllib.feature.HashingTF(numFeatures: int = 1048576)
- Maps a sequence of terms to their term frequencies using the hashing trick.
- New in version 1.2.0.
- Parameters
- numFeatures : int, optional
- Number of features (default: 2^20 = 1048576).
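The hashing trick described above can be illustrated with a minimal pure-Python sketch. It assumes Python's built-in `hash()` taken modulo the feature count as the index function. The helper names `index_of` and `term_frequencies` are hypothetical and not part of the PySpark API, and a plain dict stands in for a sparse vector.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def index_of(term: Hashable, num_features: int) -> int:
    """Map a term to a bucket in [0, num_features) (assumed: built-in hash)."""
    return hash(term) % num_features


def term_frequencies(
    document: Iterable[Hashable], num_features: int = 1 << 20
) -> Dict[int, float]:
    """Count terms per bucket; terms that collide share a single count."""
    counts = Counter(index_of(term, num_features) for term in document)
    return {index: float(count) for index, count in sorted(counts.items())}


doc = "a a b b c d".split(" ")
tf = term_frequencies(doc, num_features=100)
print(tf)  # bucket indices may differ between runs: str hashes are salted per process
```

With a small feature count, distinct terms are more likely to land in the same bucket and have their counts merged; a larger value makes such collisions less likely at the cost of a wider vector.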
 
- Notes
- The terms must be hashable (they cannot be dict/set/list…).
- Examples
  >>> htf = HashingTF(100)
  >>> doc = "a a b b c d".split(" ")
  >>> htf.transform(doc)
  SparseVector(100, {...})
- Methods
- indexOf(term): Returns the index of the input term.
- setBinary(value): If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).
- transform(document): Transforms an input document (a list of terms) into a term frequency vector, or an RDD of documents into an RDD of term frequency vectors.
- Methods Documentation
setBinary(value: bool) → pyspark.mllib.feature.HashingTF
- If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).
- New in version 2.0.0.
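The effect of the binary flag can be sketched in plain Python. The function `term_frequencies` and its `binary` parameter are hypothetical illustrations rather than PySpark calls, and the built-in `hash()` is an assumed index function.

```python
from typing import Dict, Hashable, Iterable


def term_frequencies(
    document: Iterable[Hashable], num_features: int, binary: bool = False
) -> Dict[int, float]:
    """Hashed term counts; with binary=True every non-zero count becomes 1.0."""
    freq: Dict[int, float] = {}
    for term in document:
        index = hash(term) % num_features  # assumed index function
        freq[index] = 1.0 if binary else freq.get(index, 0.0) + 1.0
    return freq


doc = "a a b b c d".split(" ")
counts = term_frequencies(doc, 100)                 # occurrence counts
presence = term_frequencies(doc, 100, binary=True)  # presence indicators only
```

Both results cover the same buckets; only the stored values differ, which suits models that expect term presence rather than term counts.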
 - 
transform(document: Union[Iterable[Hashable], pyspark.rdd.RDD[Iterable[Hashable]]]) → Union[pyspark.mllib.linalg.Vector, pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]]
- Transforms an input document (a list of terms) into a term frequency vector, or transforms an RDD of documents into an RDD of term frequency vectors.
- New in version 1.2.0.
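The two accepted input shapes can be pictured with a small sketch that needs no Spark installation. Everything here is an assumption for illustration: `LocalCollection` is a hypothetical stand-in for an RDD that only supports `map`, the standalone `transform` function is not the PySpark method, and a dict replaces the sparse vector.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Union


class LocalCollection:
    """Hypothetical stand-in for an RDD: holds items and supports map()."""

    def __init__(self, items: Iterable) -> None:
        self.items: List = list(items)

    def map(self, fn: Callable) -> "LocalCollection":
        return LocalCollection(fn(item) for item in self.items)


def transform(
    document: Union[Iterable[Hashable], LocalCollection], num_features: int = 100
) -> Union[Dict[int, float], LocalCollection]:
    """Transform one document, or each document held in a LocalCollection."""
    if isinstance(document, LocalCollection):
        # A collection of documents yields a collection of frequency maps.
        return document.map(lambda d: transform(d, num_features))
    freq: Dict[int, float] = {}
    for term in document:
        index = hash(term) % num_features  # assumed index function
        freq[index] = freq.get(index, 0.0) + 1.0
    return freq


single = transform("a a b".split(" "))
many = transform(LocalCollection([["a", "b"], ["c", "c", "c"]]))
```

The return type mirrors the input: a single document gives one frequency map, while a collection gives a collection with one map per document.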