PowerIterationClustering

class pyspark.mllib.clustering.PowerIterationClustering
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen [1]. From the abstract: “PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.”

New in version 1.5.0.

[1] Lin, Frank & Cohen, William. (2010). Power Iteration Clustering. http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
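The idea quoted from the abstract can be sketched in plain Python. This illustrates truncated power iteration on a row-normalized affinity matrix; it is not Spark's distributed implementation. The function name `truncated_power_iteration`, the dense list-of-lists input, and the fixed iteration count are assumptions made for the example, and every vertex is assumed to have a positive total similarity.

```python
from typing import List


def truncated_power_iteration(affinity: List[List[float]], num_iters: int = 10) -> List[float]:
    """Return a one-dimensional embedding after a fixed number of power-iteration steps.

    `affinity` is a dense, symmetric, nonnegative matrix with a zero diagonal.
    Hypothetical helper for illustration only.
    """
    n = len(affinity)
    # Degree of each vertex: the sum of its similarities (assumed > 0).
    degrees = [sum(row) for row in affinity]
    # Row-normalize the affinity matrix: W = D^-1 A.
    w = [[affinity[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    # Start from the normalized degree vector.
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(num_iters):
        # One power-iteration step followed by L1 normalization.
        nxt = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in nxt)
        v = [x / norm for x in nxt]
    return v
```

Stopping after a small number of steps is the point of the sketch: entries belonging to the same tightly connected group of vertices stay close to each other, while entries from different groups remain separated, so the vector can be handed to a simple clustering step.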
Methods

train(rdd, k[, maxIterations, initMode]): Train a PowerIterationClusteringModel.

Methods Documentation
classmethod train(rdd: pyspark.rdd.RDD[Tuple[int, int, float]], k: int, maxIterations: int = 100, initMode: str = 'random') → pyspark.mllib.clustering.PowerIterationClusteringModel

Train a PowerIterationClusteringModel.

New in version 1.5.0.

Parameters
rdd : pyspark.RDD
    An RDD of (i, j, s_ij) tuples representing the affinity matrix, which is the matrix A in the PIC paper. The similarity s_ij must be nonnegative. The matrix is symmetric, so s_ij = s_ji. For any (i, j) with nonzero similarity, there should be either (i, j, s_ij) or (j, i, s_ji) in the input. Tuples with i = j are ignored, because s_ii is assumed to be 0.0.
k : int
    Number of clusters.
maxIterations : int, optional
    Maximum number of iterations of the PIC algorithm. (default: 100)
initMode : str, optional
    Initialization mode. This can be either “random” to use a random vector as vertex properties, or “degree” to use normalized sum similarities. (default: “random”)
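A minimal sketch of preparing the input and calling `train` follows. The helper names `affinity_triples` and `cluster_with_pic` are hypothetical, the Gaussian similarity is only one possible choice of nonnegative similarity (the documentation above does not prescribe one), and `sc` is assumed to be an existing SparkContext. Only pairs with i < j are emitted, since one of (i, j, s_ij) or (j, i, s_ji) is enough.

```python
import math
from typing import Dict, List, Sequence, Tuple


def affinity_triples(
    points: Sequence[Sequence[float]], sigma: float = 1.0
) -> List[Tuple[int, int, float]]:
    """Build (i, j, s_ij) tuples for every pair i < j.

    Uses a Gaussian similarity exp(-||x_i - x_j||^2 / (2 * sigma^2)),
    which is an assumption of this example and always nonnegative.
    """
    triples = []
    for i in range(len(points)):
        # Start at i + 1: diagonal entries are ignored by train(), and the
        # matrix is symmetric, so each pair only needs to appear once.
        for j in range(i + 1, len(points)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            triples.append((i, j, math.exp(-sq_dist / (2.0 * sigma ** 2))))
    return triples


def cluster_with_pic(sc, triples, k: int, max_iterations: int = 100,
                     init_mode: str = "random") -> Dict[int, int]:
    """Run PIC on the triples and return a {vertex id: cluster id} mapping."""
    # Imported here so that affinity_triples can be used without Spark installed.
    from pyspark.mllib.clustering import PowerIterationClustering

    rdd = sc.parallelize(triples)
    model = PowerIterationClustering.train(
        rdd, k, maxIterations=max_iterations, initMode=init_mode
    )
    # assignments() yields records with `id` and `cluster` fields.
    return {a.id: a.cluster for a in model.assignments().collect()}
```

With a running SparkContext, a call such as `cluster_with_pic(sc, affinity_triples(points), k=2, init_mode="degree")` would use the degree-based initial vector instead of the random one. Cluster ids are arbitrary labels, so only the grouping is meaningful.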
 