KMeansModel
class pyspark.mllib.clustering.KMeansModel(centers: List[VectorLike])
A clustering model derived from the k-means method.

New in version 0.9.0.

Examples

>>> from numpy import array
>>> from pyspark.mllib.linalg import SparseVector
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
>>> model = KMeans.train(
...     sc.parallelize(data), 2, maxIterations=10, initializationMode="random",
...     seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
True
>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
True
>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0
>>> model = KMeans.train(sc.parallelize(data), 2)
>>> sparse_data = [
...     SparseVector(3, {1: 1.0}),
...     SparseVector(3, {1: 1.1}),
...     SparseVector(3, {2: 1.0}),
...     SparseVector(3, {2: 1.1})
... ]
>>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||",
...                      seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
True
>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
True
>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
True
>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
True
>>> isinstance(model.clusterCenters, list)
True
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = KMeansModel.load(sc, path)
>>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
True
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

>>> data = array([-383.1, -382.9, 28.7, 31.2, 366.2, 367.3]).reshape(3, 2)
>>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0,
...     initialModel=KMeansModel([(-1000.0, -1000.0), (5.0, 5.0), (1000.0, 1000.0)]))
>>> model.clusterCenters
[array([-1000., -1000.]), array([ 5., 5.]), array([ 1000., 1000.])]

Methods

- computeCost(rdd)
  Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
- load(sc, path)
  Load a model from the given path.
- predict(x)
  Find the cluster that each of the points belongs to in this model.
- save(sc, path)
  Save this model to the given path.

Attributes

- clusterCenters
  Get the cluster centers, represented as a list of NumPy arrays.
- k
  Total number of clusters.

Methods Documentation
computeCost(rdd: pyspark.rdd.RDD[VectorLike]) → float
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

New in version 1.4.0.

Parameters
- rdd : pyspark.RDD
  The RDD of points to compute the cost on.
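The cost is the within-set sum of squared errors, Σᵢ minⱼ ‖xᵢ − cⱼ‖², so lower values mean tighter clusters, and it is commonly compared across different choices of k. A minimal doctest-style sketch (illustrative, assuming a running SparkContext sc as in the examples above):

>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans
>>> pts = sc.parallelize(array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2))
>>> two = KMeans.train(pts, 2, seed=50)
>>> one = KMeans.train(pts, 1, seed=50)
>>> two.computeCost(pts) <= one.computeCost(pts)  # more clusters never fit worse here
True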
classmethod load(sc: pyspark.context.SparkContext, path: str) → pyspark.mllib.clustering.KMeansModel
Load a model from the given path.

New in version 1.4.0.
predict(x: Union[VectorLike, pyspark.rdd.RDD[VectorLike]]) → Union[int, pyspark.rdd.RDD[int]]
Find the cluster that each of the points belongs to in this model.

New in version 0.9.0.

Parameters
- x : pyspark.mllib.linalg.Vector or pyspark.RDD
  A data point (or an RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).

Returns
- int or pyspark.RDD of int
  Predicted cluster index, or an RDD of predicted cluster indices if the input is an RDD.
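Both call forms in one doctest-style sketch; this assumes the dense-data model trained first in the Examples above, and avoids hardcoding index values since the numbering of clusters depends on initialization:

>>> idx_a = model.predict([0.0, 0.0])   # single point -> cluster index
>>> idx_b = model.predict([9.0, 8.0])
>>> idx_a == idx_b                      # the two well-separated pairs get different clusters
False
>>> model.predict(sc.parallelize([[0.0, 0.0], [9.0, 8.0]])).collect() == [idx_a, idx_b]
True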
save(sc: pyspark.context.SparkContext, path: str) → None
Save this model to the given path.

New in version 1.4.0.
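A save/load round trip, as a minimal sketch; it assumes a running SparkContext sc, a trained model as in the Examples above, and a writable temporary directory (save writes its output beneath the given path):

>>> import tempfile
>>> from shutil import rmtree
>>> from pyspark.mllib.clustering import KMeansModel
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> restored = KMeansModel.load(sc, path)
>>> [list(c) for c in restored.clusterCenters] == [list(c) for c in model.clusterCenters]
True
>>> rmtree(path)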
Attributes Documentation
clusterCenters
Get the cluster centers, represented as a list of NumPy arrays.

New in version 1.0.0.
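For the two well-separated pairs in the dense example above, the fitted centers land at the pair midpoints; the exact values below are illustrative and assume that model:

>>> sorted(tuple(map(float, c)) for c in model.clusterCenters)
[(0.5, 0.5), (8.5, 8.5)]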
k
Total number of clusters.

New in version 1.4.0.
 