pyspark.RDD.groupBy
RDD.groupBy(f: Callable[[T], K], numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, Iterable[T]]]
Return an RDD of grouped items.

New in version 0.7.0.

Parameters
- f : function
  a function to compute the key
- numPartitions : int, optional
  the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
  a function to compute the partition index
 
Returns

- RDD
  a new RDD of grouped items
Examples

>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]
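A further sketch, not part of the original docstring, showing the optional numPartitions and partitionFunc arguments; it assumes the same sc SparkContext as above, and the identity partitioner lambda k: k is a hypothetical choice that simply uses the key (0 or 1) as the partition index:

>>> rdd = sc.parallelize(range(10))
>>> grouped = rdd.groupBy(lambda x: x % 2, numPartitions=2, partitionFunc=lambda k: k)
>>> grouped.getNumPartitions()
2
>>> sorted((k, sorted(v)) for k, v in grouped.collect())
[(0, [0, 2, 4, 6, 8]), (1, [1, 3, 5, 7, 9])]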