pyspark.RDD.sample
RDD.sample(withReplacement: bool, fraction: float, seed: Optional[int] = None) → pyspark.rdd.RDD[T]
- Return a sampled subset of this RDD.
- New in version 0.7.0.
- Parameters
- withReplacement : bool
- whether elements can be sampled multiple times (replaced when sampled out)
- fraction : float
- expected size of the sample as a fraction of this RDD’s size
- without replacement: probability that each element is chosen; fraction must be in [0, 1]
- with replacement: expected number of times each element is chosen; fraction must be >= 0
- seed : int, optional
- seed for the random number generator
 
- Returns
- RDD[T]
- a new RDD containing the sampled subset of elements
- Notes
- This is not guaranteed to return exactly the specified fraction of the total count of the given RDD.
- Examples
>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
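The parameter descriptions suggest why the sample size varies: each element is treated independently, so the total count is itself random. The following is a minimal pure-Python sketch of that reading, not Spark's implementation. The names `sample_sketch` and `_poisson` are hypothetical, and the sketch assumes a per-element Bernoulli trial without replacement and a per-element Poisson draw with replacement, which matches "probability that each element is chosen" and "expected number of times each element is chosen" respectively.

```python
import math
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def _poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson-distributed count (Knuth's method; suitable for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_sketch(
    items: Iterable[T],
    with_replacement: bool,
    fraction: float,
    seed: Optional[int] = None,
) -> List[T]:
    """Hypothetical local stand-in that mimics the documented meaning of `fraction`."""
    if fraction < 0:
        raise ValueError("fraction must be >= 0")
    if not with_replacement and fraction > 1:
        raise ValueError("fraction must be in [0, 1] without replacement")

    rng = random.Random(seed)
    sampled: List[T] = []
    for item in items:
        if with_replacement:
            # Each element appears `fraction` times on average.
            copies = _poisson(rng, fraction)
        else:
            # Each element is kept with probability `fraction`.
            copies = 1 if rng.random() < fraction else 0
        sampled.extend([item] * copies)
    return sampled


if __name__ == "__main__":
    data = range(100)
    # The size hovers around 10 but is not fixed; it changes with the seed.
    print(len(sample_sketch(data, False, 0.1, seed=81)))
    # With replacement, a fraction above 1 is allowed and duplicates can occur.
    print(len(sample_sketch(data, True, 1.5, seed=81)))
```

Under these assumptions, only the expected size is fraction × count, which is why the example above checks a range rather than an exact value. When an exact number of elements is required, `RDD.takeSample` takes a fixed sample size instead of a fraction.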