[Spark API, Java Edition] JavaPairRDD: countByValue and countByValueApprox
The countByValue method of JavaPairRDD
Official documentation
/**
* Return the count of each unique value in this RDD as a map of (value, count) pairs. The final
* combine step happens locally on the master, equivalent to running a single reduce task.
*/
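The documented behavior can be pictured as two stages: each partition is counted on its own, and the partial maps are then merged in a single place. The following is a minimal plain-Java sketch of that idea, not Spark's actual implementation. The class CountByValueModel, its methods, the "key,value" strings standing in for Tuple2 elements, and the two-elements-per-partition split are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of countByValue; no Spark involved. All names are hypothetical.
final class CountByValueModel {

    // Counts the elements of one partition, as a single task might.
    static <T> Map<T, Long> countPartition(List<T> partition) {
        Map<T, Long> counts = new HashMap<>();
        for (T element : partition) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    // Merges the per-partition maps in one place, standing in for the
    // final combine step that the documentation places on the master.
    static <T> Map<T, Long> combine(List<Map<T, Long>> partials) {
        Map<T, Long> total = new HashMap<>();
        for (Map<T, Long> partial : partials) {
            for (Map.Entry<T, Long> entry : partial.entrySet()) {
                total.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return total;
    }

    // Counts every partition, then combines the partial results.
    static <T> Map<T, Long> countByValue(List<List<T>> partitions) {
        List<Map<T, Long>> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            partials.add(countPartition(partition));
        }
        return combine(partials);
    }

    public static void main(String[] args) {
        // "key,value" strings stand in for Tuple2 elements (assumed split: 2 per partition).
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("cat,11", "dog,22"),
                Arrays.asList("cat,11", "pig,44"),
                Arrays.asList("duck,55", "cat,66"));
        System.out.println(countByValue(partitions));
    }
}
```

Because the merged map ends up in a single JVM, this style of counting is best suited to data with a modest number of distinct elements.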
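Description
Returns the count of each distinct element in the RDD as a map of (value, count) pairs. For a JavaPairRDD the elements are whole (key, value) tuples, so two pairs are counted together only when both the key and the value are equal.
The result is a java.util.Map held on the driver, not an RDD.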
Function signatures
// Java (on JavaPairRDD<K, V>, where the element type T is Tuple2<K, V>)
public java.util.Map<scala.Tuple2<K, V>, Long> countByValue()
// Scala
def countByValue(): Map[(K, V), Long]
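Example
import java.util.Map;

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CountByValue {
    public static void main(String[] args) {
        // Windows only: point Hadoop at a local installation; adjust the path as needed.
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // Six (key, value) pairs in 3 partitions; ("cat", "11") appears twice.
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("dog", "22"),
                new Tuple2<String, String>("cat", "11"), new Tuple2<String, String>("pig", "44"),
                new Tuple2<String, String>("duck", "55"), new Tuple2<String, String>("cat", "66")), 3);
        // Counts whole (key, value) tuples, not keys; the result map is local to the driver.
        Map<Tuple2<String, String>, Long> value = javaPairRDD1.countByValue();
        for (Map.Entry<Tuple2<String, String>, Long> entry : value.entrySet()) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }
    }
}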
Output
19/03/20 17:15:31 INFO DAGScheduler: Job 0 finished: countByValue at CountByValue.java:23, took 1.093040 s
19/03/20 17:15:31 INFO SparkContext: Invoking stop() from shutdown hook
(duck,55)->1
(dog,22)->1
(pig,44)->1
(cat,66)->1
(cat,11)->2
19/03/20 17:15:31 INFO SparkUI: Stopped Spark web UI at http://10.124.209.6:4040
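The counts in the output above appear in no particular order, and the iteration order of the returned map is best not relied on. Below is a minimal sketch of one way to print the entries deterministically, sorted by count and then by the key's text. SortedCounts and byCountDescending are hypothetical names, and plain strings stand in for the Tuple2 keys so that the snippet runs without Spark.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for printing a countByValue-style map in a stable order.
final class SortedCounts {

    // Returns the entries sorted by count (highest first), ties broken by key text.
    static <T> List<Map.Entry<T, Long>> byCountDescending(Map<T, Long> counts) {
        List<Map.Entry<T, Long>> entries = new ArrayList<>(counts.entrySet());
        Comparator<Map.Entry<T, Long>> byCount = Map.Entry.comparingByValue();
        entries.sort(byCount.reversed().thenComparing(e -> e.getKey().toString()));
        return entries;
    }

    public static void main(String[] args) {
        // Same counts as in the example output; strings stand in for Tuple2 keys.
        Map<String, Long> counts = new HashMap<>();
        counts.put("duck,55", 1L);
        counts.put("dog,22", 1L);
        counts.put("pig,44", 1L);
        counts.put("cat,66", 1L);
        counts.put("cat,11", 2L);
        for (Map.Entry<String, Long> entry : byCountDescending(counts)) {
            System.out.println(entry.getKey() + "->" + entry.getValue());
        }
    }
}
```

Sorting happens on the already collected map, so it adds no Spark job; it only affects how the result is presented.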
The countByValueApprox method of JavaPairRDD
Official documentation
/**
* Approximate version of countByValue().
*
* The confidence is the probability that the error bounds of the result will
* contain the true value. That is, if countApprox were called repeatedly
* with confidence 0.9, we would expect 90% of the results to contain the
* true count. The confidence must be in the range [0,1] or an exception will
* be thrown.
*
* @param timeout maximum time to wait for the job, in milliseconds
* @param confidence the desired statistical confidence in the result
* @return a potentially incomplete result, with error bounds
*/
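Description
An approximate version of countByValue().
The confidence must lie in the range [0, 1]; otherwise an exception is thrown.
timeout: the maximum time to wait for the job, in milliseconds.
confidence: the desired statistical confidence in the result.
Return value: a potentially incomplete result, with error bounds.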
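To make "error bounds" and "confidence" more concrete, here is a minimal plain-Java sketch. Spark's BoundedDouble carries four values (mean, confidence, low, high); the Bounded class below is a hypothetical stand-in with the same four fields so that the snippet runs without Spark. The checkConfidence helper and the numbers in main are also assumptions for illustration and are not Spark output. With Spark on the classpath, the map of estimates would instead come from something like rdd.countByValueApprox(timeout, confidence).getFinalValue().

```java
// Plain-Java illustration of reading a bounded estimate; all names are hypothetical.
final class ApproxCountSketch {

    // Stand-in for Spark's BoundedDouble: an estimate plus its confidence interval.
    static final class Bounded {
        final double mean;        // the estimated count
        final double confidence;  // probability that [low, high] contains the true count
        final double low;         // lower error bound
        final double high;        // upper error bound

        Bounded(double mean, double confidence, double low, double high) {
            this.mean = mean;
            this.confidence = confidence;
            this.low = low;
            this.high = high;
        }

        // True when the interval [low, high] contains the given exact count.
        boolean covers(double trueCount) {
            return low <= trueCount && trueCount <= high;
        }
    }

    // Mirrors the documented rule: a confidence outside [0, 1] is rejected.
    static void checkConfidence(double confidence) {
        if (!(confidence >= 0.0 && confidence <= 1.0)) {
            throw new IllegalArgumentException(
                    "confidence (" + confidence + ") must be in [0,1]");
        }
    }

    public static void main(String[] args) {
        checkConfidence(0.9); // accepted
        // Made-up estimate for one element, for illustration only.
        Bounded estimate = new Bounded(2.0, 0.9, 1.0, 4.0);
        System.out.println("covers 2? " + estimate.covers(2.0));
        System.out.println("covers 5? " + estimate.covers(5.0));
        try {
            checkConfidence(1.5);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Read this way, a confidence of 0.9 means that across many repeated runs roughly 90% of the returned intervals would be expected to contain the true count, which matches the wording of the official comment.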
Function signatures
// Java (on JavaPairRDD<K, V>, where the element type T is Tuple2<K, V>)
public PartialResult<java.util.Map<scala.Tuple2<K, V>, BoundedDouble>> countByValueApprox(long timeout)
public PartialResult<java.util.Map<scala.Tuple2<K, V>, BoundedDouble>> countByValueApprox(long timeout,
        double confidence)
// Scala
def countByValueApprox(timeout: Long): PartialResult[Map[(K, V), BoundedDouble]]
def countByValueApprox(timeout: Long, confidence: Double): PartialResult[Map[(K, V), BoundedDouble]]
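[Copyright notice] This article is original content by a Huawei Cloud community user. When reposting, you must credit the source (Huawei Cloud community) and include the article link, the author, and other basic information; otherwise the author and the community reserve the right to pursue liability. If you find content in this community that appears to be plagiarized, please report it by email with supporting evidence. Once verified, the community will promptly remove the infringing content. Report email:
cloudbbs@huaweicloud.com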