HarmonyOS App Development: Recommendation Quality Evaluation and A/B Testing
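Key takeaway: whether a recommender system works well is something you measure, not something you feel. This article walks through the core evaluation metrics for recommender systems (precision, recall, NDCG, coverage, and others) and implements a complete on-device A/B testing framework for HarmonyOS, covering experiment assignment, metric collection, statistical significance testing, and the launch decision.
| Item | Description |
|---|---|
| Development language | ArkTS |
| Key capabilities | Metric computation, A/B experiments, statistical testing, data reporting |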
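1. Background and Motivation
You have shipped a new recommendation algorithm and your boss asks, "How is it performing?" What do you say?
- "It feels like the recommendations are more accurate": not credible.
- "Click-through rate went up": better, but by how much? Is it random fluctuation or a real improvement?
- "The A/B test shows a 12.3% lift in click-through rate, with p < 0.05": now that is a professional answer.
Evaluation is the recommender system's health check, and A/B testing is its randomized controlled trial. Without evaluation, iterating on a recommender is running blindfolded; without A/B testing, you never know whether the new algorithm is really better than the old one.
Running A/B tests on the HarmonyOS device side brings its own challenges and advantages:
Challenge: the on-device sample is tiny (a single device has only one user), so statistical tests require data aggregated across many devices.
Advantage: the device can capture finer-grained behavioral data (such as dwell time and scroll depth), which makes the evaluation more complete.
2. Core Principles
2.1 Recommendation evaluation metric taxonomy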
flowchart TB
classDef primary fill:#4A90D9,stroke:#2C5F8A,color:#fff,font-weight:bold
classDef warning fill:#F5A623,stroke:#C7841A,color:#fff,font-weight:bold
classDef error fill:#D0021B,stroke:#9B0214,color:#fff,font-weight:bold
classDef info fill:#7B68EE,stroke:#5B48CE,color:#fff,font-weight:bold
classDef purple fill:#9B59B6,stroke:#7D3C98,color:#fff,font-weight:bold
A[Recommendation evaluation metrics]:::primary --> B[Accuracy metrics]:::info
A --> C[Ranking quality metrics]:::warning
A --> D[Diversity metrics]:::purple
A --> E[Business metrics]:::error
B --> B1[Precision]:::info
B --> B2[Recall]:::info
B --> B3[F1-Score]:::info
C --> C1[NDCG]:::warning
C --> C2[MRR]:::warning
C --> C3[Hit Rate]:::warning
D --> D1[Coverage]:::purple
D --> D2[Diversity]:::purple
D --> D3[Novelty]:::purple
E --> E1[Click-through rate CTR]:::error
E --> E2[Conversion rate CVR]:::error
E --> E3[Spend per user]:::error
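2.2 Core metrics in detail
Precision and recall
| Metric | Formula | Meaning |
|---|---|---|
| Precision@K | $\frac{\lvert R(u) \cap T(u) \rvert}{\lvert R(u) \rvert}$ | Share of the recommendation list that the user likes |
| Recall@K | $\frac{\lvert R(u) \cap T(u) \rvert}{\lvert T(u) \rvert}$ | Share of the items the user likes that were recommended |
| F1@K | $\frac{2 \cdot \text{Precision@K} \cdot \text{Recall@K}}{\text{Precision@K} + \text{Recall@K}}$ | Harmonic mean of precision and recall |
Here $R(u)$ is the top-K recommendation set for user $u$, and $T(u)$ is the set of items the user actually likes.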
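As a small worked example of these three metrics, the sketch below scores one user's top-5 list against the items that user liked. The item IDs and variable names are made up for illustration; the arithmetic follows the standard definitions (hits over list length for precision, hits over liked items for recall).

```typescript
// Hypothetical data: a top-5 recommendation list and the items the user liked.
const recommended: string[] = ['i1', 'i2', 'i3', 'i4', 'i5']
const liked: string[] = ['i2', 'i5', 'i7', 'i9']

// Count recommended items that the user actually liked.
const likedSet = new Set(liked)
const hits = recommended.filter(item => likedSet.has(item)).length // i2 and i5

const precision = hits / recommended.length // 2 / 5
const recall = hits / liked.length          // 2 / 4
const f1 = precision + recall > 0
  ? (2 * precision * recall) / (precision + recall)
  : 0

console.log({ precision, recall, f1 })
```

Precision penalizes the three wasted slots, recall penalizes the two liked items that never surfaced, and F1 sits between the two.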
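NDCG (Normalized Discounted Cumulative Gain)
NDCG weights each recommendation by its position: items ranked near the top count more.
$$\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$
where $rel_i$ is the relevance of the item at position $i$, and $\mathrm{IDCG@K}$ is the DCG of the ideal ordering.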
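To make the position discount concrete, here is a minimal sketch with binary relevance. It assumes (for illustration only) that the user has exactly two relevant items and both appear in the top-4 list, so the ideal ordering is simply the same gains sorted in descending order.

```typescript
// Binary relevance of the top-4 recommended items, in ranked order (hypothetical data).
const gains: number[] = [1, 0, 1, 0]

// DCG: each gain is discounted by log2(position + 1), with positions starting at 1.
function dcg(rels: number[]): number {
  return rels.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0)
}

const actualDcg = dcg(gains)                             // 1/log2(2) + 1/log2(4)
const idealDcg = dcg([...gains].sort((a, b) => b - a))   // gains reordered as [1, 1, 0, 0]
const ndcg = idealDcg > 0 ? actualDcg / idealDcg : 0

console.log(actualDcg.toFixed(4), idealDcg.toFixed(4), ndcg.toFixed(4))
```

Moving the second relevant item from position 3 to position 2 would raise NDCG to 1, which is exactly the ranking sensitivity that precision and recall lack.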
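Coverage and diversity
- Coverage: the share of the catalog that has been recommended at least once; it measures how well the system surfaces long-tail content
- Diversity: how dissimilar the items within one recommendation list are from each other
- Novelty: inversely related to the popularity of the recommended items, so less popular items score higher (the implementation in section 3.1 measures it as the mean self-information, $-\log_2 p$, of the recommended items)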
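The two simplest of these can be sketched in a few lines. The catalog size, lists, and tags below are hypothetical; diversity is shown as the Jaccard distance between two items' tag sets, which is one common choice rather than the only one.

```typescript
// Hypothetical catalog of 10 items and two users' recommendation lists.
const catalogSize = 10
const lists: string[][] = [['i1', 'i2', 'i3'], ['i2', 'i3', 'i4']]

// Coverage: distinct recommended items divided by catalog size.
const distinct = new Set<string>()
lists.forEach(list => list.forEach(item => distinct.add(item)))
const coverage = distinct.size / catalogSize

// Jaccard distance between two tag sets: 1 - |A ∩ B| / |A ∪ B|.
// Two empty tag sets are treated as similarity 0, so distance 1.
function jaccardDistance(a: string[], b: string[]): number {
  const setA = new Set(a)
  const setB = new Set(b)
  let intersection = 0
  setA.forEach(tag => { if (setB.has(tag)) intersection++ })
  const union = setA.size + setB.size - intersection
  const similarity = union > 0 ? intersection / union : 0
  return 1 - similarity
}

const distance = jaccardDistance(['tech', 'coding'], ['tech', 'music'])
console.log(coverage, distance.toFixed(3))
```

List-level diversity is then the average of this distance over all item pairs in the list.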
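2.3 How A/B testing works
The core idea of A/B testing: randomly split users into a treatment group and a control group, change exactly one variable, and compare the outcomes.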
flowchart LR
classDef primary fill:#4A90D9,stroke:#2C5F8A,color:#fff,font-weight:bold
classDef warning fill:#F5A623,stroke:#C7841A,color:#fff,font-weight:bold
classDef error fill:#D0021B,stroke:#9B0214,color:#fff,font-weight:bold
classDef info fill:#7B68EE,stroke:#5B48CE,color:#fff,font-weight:bold
classDef purple fill:#9B59B6,stroke:#7D3C98,color:#fff,font-weight:bold
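A[All users]:::primary --> B[Control group A]:::info
A --> C[Treatment group B]:::warning
B --> D[Old recommendation algorithm]:::info
C --> E[New recommendation algorithm]:::warning
D --> F[Metric collection]:::error
E --> G[Metric collection]:::error
F --> H[Statistical significance test]:::purple
G --> H
H --> I{p < 0.05?}:::purple
I -->|Yes| J[Launch new algorithm]:::primary
I -->|No| K[Keep observing or abandon]:::error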
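2.4 Statistical significance testing
The most common test for A/B experiments is the two-sample t-test:
$$t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}$$
where $\bar{x}$, $s^2$, and $n$ are the sample mean, sample variance, and sample size of the control group (A) and the treatment group (B).
When the p-value is below 0.05, the difference is considered statistically significant.
3. Hands-on Code
3.1 Recommendation metrics calculator
We start with the core metric computation module.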
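The t statistic can be checked by hand on a toy data set. The sketch below uses the unpooled (Welch) form of the two-sample test; the per-user CTR values are invented for illustration and chosen so the arithmetic is easy to follow.

```typescript
// Sample mean.
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs)
  return xs.reduce((s, x) => s + (x - m) * (x - m), 0) / (xs.length - 1)
}

// Welch's t statistic: difference in means over the unpooled standard error.
function welchT(control: number[], treatment: number[]): number {
  const se = Math.sqrt(
    sampleVariance(control) / control.length +
    sampleVariance(treatment) / treatment.length
  )
  return se > 0 ? (mean(treatment) - mean(control)) / se : 0
}

// Hypothetical per-user CTR values.
const controlCtr = [0.10, 0.12, 0.11, 0.13, 0.14]   // mean 0.12
const treatmentCtr = [0.14, 0.15, 0.13, 0.16, 0.17] // mean 0.15
const t = welchT(controlCtr, treatmentCtr)
console.log(t.toFixed(2))
```

Both groups have variance 0.00025, so the standard error is 0.01 and t comes out to 3.0. With only five users per group the t distribution has about 8 degrees of freedom, where the two-sided 5% critical value is roughly 2.31, so this toy difference would count as significant.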
// RecommendationMetrics.ets - 推荐评估指标计算器
/**
* 评估数据点
*/
export interface EvaluationDataPoint {
userId: string
recommendedItems: string[] // 推荐列表
relevantItems: string[] // 实际相关(用户喜欢)的物品
ratings?: Map<string, number> // 物品评分(用于NDCG计算)
}
/**
* 评估结果
*/
export interface EvaluationResult {
precision: number
recall: number
f1Score: number
ndcg: number
mrr: number
hitRate: number
coverage: number
diversity: number
novelty: number
sampleSize: number
}
/**
* 推荐评估指标计算器
* 提供离线和在线评估指标的计算能力
*/
export class RecommendationMetrics {
// 全局物品流行度(用于新颖性计算)
private itemPopularity: Map<string, number> = new Map()
// 物品特征(用于多样性计算)
private itemFeatures: Map<string, string[]> = new Map()
/**
* 设置物品流行度数据
*/
setItemPopularity(popularity: Map<string, number>): void {
this.itemPopularity = popularity
}
/**
* 设置物品特征数据(用于多样性计算)
*/
setItemFeatures(features: Map<string, string[]>): void {
this.itemFeatures = features
}
/**
* 计算Precision@K
* 推荐列表中用户实际喜欢的物品比例
*/
static precisionAtK(dataPoint: EvaluationDataPoint, k: number = 10): number {
const recommended = dataPoint.recommendedItems.slice(0, k)
const relevantSet = new Set(dataPoint.relevantItems)
let hitCount = 0
for (const item of recommended) {
if (relevantSet.has(item)) hitCount++
}
return recommended.length > 0 ? hitCount / recommended.length : 0
}
/**
* 计算Recall@K
* 用户实际喜欢的物品被推荐出的比例
*/
static recallAtK(dataPoint: EvaluationDataPoint, k: number = 10): number {
const recommended = new Set(dataPoint.recommendedItems.slice(0, k))
const relevant = dataPoint.relevantItems
if (relevant.length === 0) return 0
let hitCount = 0
for (const item of relevant) {
if (recommended.has(item)) hitCount++
}
return hitCount / relevant.length
}
/**
* 计算F1-Score@K
*/
static f1ScoreAtK(dataPoint: EvaluationDataPoint, k: number = 10): number {
const precision = RecommendationMetrics.precisionAtK(dataPoint, k)
const recall = RecommendationMetrics.recallAtK(dataPoint, k)
if (precision + recall === 0) return 0
return 2 * precision * recall / (precision + recall)
}
/**
* 计算NDCG@K(归一化折损累积增益)
* 考虑推荐位置的权重
*/
static ndcgAtK(dataPoint: EvaluationDataPoint, k: number = 10): number {
const recommended = dataPoint.recommendedItems.slice(0, k)
const relevantSet = new Set(dataPoint.relevantItems)
// 计算DCG
let dcg = 0
for (let i = 0; i < recommended.length; i++) {
const relevance = relevantSet.has(recommended[i]) ? 1 : 0
// 使用评分(如果有)
const rating = dataPoint.ratings?.get(recommended[i])
const rel = rating !== undefined ? rating : relevance
dcg += rel / Math.log2(i + 2) // i+2 因为位置从1开始
}
// Compute IDCG (ideal ordering): collect the gain of every relevant item,
// sort descending, and only then keep the top K. Truncating before the sort
// could drop the highest-rated items and understate IDCG.
const idealRelevances: number[] = []
for (const item of dataPoint.relevantItems) {
const rating = dataPoint.ratings?.get(item)
idealRelevances.push(rating !== undefined ? rating : 1)
}
idealRelevances.sort((a, b) => b - a)
const topIdeal = idealRelevances.slice(0, k)
let idcg = 0
for (let i = 0; i < topIdeal.length; i++) {
idcg += topIdeal[i] / Math.log2(i + 2)
}
return idcg > 0 ? dcg / idcg : 0
}
/**
* 计算MRR(平均倒数排名)
* 第一个相关物品出现的位置的倒数
*/
static mrr(dataPoint: EvaluationDataPoint): number {
const relevantSet = new Set(dataPoint.relevantItems)
for (let i = 0; i < dataPoint.recommendedItems.length; i++) {
if (relevantSet.has(dataPoint.recommendedItems[i])) {
return 1 / (i + 1)
}
}
return 0
}
/**
* 计算Hit Rate@K
* 至少有一个相关物品被推荐的用户比例
*/
static hitRateAtK(dataPoint: EvaluationDataPoint, k: number = 10): number {
const recommended = new Set(dataPoint.recommendedItems.slice(0, k))
const relevantSet = new Set(dataPoint.relevantItems)
for (const item of relevantSet) {
if (recommended.has(item)) return 1
}
return 0
}
/**
* 计算覆盖率
* 被推荐过的物品占总物品的比例
*/
calculateCoverage(
allDataPoints: EvaluationDataPoint[],
totalItems: number
): number {
const recommendedItems = new Set<string>()
for (const dp of allDataPoints) {
for (const item of dp.recommendedItems) {
recommendedItems.add(item)
}
}
return totalItems > 0 ? recommendedItems.size / totalItems : 0
}
/**
* 计算推荐列表多样性
* 使用物品特征的平均不相似度
*/
calculateDiversity(
dataPoint: EvaluationDataPoint,
k: number = 10
): number {
const recommended = dataPoint.recommendedItems.slice(0, k)
if (recommended.length < 2) return 0
let totalDissimilarity = 0
let pairCount = 0
for (let i = 0; i < recommended.length; i++) {
for (let j = i + 1; j < recommended.length; j++) {
const featuresA = this.itemFeatures.get(recommended[i]) || []
const featuresB = this.itemFeatures.get(recommended[j]) || []
// Jaccard距离 = 1 - Jaccard相似度
const similarity = this.jaccardSimilarity(featuresA, featuresB)
totalDissimilarity += (1 - similarity)
pairCount++
}
}
return pairCount > 0 ? totalDissimilarity / pairCount : 0
}
/**
* 计算推荐列表新颖性
* 推荐物品的平均流行度的倒数
*/
calculateNovelty(dataPoint: EvaluationDataPoint, k: number = 10): number {
const recommended = dataPoint.recommendedItems.slice(0, k)
if (recommended.length === 0) return 0
let totalSelfInfo = 0
const totalUsers = Array.from(this.itemPopularity.values())
.reduce((sum, count) => sum + count, 0)
for (const item of recommended) {
const popularity = this.itemPopularity.get(item) || 1
// 自信息:-log(p),p为物品被推荐的概率
const probability = popularity / Math.max(totalUsers, 1)
totalSelfInfo += -Math.log2(Math.max(probability, 0.0001))
}
return totalSelfInfo / recommended.length
}
/**
* 批量评估:计算所有指标的平均值
*/
evaluate(
allDataPoints: EvaluationDataPoint[],
k: number = 10,
totalItems: number = 100
): EvaluationResult {
if (allDataPoints.length === 0) {
return {
precision: 0, recall: 0, f1Score: 0, ndcg: 0,
mrr: 0, hitRate: 0, coverage: 0, diversity: 0,
novelty: 0, sampleSize: 0,
}
}
let sumPrecision = 0, sumRecall = 0, sumF1 = 0
let sumNdcg = 0, sumMrr = 0, sumHitRate = 0
let sumDiversity = 0, sumNovelty = 0
for (const dp of allDataPoints) {
sumPrecision += RecommendationMetrics.precisionAtK(dp, k)
sumRecall += RecommendationMetrics.recallAtK(dp, k)
sumF1 += RecommendationMetrics.f1ScoreAtK(dp, k)
sumNdcg += RecommendationMetrics.ndcgAtK(dp, k)
sumMrr += RecommendationMetrics.mrr(dp)
sumHitRate += RecommendationMetrics.hitRateAtK(dp, k)
sumDiversity += this.calculateDiversity(dp, k)
sumNovelty += this.calculateNovelty(dp, k)
}
const n = allDataPoints.length
return {
precision: sumPrecision / n,
recall: sumRecall / n,
f1Score: sumF1 / n,
ndcg: sumNdcg / n,
mrr: sumMrr / n,
hitRate: sumHitRate / n,
coverage: this.calculateCoverage(allDataPoints, totalItems),
diversity: sumDiversity / n,
novelty: sumNovelty / n,
sampleSize: n,
}
}
/**
* Jaccard相似度辅助方法
*/
private jaccardSimilarity(setA: string[], setB: string[]): number {
const aSet = new Set(setA)
const bSet = new Set(setB)
let intersection = 0
aSet.forEach(item => {
if (bSet.has(item)) intersection++
})
const union = aSet.size + bSet.size - intersection
return union > 0 ? intersection / union : 0
}
}
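3.2 A/B testing framework
This section implements the A/B testing framework itself: experiment assignment, metric collection, and statistical testing.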
// ABTestFramework.ets - A/B测试框架
/**
* 实验配置
*/
export interface ExperimentConfig {
experimentId: string // 实验ID
name: string // 实验名称
description: string // 实验描述
variants: VariantConfig[] // 实验变体配置
metrics: string[] // 关注的指标
sampleRatio: number // 采样比例(0-1)
minSampleSize: number // 最小样本量
confidenceLevel: number // 置信水平(0.95 = 95%)
startDate: number // 开始时间
endDate: number // 结束时间
}
/**
* 实验变体配置
*/
export interface VariantConfig {
variantId: string // 变体ID
name: string // 变体名称
ratio: number // 流量比例(0-1)
isControl: boolean // 是否为对照组
params: Record<string, Object> // 变体参数
}
/**
* 实验指标数据
*/
export interface ExperimentMetrics {
variantId: string
userId: string
metricName: string
value: number
timestamp: number
}
/**
* 实验结果
*/
export interface ExperimentResult {
experimentId: string
metricName: string
controlMean: number
treatmentMean: number
controlStd: number
treatmentStd: number
controlSampleSize: number
treatmentSampleSize: number
absoluteDiff: number
relativeDiff: number
tStatistic: number
pValue: number
isSignificant: boolean
confidenceInterval: [number, number]
}
/**
* A/B测试框架
* 提供实验分组、指标采集、统计检验的完整能力
*/
export class ABTestFramework {
// 活跃实验列表
private experiments: Map<string, ExperimentConfig> = new Map()
// 用户分组映射:userId -> { experimentId -> variantId }
private userAssignments: Map<string, Map<string, string>> = new Map()
// 指标数据存储
private metricsStore: ExperimentMetrics[] = []
// 最大存储量
private maxMetricsStore: number = 100000
/**
* 注册实验
*/
registerExperiment(config: ExperimentConfig): void {
// 验证变体比例之和为1
const totalRatio = config.variants.reduce((sum, v) => sum + v.ratio, 0)
if (Math.abs(totalRatio - 1.0) > 0.01) {
console.error(`[ABTest] 实验变体比例之和必须为1,当前: ${totalRatio}`)
return
}
// 验证有且仅有一个对照组
const controlCount = config.variants.filter(v => v.isControl).length
if (controlCount !== 1) {
console.error(`[ABTest] 实验必须有且仅有一个对照组,当前: ${controlCount}`)
return
}
this.experiments.set(config.experimentId, config)
console.info(`[ABTest] 注册实验: ${config.name} (${config.experimentId})`)
}
/**
* 为用户分配实验变体
* 使用一致性哈希确保同一用户始终分配到同一变体
*/
assignVariant(userId: string, experimentId: string): string {
// 检查缓存
const userExps = this.userAssignments.get(userId)
if (userExps && userExps.has(experimentId)) {
return userExps.get(experimentId)!
}
const config = this.experiments.get(experimentId)
if (!config) {
console.warn(`[ABTest] 未找到实验: ${experimentId}`)
return ''
}
// 检查实验是否在有效期内
const now = Date.now()
if (now < config.startDate || now > config.endDate) {
console.warn(`[ABTest] 实验不在有效期内: ${experimentId}`)
return ''
}
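// Sampling check: hashUserId() returns a value in [0, 1), so users with
// hash < sampleRatio take part (sampleRatio = 0 excludes everyone)
const userHash = this.hashUserId(userId, experimentId)
if (userHash >= config.sampleRatio) {
return '' // not in the experiment
}
// Assign a variant by cumulative traffic ratio; fall back to the last variant
// in case floating-point rounding leaves the cumulative sum slightly below 1
const variantHash = this.hashUserId(userId, experimentId + '_variant')
let cumulativeRatio = 0
let assignedVariant = config.variants[config.variants.length - 1].variantId
for (const variant of config.variants) {
cumulativeRatio += variant.ratio
if (variantHash < cumulativeRatio) {
assignedVariant = variant.variantId
break
}
}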
// 缓存分配结果
if (!this.userAssignments.has(userId)) {
this.userAssignments.set(userId, new Map())
}
this.userAssignments.get(userId)!.set(experimentId, assignedVariant)
console.info(`[ABTest] 用户 ${userId} 分配到实验 ${experimentId} 的变体 ${assignedVariant}`)
return assignedVariant
}
/**
* 获取用户的实验变体参数
*/
getVariantParams(userId: string, experimentId: string): Record<string, Object> | null {
const variantId = this.assignVariant(userId, experimentId)
if (!variantId) return null
const config = this.experiments.get(experimentId)
const variant = config?.variants.find(v => v.variantId === variantId)
return variant?.params || null
}
/**
* 记录实验指标
*/
recordMetric(
userId: string,
experimentId: string,
metricName: string,
value: number
): void {
const variantId = this.assignVariant(userId, experimentId)
if (!variantId) return
const metric: ExperimentMetrics = {
variantId,
userId,
metricName,
value,
timestamp: Date.now(),
}
this.metricsStore.push(metric)
// 超出存储上限时淘汰旧数据
if (this.metricsStore.length > this.maxMetricsStore) {
this.metricsStore = this.metricsStore.slice(-this.maxMetricsStore / 2)
}
}
/**
* 分析实验结果
* 执行双样本t检验
*/
analyzeExperiment(experimentId: string, metricName: string): ExperimentResult | null {
const config = this.experiments.get(experimentId)
if (!config) return null
// 找出对照组和实验组
const controlVariant = config.variants.find(v => v.isControl)!
const treatmentVariants = config.variants.filter(v => !v.isControl)
if (treatmentVariants.length === 0) return null
// 取第一个实验组(简化处理,多实验组需多次比较)
const treatmentVariant = treatmentVariants[0]
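// Collect metric data, averaging repeated records per user so that each user
// contributes exactly one independent observation to the t-test
const controlByUser: Map<string, number[]> = new Map()
const treatmentByUser: Map<string, number[]> = new Map()
for (const metric of this.metricsStore) {
if (metric.metricName !== metricName) continue
let target: Map<string, number[]> | null = null
if (metric.variantId === controlVariant.variantId) {
target = controlByUser
} else if (metric.variantId === treatmentVariant.variantId) {
target = treatmentByUser
}
if (target === null) continue
const userValues = target.get(metric.userId) || []
userValues.push(metric.value)
target.set(metric.userId, userValues)
}
const controlValues: number[] = Array.from(controlByUser.values()).map(v => this.mean(v))
const treatmentValues: number[] = Array.from(treatmentByUser.values()).map(v => this.mean(v))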
if (controlValues.length < 2 || treatmentValues.length < 2) {
console.warn(`[ABTest] 样本量不足: 对照组${controlValues.length}, 实验组${treatmentValues.length}`)
return null
}
// 计算统计量
const controlMean = this.mean(controlValues)
const treatmentMean = this.mean(treatmentValues)
const controlStd = this.stdDev(controlValues)
const treatmentStd = this.stdDev(treatmentValues)
const controlN = controlValues.length
const treatmentN = treatmentValues.length
// Welch's two-sample t-test: the standard error below uses unpooled variances, despite the variable name
const pooledStdError = Math.sqrt(
(controlStd * controlStd / controlN) +
(treatmentStd * treatmentStd / treatmentN)
)
const tStatistic = pooledStdError > 0
? (treatmentMean - controlMean) / pooledStdError
: 0
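// Approximate two-sided p-value, using the standard normal in place of the t distribution (reasonable for large samples, optimistic for small ones)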
const pValue = this.approximatePValue(tStatistic, controlN + treatmentN - 2)
// 置信区间
const zScore = this.getZScore(config.confidenceLevel)
const marginOfError = zScore * pooledStdError
const diff = treatmentMean - controlMean
const confidenceInterval: [number, number] = [
diff - marginOfError,
diff + marginOfError,
]
const isSignificant = pValue < (1 - config.confidenceLevel)
return {
experimentId,
metricName,
controlMean,
treatmentMean,
controlStd,
treatmentStd,
controlSampleSize: controlN,
treatmentSampleSize: treatmentN,
absoluteDiff: diff,
relativeDiff: controlMean !== 0 ? diff / Math.abs(controlMean) : 0,
tStatistic,
pValue,
isSignificant,
confidenceInterval,
}
}
/**
* 获取实验的所有指标分析结果
*/
analyzeAllMetrics(experimentId: string): ExperimentResult[] {
const config = this.experiments.get(experimentId)
if (!config) return []
const results: ExperimentResult[] = []
for (const metricName of config.metrics) {
const result = this.analyzeExperiment(experimentId, metricName)
if (result) results.push(result)
}
return results
}
// ======== 统计工具方法 ========
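/**
* Deterministic hash function (a simple 31-based string hash, not consistent
* hashing in the distributed-systems sense)
* The same user and salt always map to the same value in [0, 1)
*/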
private hashUserId(userId: string, salt: string): number {
let hash = 0
const str = userId + salt
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i)
hash = ((hash << 5) - hash) + char
hash = hash & hash // 转为32位整数
}
return (Math.abs(hash) % 10000) / 10000 // 归一化到0-1
}
/**
* 计算均值
*/
private mean(values: number[]): number {
if (values.length === 0) return 0
return values.reduce((sum, v) => sum + v, 0) / values.length
}
/**
* 计算标准差
*/
private stdDev(values: number[]): number {
if (values.length < 2) return 0
const avg = this.mean(values)
const variance = values.reduce((sum, v) => sum + (v - avg) ** 2, 0) / (values.length - 1)
return Math.sqrt(variance)
}
/**
* 近似p值
* 使用正态分布近似t分布
*/
private approximatePValue(tStat: number, df: number): number {
// 使用近似公式:p ≈ 2 * (1 - Φ(|t|))
// Φ为标准正态分布的CDF
const z = Math.abs(tStat)
// 使用近似公式计算标准正态CDF
const p = 2 * (1 - this.normalCDF(z))
return p
}
/**
* 标准正态分布CDF近似
* 使用Abramowitz and Stegun近似公式
*/
private normalCDF(z: number): number {
if (z < 0) return 1 - this.normalCDF(-z)
const b0 = 0.2316419
const b1 = 0.319381530
const b2 = -0.356563782
const b3 = 1.781477937
const b4 = -1.821255978
const b5 = 1.330274429
const t = 1 / (1 + b0 * z)
const t2 = t * t
const t3 = t2 * t
const t4 = t3 * t
const t5 = t4 * t
const pdf = Math.exp(-z * z / 2) / Math.sqrt(2 * Math.PI)
const cdf = 1 - pdf * (b1 * t + b2 * t2 + b3 * t3 + b4 * t4 + b5 * t5)
return Math.min(cdf, 1.0)
}
/**
* 获取Z分数(置信水平对应的Z值)
*/
private getZScore(confidenceLevel: number): number {
const zScores: Record<number, number> = {
0.90: 1.645,
0.95: 1.960,
0.99: 2.576,
}
return zScores[confidenceLevel] || 1.960
}
/**
* 计算所需最小样本量
* 基于效应量和统计功效
*/
calculateMinSampleSize(
baselineRate: number,
minimumDetectableEffect: number,
confidenceLevel: number = 0.95,
statisticalPower: number = 0.8
): number {
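const zAlpha = this.getZScore(confidenceLevel) // two-sided critical value
// Power needs the one-sided normal quantile; getZScore() only holds two-sided
// values and would silently fall back to 1.960 for a power of 0.8
const powerZScores: Record<number, number> = {
0.80: 0.8416,
0.90: 1.2816,
0.95: 1.6449,
}
const zBeta = powerZScores[statisticalPower] || 0.8416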
const p1 = baselineRate
const p2 = baselineRate * (1 + minimumDetectableEffect)
const pAvg = (p1 + p2) / 2
const numerator = Math.sqrt(2 * pAvg * (1 - pAvg)) * (zAlpha + zBeta)
const denominator = Math.abs(p2 - p1)
if (denominator === 0) return Infinity
const sampleSize = Math.ceil(Math.pow(numerator / denominator, 2))
return sampleSize
}
}
3.3 Evaluation and A/B testing page
This page brings the evaluation metrics and the A/B test results together in one visual view.
// EvaluationPage.ets - 推荐评估与A/B测试页面
import { RecommendationMetrics, EvaluationDataPoint, EvaluationResult } from './RecommendationMetrics'
import { ABTestFramework, ExperimentConfig, ExperimentResult } from './ABTestFramework'
@Entry
@Component
struct EvaluationPage {
private metrics: RecommendationMetrics = new RecommendationMetrics()
private abTest: ABTestFramework = new ABTestFramework()
@State evalResultsA: EvaluationResult | null = null
@State evalResultsB: EvaluationResult | null = null
@State abResults: ExperimentResult[] = []
@State isEvaluating: boolean = false
@State activeTab: number = 0 // 0-指标评估 1-A/B测试
// 模拟评估数据
private mockDataA: EvaluationDataPoint[] = []
private mockDataB: EvaluationDataPoint[] = []
aboutToAppear(): void {
this.setupMockData()
this.setupABTest()
}
private setupMockData(): void {
// 模拟算法A的评估数据(旧算法)
const items = ['i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7', 'i8', 'i9', 'i10']
const users = ['u1', 'u2', 'u3', 'u4', 'u5', 'u6', 'u7', 'u8', 'u9', 'u10']
for (const userId of users) {
// 算法A:推荐准确率较低
const shuffled = [...items].sort(() => Math.random() - 0.5)
const recommended = shuffled.slice(0, 6)
// 用户实际喜欢(随机3-5个)
const relevantCount = 3 + Math.floor(Math.random() * 3)
const relevant = [...items].sort(() => Math.random() - 0.5).slice(0, relevantCount)
this.mockDataA.push({ userId, recommendedItems: recommended, relevantItems: relevant })
// 算法B:推荐准确率较高(有60%概率命中相关物品)
const recommendedB: string[] = []
for (let i = 0; i < 6; i++) {
if (Math.random() < 0.6 && i < relevant.length) {
recommendedB.push(relevant[i])
} else {
const otherItems = items.filter(it => !relevant.includes(it) && !recommendedB.includes(it))
recommendedB.push(otherItems[0] || items[i])
}
}
this.mockDataB.push({ userId, recommendedItems: recommendedB, relevantItems: relevant })
}
// 设置物品流行度
const popularity = new Map<string, number>()
items.forEach((item, idx) => popularity.set(item, 10 + idx * 5))
this.metrics.setItemPopularity(popularity)
// 设置物品特征
const features = new Map<string, string[]>()
items.forEach((item, idx) => {
const tagPool = ['科技', '生活', '娱乐', '编程', '音乐', '旅行', '美食', '电影']
features.set(item, [tagPool[idx % tagPool.length], tagPool[(idx + 3) % tagPool.length]])
})
this.metrics.setItemFeatures(features)
}
private setupABTest(): void {
const config: ExperimentConfig = {
experimentId: 'exp_001',
name: '推荐算法V2测试',
description: '对比旧推荐算法和新推荐算法的效果差异',
variants: [
{ variantId: 'control', name: '旧算法', ratio: 0.5, isControl: true, params: { algorithm: 'v1' } },
{ variantId: 'treatment', name: '新算法', ratio: 0.5, isControl: false, params: { algorithm: 'v2' } },
],
metrics: ['ctr', 'conversion_rate', 'avg_dwell_time'],
sampleRatio: 1.0,
minSampleSize: 100,
confidenceLevel: 0.95,
startDate: Date.now() - 7 * 86400000,
endDate: Date.now() + 7 * 86400000,
}
this.abTest.registerExperiment(config)
// 模拟A/B测试数据
this.simulateABTestData()
}
private simulateABTestData(): void {
const users = Array.from({ length: 50 }, (_, i) => `user_${i}`)
for (const userId of users) {
const variant = this.abTest.assignVariant(userId, 'exp_001')
// 模拟指标数据
const baseCTR = variant === 'control' ? 0.12 : 0.15 // 新算法CTR更高
const baseCVR = variant === 'control' ? 0.03 : 0.04
const baseDwellTime = variant === 'control' ? 45 : 55
// 添加随机噪声
const noise = () => (Math.random() - 0.5) * 0.1
this.abTest.recordMetric(userId, 'exp_001', 'ctr', Math.max(0, baseCTR + noise()))
this.abTest.recordMetric(userId, 'exp_001', 'conversion_rate', Math.max(0, baseCVR + noise() * 0.5))
this.abTest.recordMetric(userId, 'exp_001', 'avg_dwell_time', Math.max(0, baseDwellTime + noise() * 30))
}
}
private runEvaluation(): void {
this.isEvaluating = true
setTimeout(() => {
this.evalResultsA = this.metrics.evaluate(this.mockDataA, 6, 10)
this.evalResultsB = this.metrics.evaluate(this.mockDataB, 6, 10)
this.abResults = this.abTest.analyzeAllMetrics('exp_001')
this.isEvaluating = false
}, 500)
}
build() {
Navigation() {
Column() {
// 模式切换
this.TabBar()
// 评估按钮
Button('运行评估分析')
.fontSize(14)
.fontColor('#FFFFFF')
.backgroundColor('#7B68EE')
.borderRadius(20)
.padding({ left: 24, right: 24, top: 8, bottom: 8 })
.enabled(!this.isEvaluating)
.margin({ top: 12, bottom: 12 })
.onClick(() => this.runEvaluation())
if (this.isEvaluating) {
this.LoadingView()
} else if (this.activeTab === 0) {
this.MetricsView()
} else {
this.ABTestView()
}
}
.width('100%')
.height('100%')
.backgroundColor('#0f0f1a')
}
.title('推荐评估')
.titleMode(NavigationTitleMode.Mini)
.navBarStyle(NavigationBarStyle.Constant)
}
// ======== 子组件 ========
@Builder
TabBar() {
Row() {
Text('指标评估')
.fontSize(14)
.fontColor(this.activeTab === 0 ? '#7B68EE' : '#999999')
.fontWeight(this.activeTab === 0 ? FontWeight.Bold : FontWeight.Normal)
.padding({ left: 16, right: 16, top: 6, bottom: 6 })
.borderRadius(14)
.backgroundColor(this.activeTab === 0 ? 'rgba(123,104,238,0.15)' : 'transparent')
.onClick(() => { this.activeTab = 0 })
Text('A/B测试')
.fontSize(14)
.fontColor(this.activeTab === 1 ? '#7B68EE' : '#999999')
.fontWeight(this.activeTab === 1 ? FontWeight.Bold : FontWeight.Normal)
.padding({ left: 16, right: 16, top: 6, bottom: 6 })
.borderRadius(14)
.backgroundColor(this.activeTab === 1 ? 'rgba(123,104,238,0.15)' : 'transparent')
.onClick(() => { this.activeTab = 1 })
}
.width('100%')
.justifyContent(FlexAlign.Center)
.padding({ top: 8 })
}
@Builder
LoadingView() {
Column() {
LoadingProgress().width(40).height(40).color('#7B68EE')
Text('正在计算评估指标...').fontSize(13).fontColor('#999999').margin({ top: 8 })
}
.width('100%').height('50%')
.justifyContent(FlexAlign.Center)
}
@Builder
MetricsView() {
if (this.evalResultsA && this.evalResultsB) {
Scroll() {
Column() {
// 算法对比
Text('算法A vs 算法B 指标对比')
.fontSize(16)
.fontColor('#E0E0E0')
.fontWeight(FontWeight.Bold)
.margin({ bottom: 12 })
this.MetricComparisonRow('Precision@6', this.evalResultsA.precision, this.evalResultsB.precision)
this.MetricComparisonRow('Recall@6', this.evalResultsA.recall, this.evalResultsB.recall)
this.MetricComparisonRow('F1@6', this.evalResultsA.f1Score, this.evalResultsB.f1Score)
this.MetricComparisonRow('NDCG@6', this.evalResultsA.ndcg, this.evalResultsB.ndcg)
this.MetricComparisonRow('MRR', this.evalResultsA.mrr, this.evalResultsB.mrr)
this.MetricComparisonRow('HitRate@6', this.evalResultsA.hitRate, this.evalResultsB.hitRate)
this.MetricComparisonRow('覆盖率', this.evalResultsA.coverage, this.evalResultsB.coverage)
this.MetricComparisonRow('多样性', this.evalResultsA.diversity, this.evalResultsB.diversity)
this.MetricComparisonRow('新颖性', this.evalResultsA.novelty, this.evalResultsB.novelty)
Text(`样本量: ${this.evalResultsA.sampleSize}`)
.fontSize(12)
.fontColor('#888888')
.margin({ top: 12 })
}
.padding({ left: 16, right: 16, top: 8 })
}
.layoutWeight(1)
} else {
Column() {
Text('点击"运行评估分析"查看结果')
.fontSize(14).fontColor('#999999')
}
.width('100%').height('50%')
.justifyContent(FlexAlign.Center)
}
}
@Builder
MetricComparisonRow(name: string, valueA: number, valueB: number) {
Row() {
Text(name)
.fontSize(13)
.fontColor('#CCCCCC')
.width(90)
// 算法A
Text(valueA.toFixed(4))
.fontSize(13)
.fontColor('#F5A623')
.width(70)
.textAlign(TextAlign.Center)
// 算法B
Text(valueB.toFixed(4))
.fontSize(13)
.fontColor(valueB > valueA ? '#4CAF50' : '#D0021B')
.width(70)
.textAlign(TextAlign.Center)
// 变化
Text(`${valueB > valueA ? '↑' : '↓'} ${Math.abs(((valueB - valueA) / Math.max(valueA, 0.0001)) * 100).toFixed(1)}%`)
.fontSize(12)
.fontColor(valueB > valueA ? '#4CAF50' : '#D0021B')
.layoutWeight(1)
.textAlign(TextAlign.End)
}
.width('100%')
.padding({ top: 8, bottom: 8 })
.borderRadius(8)
.backgroundColor('rgba(255,255,255,0.04)')
.margin({ bottom: 4 })
}
@Builder
ABTestView() {
if (this.abResults.length > 0) {
Scroll() {
Column() {
Text('A/B测试结果')
.fontSize(16)
.fontColor('#E0E0E0')
.fontWeight(FontWeight.Bold)
.margin({ bottom: 12 })
ForEach(this.abResults, (result: ExperimentResult) => {
this.ABResultCard(result)
})
}
.padding({ left: 16, right: 16, top: 8 })
}
.layoutWeight(1)
} else {
Column() {
Text('点击"运行评估分析"查看A/B测试结果')
.fontSize(14).fontColor('#999999')
}
.width('100%').height('50%')
.justifyContent(FlexAlign.Center)
}
}
@Builder
ABResultCard(result: ExperimentResult) {
Column() {
Row() {
Text(result.metricName)
.fontSize(15)
.fontColor('#E0E0E0')
.fontWeight(FontWeight.Medium)
if (result.isSignificant) {
Text('显著 ✓')
.fontSize(11)
.fontColor('#4CAF50')
.padding({ left: 6, right: 6, top: 2, bottom: 2 })
.borderRadius(4)
.backgroundColor('rgba(76,175,80,0.15)')
.margin({ left: 8 })
} else {
Text('不显著')
.fontSize(11)
.fontColor('#999999')
.padding({ left: 6, right: 6, top: 2, bottom: 2 })
.borderRadius(4)
.backgroundColor('rgba(255,255,255,0.06)')
.margin({ left: 8 })
}
}
.margin({ bottom: 8 })
Row() {
Column() {
Text('对照组')
.fontSize(11).fontColor('#888888')
Text(result.controlMean.toFixed(4))
.fontSize(14).fontColor('#F5A623').fontWeight(FontWeight.Medium)
}
.layoutWeight(1)
Column() {
Text('实验组')
.fontSize(11).fontColor('#888888')
Text(result.treatmentMean.toFixed(4))
.fontSize(14).fontColor('#7B68EE').fontWeight(FontWeight.Medium)
}
.layoutWeight(1)
Column() {
Text('变化')
.fontSize(11).fontColor('#888888')
Text(`${result.relativeDiff > 0 ? '+' : ''}${(result.relativeDiff * 100).toFixed(1)}%`)
.fontSize(14)
.fontColor(result.relativeDiff > 0 ? '#4CAF50' : '#D0021B')
.fontWeight(FontWeight.Medium)
}
.layoutWeight(1)
}
Row() {
Text(`p值: ${result.pValue.toFixed(4)} | 样本: 对照${result.controlSampleSize} / 实验${result.treatmentSampleSize}`)
.fontSize(11).fontColor('#888888')
}
.margin({ top: 6 })
Row() {
Text(`置信区间: [${result.confidenceInterval[0].toFixed(4)}, ${result.confidenceInterval[1].toFixed(4)}]`)
.fontSize(11).fontColor('#888888')
}
.margin({ top: 4 })
}
.width('100%')
.padding(14)
.borderRadius(12)
.backgroundColor('rgba(255,255,255,0.06)')
.backdropBlur(20)
.margin({ bottom: 10 })
}
}
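4. Pitfalls and Caveats
4.1 Insufficient sample size
Pitfall: with too small a sample, the statistical test is meaningless. If only 10 users take part in an experiment, even a large metric difference may be random fluctuation.
Fix:
- Use calculateMinSampleSize to compute the required sample size up front
- Rule of thumb: each variant needs at least 100 independent users; treat this as a floor, since detecting a modest lift usually needs far more
- For small samples, consider Bayesian methods instead of frequentist ones
// Compute the minimum sample size
const minSample = this.abTest.calculateMinSampleSize(
  0.12, // baseline CTR of 12%
  0.20, // detect a 20% relative lift
  0.95, // 95% confidence level
  0.80  // 80% statistical power
)
console.info(`[ABTest] Minimum sample size: ${minSample} per group`)
// With z = 1.96 (two-sided 95%) and z = 0.84 (80% power), the formula gives roughly 3,100 users per group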
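4.2 Inconsistent metric definitions
Pitfall: different teams define "click-through rate" differently. Some use clicks / impressions, some use clicks / recommendations, and some use clicking users / exposed users.
Fix:
- State the exact computation of every metric in the experiment configuration
- Maintain a metric dictionary with unified names and definitions
- Distinguish rate metrics from average metrics
// Metric definition schema
interface MetricDefinition {
  name: string
  description: string
  formula: string
  type: 'rate' | 'average' | 'count'
  unit: string
  direction: 'higher_is_better' | 'lower_is_better' | 'neutral'
}
const CTR_METRIC: MetricDefinition = {
  name: 'ctr',
  description: 'Click-through rate',
  formula: 'clicks / impressions',
  type: 'rate',
  unit: '%',
  direction: 'higher_is_better',
}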
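One reason the rate/average distinction matters is that a rate metric built from pooled counts is usually tested differently from a per-user average. As a rough sketch, a two-proportion z-test on pooled click and impression counts could look like the following. It assumes impressions are independent, which repeated impressions from the same user violate, and the counts are invented for illustration.

```typescript
interface RateSample {
  successes: number // e.g. clicks
  trials: number    // e.g. impressions
}

// Two-proportion z statistic, using the pooled rate under H0: pA = pB.
function twoProportionZ(control: RateSample, treatment: RateSample): number {
  const pA = control.successes / control.trials
  const pB = treatment.successes / treatment.trials
  const pooled =
    (control.successes + treatment.successes) / (control.trials + treatment.trials)
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.trials + 1 / treatment.trials)
  )
  return se > 0 ? (pB - pA) / se : 0
}

// Hypothetical counts: 120 clicks in 1000 impressions vs. 150 in 1000.
const z = twoProportionZ(
  { successes: 120, trials: 1000 },
  { successes: 150, trials: 1000 }
)
console.log(z.toFixed(3))
```

Here z is about 1.963, barely past the 1.96 threshold for a two-sided test at the 95% level, even though the relative lift is 25%. That is a useful reminder of how much traffic a rate metric needs.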
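4.3 Simpson's paradox
Pitfall: the aggregate data shows the treatment group doing better, yet when you break the data down by segment, the control group is better in every single segment. This is Simpson's paradox.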
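A tiny numeric illustration makes the paradox less mysterious. In the made-up counts below, the treatment arm happens to contain far more returning users, who click more regardless of the algorithm, so it wins overall while losing in each segment.

```typescript
interface Cell {
  clicks: number
  impressions: number
}
type Arm = Record<string, Cell>

// Hypothetical counts: the segment mix differs sharply between the two arms.
const controlArm: Arm = {
  returning: { clicks: 20, impressions: 100 },  // 20%
  fresh: { clicks: 45, impressions: 900 },      // 5%
}
const treatmentArm: Arm = {
  returning: { clicks: 171, impressions: 900 }, // 19%
  fresh: { clicks: 4, impressions: 100 },       // 4%
}

const rate = (c: Cell): number => c.clicks / c.impressions

// Overall rate: total clicks over total impressions across all segments.
function overallRate(arm: Arm): number {
  let clicks = 0
  let impressions = 0
  for (const key of Object.keys(arm)) {
    clicks += arm[key].clicks
    impressions += arm[key].impressions
  }
  return clicks / impressions
}

console.log('overall', overallRate(controlArm), overallRate(treatmentArm))
console.log('returning', rate(controlArm.returning), rate(treatmentArm.returning))
console.log('fresh', rate(controlArm.fresh), rate(treatmentArm.fresh))
```

Overall the treatment shows 17.5% against 6.5%, yet the control is one point ahead in both segments. The imbalance in segment mix, not the algorithm, produces the aggregate gap.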
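Fix:
- Run a stratified analysis along key dimensions (such as new vs. returning users, or device type)
- Make sure the treatment and control groups have the same user distribution
- Use stratified randomization rather than simple randomization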
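// Stratified analysis: split metric records by stratum before testing.
// stratumOf is a caller-supplied lookup (for example new vs. returning user),
// because ExperimentMetrics itself carries no stratification field.
function stratifiedAnalysis(
  metrics: ExperimentMetrics[],
  stratumOf: (userId: string) => string
): Map<string, ExperimentMetrics[]> {
  const groups: Map<string, ExperimentMetrics[]> = new Map()
  for (const m of metrics) {
    const stratum = stratumOf(m.userId) || 'default'
    const bucket = groups.get(stratum) || []
    bucket.push(m)
    groups.set(stratum, bucket)
  }
  // Run the significance test separately on each stratum's records
  return groups
}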
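4.4 Experiment contamination
Pitfall: the same user may be assigned to different variants on different devices, which contaminates the experiment data.
Fix:
- Assign by user ID rather than device ID
- Use the HarmonyOS distributed account system to keep assignments consistent across devices
- Add an experiment cool-down period so that users do not switch variants within a short time
4.5 Trade-offs between metrics
Pitfall: optimizing one metric can hurt another. For example, raising click-through rate may lower user satisfaction.
Fix:
- Define guardrail metrics: core metrics that must not regress
- Use a composite score (OEC, Overall Evaluation Criterion)
- Look at long-term and short-term metrics together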
// Guardrail check: fail as soon as any guardrail metric falls below its threshold
function checkGuardrails(
  results: ExperimentResult[],
  guardrails: Array<{ metric: string; minThreshold: number }>
): boolean {
  for (const guardrail of guardrails) {
    const result = results.find(r => r.metricName === guardrail.metric)
    if (result && result.treatmentMean < guardrail.minThreshold) {
      console.warn(`[ABTest] Guardrail metric ${guardrail.metric} below threshold: ${result.treatmentMean} < ${guardrail.minThreshold}`)
      return false
    }
  }
  return true
}
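The composite-score idea (OEC) mentioned above can be sketched as a weighted sum of relative lifts. This is only one possible formulation: the weights and the function name are hypothetical, and real weights would come from business priorities rather than from this article.

```typescript
interface MetricLift {
  metric: string
  relativeDiff: number // e.g. 0.10 means a 10% relative lift
}

// Hypothetical weights, for illustration only.
const oecWeights: Record<string, number> = {
  ctr: 0.5,
  conversion_rate: 0.3,
  avg_dwell_time: 0.2,
}

// Weighted sum of relative lifts; metrics without a weight contribute nothing.
function overallEvaluationCriterion(
  lifts: MetricLift[],
  weights: Record<string, number>
): number {
  let score = 0
  for (const lift of lifts) {
    score += (weights[lift.metric] || 0) * lift.relativeDiff
  }
  return score
}

const oec = overallEvaluationCriterion(
  [
    { metric: 'ctr', relativeDiff: 0.10 },
    { metric: 'conversion_rate', relativeDiff: -0.05 },
    { metric: 'avg_dwell_time', relativeDiff: 0.20 },
  ],
  oecWeights
)
console.log(oec.toFixed(3))
```

In this example the gains in CTR and dwell time outweigh the drop in conversion, giving a score of 0.075. A composite score like this complements guardrails rather than replacing them, since a positive total can still hide an unacceptable regression in one metric.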
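5. Adapting to HarmonyOS 6
5.1 Version differences
| Feature | HarmonyOS 5.0 | HarmonyOS 6 |
|---|---|---|
| Data analysis | Implemented by hand | New Analysis Kit |
| Experiment configuration | Local storage | Cloud experiment platform integration |
| Data reporting | HTTP requests | Enhanced data reporting API |
| Privacy protection | Basic anonymization | Differential privacy support |
5.2 Migration guide
1. Differentially private data reporting
HarmonyOS 6 supports differential privacy: noise is added to the data before it is reported, which protects user privacy: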
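// HarmonyOS 6 differential privacy (conceptual code)
// The Laplace mechanism below is plain arithmetic and needs no SDK import
function reportMetricWithDP(value: number, epsilon: number = 1.0): number {
  // Laplace mechanism: add noise scaled to sensitivity / epsilon
  const sensitivity = 1.0 // sensitivity of the reported value
  const scale = sensitivity / epsilon
  const noise = laplaceNoise(scale)
  return value + noise
}
// Draw Laplace(0, scale) noise by inverse-CDF sampling of a uniform value
function laplaceNoise(scale: number): number {
  const u = Math.random() - 0.5
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u))
}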
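2. Cloud experiment platform integration
HarmonyOS 6 can connect to a cloud experiment platform so that experiments are configured remotely:
// HarmonyOS 6 cloud experiments (conceptual code)
import { experiment } from '@kit.AnalysisKit'
async function syncExperimentConfig(): Promise<void> {
  // Pull experiment configurations from the cloud
  const configs = await experiment.fetchExperiments({
    appId: 'com.example.app',
    userId: getCurrentUserId(),
  })
  // Register them with the local A/B testing framework
  for (const config of configs) {
    abTest.registerExperiment(config)
  }
}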
3. Enhanced data reporting
HarmonyOS 6 provides a more efficient data reporting API:
// HarmonyOS 6 data reporting (conceptual code)
import { analytics } from '@kit.AnalysisKit'
// Report metric data in batches
const reporter = analytics.getReporter({
  appId: 'com.example.app',
  batchSize: 50,
  flushInterval: 30000,
})
reporter.report({
  event: 'recommendation_metric',
  params: {
    experiment_id: 'exp_001',
    variant: 'treatment',
    ctr: 0.15,
    timestamp: Date.now(),
  },
})
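6. Summary
This article implemented an on-device recommendation evaluation and A/B testing framework for HarmonyOS. The key points by module:
| Module | Core functionality | Key techniques |
|---|---|---|
| Evaluation metrics | Precision, Recall, NDCG, MRR, and others | Ranking quality evaluation, coverage analysis |
| A/B testing framework | Experiment assignment, metric collection, statistical testing | Deterministic hash bucketing, two-sample t-test |
| Statistical testing | p-value computation, confidence intervals | Normal approximation, sample size estimation |
| Visualization | Metric comparison, experiment result display | Algorithm A vs. B comparison, significance labels |
Key takeaways:
- 📊 The evaluation metric system has four layers: accuracy, ranking quality, diversity, and business metrics. None of them can be skipped
- 🎯 NDCG is a key ranking metric because it accounts for the position of each recommendation
- 🧪 The three essentials of A/B testing: random assignment, a single changed variable, and statistical significance
- 📐 Sample size is the lifeline of an A/B test: treat 100 users per group as a floor and compute the real requirement with the formula
- 🛡️ Guardrail metrics prevent the tragedy of winning on click-through rate while losing on user experience
- 🔒 Differential privacy makes data reporting safer, with native support in HarmonyOS 6
This concludes the five-part recommender system series. From recommendation algorithm principles through collaborative filtering, content-based recommendation, and real-time recommendation to quality evaluation, the series covers the full development path of an on-device recommender system on HarmonyOS.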