xgboost系列丨xgboost建树过程分析及代码实现
前面我们通过对论文中的公式详细解读,一步步推导了XGBoost的优化目标以及建树方法。下面我们就来动手实践,拿真实的数据来手动计算,并且使用python来实现一个简易的XGBoost。
01
手动计算还原xgboost的过程
XGBoost
的建树过程包括下面几个关键步骤。
-
计算分裂前样本的
G,H
(每个样本的g,h
求和) -
贪心枚举每个特征每个值做为分隔条件
-
计算分隔后左右节点的
G_l,H_l,G_r,H_r
,算出Gain
-
更新最大增益
Gain_max
,更新分隔点。 -
最终得到最优分隔点。
-
根据上述过程递归建树直到终止条件。
-
计算叶节点的权重w
在建完一个树后,再建下棵树时,注意G/H/Gain
的计算用到的预测值要进行更新,对之前的所有树的预测值求和可以得到建下棵树时的。
这里我们使用一个简单的UCI数据集 http://archive.ics.uci.edu/ml/datasets/Container+Crane+Controller+Data+Set
数据集的样本条数只有15条,2个特征。具体如下:
<< 左右滑动查看更多 >>
import pandas as pd
df = pd.read_csv('1.csv',index_col=0)
# df = pd.read_clipboard(index_col=0) 可以直接复制下面这个表格后使用read_clipboard读成DataFrame
对于二分类问题概率是预测值经过Sigmoid函数变换后得到的,默认预测概率为0.5,也就是默认的预测值为0。
<< 左右滑动查看更多 >>
def log_loss_obj(preds, labels):
preds = 1.0 / (1.0 + np.exp(-preds))
grad = preds - labels
hess = preds * (1.0 - preds)
return grad, hess
base_predict = np.zeros_like(df.y)
g,h = log_loss_obj(base_predict,df.y.values) # 计算每个样本的g和h
df['g'], df['h'] = g,h
df
x1 | x2 | y | g | h | |
---|---|---|---|---|---|
ID | |||||
1 | 1 | -5 | 0 | 0.5 | 0.25 |
2 | 2 | 5 | 0 | 0.5 | 0.25 |
3 | 3 | -2 | 1 | -0.5 | 0.25 |
4 | 1 | 2 | 1 | -0.5 | 0.25 |
5 | 2 | 0 | 1 | -0.5 | 0.25 |
6 | 6 | -5 | 1 | -0.5 | 0.25 |
7 | 7 | 5 | 1 | -0.5 | 0.25 |
8 | 6 | -2 | 0 | 0.5 | 0.25 |
9 | 7 | 2 | 0 | 0.5 | 0.25 |
10 | 6 | 0 | 1 | -0.5 | 0.25 |
11 | 8 | -5 | 1 | -0.5 | 0.25 |
12 | 9 | 5 | 1 | -0.5 | 0.25 |
13 | 10 | -2 | 0 | 0.5 | 0.25 |
14 | 8 | 2 | 0 | 0.5 | 0.25 |
15 | 9 | 0 | 1 | -0.5 | 0.25 |
首先对特征x1上的不同的值进行枚举分裂增益。特征一总共有以下几个取值:
sorted(df.x1.unique())
# [1, 2, 3, 6, 7, 8, 9, 10]
对x1
的取值进行排序后,最小值为1,显然没有样本的x1
特征值<1,以x1
划分相当于没有进行划分,因此从x1=2
进行划分。
<< 左右滑动查看更多 >>
def split_data(df, split_feature, split_value):
left_instance = df[df[split_feature] < split_value]
right_instance = df[df[split_feature] >= split_value]
return left_instance, right_instance
left_instance, right_instance = split_data(df,'x1',2)
left_instance.index.tolist(),right_instance.index.tolist()
# ([1, 4], [2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
可以看到样本1,4
被划分到了左侧,其余样本划分到了右侧。对划分后的数据计算G/H
,并求出分裂后的增益Gain
为0.055727554179566596
。
<< 左右滑动查看更多 >>
reg_lambda = 1
G,H = df.g.sum(), df.h.sum() # 分裂前全部样本的G/H
for thresh_values in sorted(df.x1.unique())[1:]:
G_left, H_left = left_instance[['g', 'h']].sum() # 分裂后的G_l,H_l
G_right, H_right = right_instance[['g', 'h']].sum() # 分裂后的G_r,H_r
Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
(H_right+reg_lambda)-G**2/(H+reg_lambda) # 分裂后的增益计算
对x1
每个取值进行划分,得到下面不同划分下的增益值。
G_left | H_left | G_right | H_right | Gain | |
---|---|---|---|---|---|
2 | 0.0 | 0.50 | -1.5 | 3.25 | 0.055728 |
3 | 0.0 | 1.00 | -1.5 | 2.75 | 0.126316 |
6 | -0.5 | 1.25 | -1.0 | 2.50 | -0.076859 |
7 | -1.0 | 2.00 | -0.5 | 1.75 | -0.049442 |
8 | -1.0 | 2.50 | -0.5 | 1.25 | -0.076859 |
9 | -1.0 | 3.00 | -0.5 | 0.75 | -0.080827 |
10 | -2.0 | 3.50 | 0.5 | 0.25 | 0.615205 |
同样,对于特征x2
每个取值的分裂情况:
G_left | H_left | G_right | H_right | Gain | |
---|---|---|---|---|---|
-2 | -0.5 | 0.75 | -1.0 | 3.00 | -0.080827 |
0 | 0.0 | 1.50 | -1.5 | 2.25 | 0.218623 |
2 | -1.5 | 2.25 | 0.0 | 1.50 | 0.218623 |
5 | -1.0 | 3.00 | -0.5 | 0.75 | -0.080827 |
我们可以看到当最大的增益为特征x1=10
时,增益为0.615205
,因此将x1<10
作为第一个分裂条件。
接下来开始进行第2个分裂条件的选择:
<< 左右滑动查看更多 >>
use_instance,_ = split_data(df,'x1',10)
G,H = use_instance.g.sum(), use_instance.h.sum()
for thresh_values in sorted(use_instance.x1.unique())[1:]:
left_instance, right_instance = split_data(use_instance,'x1',thresh_values)
G_left, H_left = left_instance[['g', 'h']].sum()
G_right, H_right = right_instance[['g', 'h']].sum()
Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
(H_right+reg_lambda)-G**2/(H+reg_lambda)
对于特征x1
:
G_left | H_left | G_right | H_right | Gain | |
---|---|---|---|---|---|
2 | 0.0 | 0.50 | -2.0 | 3.00 | 0.111111 |
3 | 0.0 | 1.00 | -2.0 | 2.50 | 0.253968 |
6 | -0.5 | 1.25 | -1.5 | 2.25 | -0.085470 |
7 | -1.0 | 2.00 | -1.0 | 1.50 | -0.155556 |
8 | -1.0 | 2.50 | -1.0 | 1.00 | -0.103175 |
9 | -1.0 | 3.00 | -1.0 | 0.50 | 0.027778 |
对于特征x2
:
G_left | H_left | G_right | H_right | Gain | |
---|---|---|---|---|---|
-2 | -0.5 | 0.75 | -1.5 | 2.75 | -0.146032 |
0 | -0.5 | 1.25 | -1.5 | 2.25 | -0.085470 |
2 | -2.0 | 2.00 | 0.0 | 1.50 | 0.444444 |
5 | -1.5 | 2.75 | -0.5 | 0.75 | -0.146032 |
因此最优的划分为特征x2<2
。
通过不断的对上述过程迭代,即可递归地建出第一棵树。
02
简易版XGBoost实现
XGBoost
的开源实现来构建模型,和下面我们的实现做对比。
<< 左右滑动查看更多 >>
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=2,max_depth=3,learning_rate=0.1,random_state=1,reg_lambda=1,gamma=0,objective='binary:logistic',min_child_weight=0,)
model.fit(df[['x1','x2']],df.y)
xgb.to_graphviz(model,num_trees=0)
可以看到第一棵树的结构如下:
第二棵树的结构如下:
xgb.to_graphviz(model,num_trees=1)
下面我们来用Python
实现自己的简易版XGBoost
。
首先创建节点类,通过节点的递归构建出一棵树。
<< 左右滑动查看更多 >>
from graphviz import Digraph
class Node(object):
"""结点
leaf_value : 记录叶子结点值
split_feature :分裂节点对应的特征名
split_value : 分裂节点对应的特征的值
left : 左子树
right : 右子树
"""
def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):
self.leaf_value = leaf_value
self.split_feature = split_feature
self.split_value = split_value
self.left = left
self.right = right
def show(self):
print(
f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')
def visualize_tree(self):
"""
使用graphviz递归绘制树
"""
def add_nodes_edges(self, dot=None):
if dot is None:
dot = Digraph()
dot.node(name=str(self),
label=f'{self.split_feature}<{self.split_value}')
# Add nodes
if self.left:
if self.left.leaf_value:
dot.node(name=str(self.left),
label=f'leaf={self.left.leaf_value:.10f}')
else:
dot.node(name=str(self.left),
label=f'{self.left.split_feature}<{self.left.split_value}')
dot.edge(str(self), str(self.left))
dot = add_nodes_edges(self.left, dot=dot)
if self.right:
if self.right.leaf_value:
dot.node(name=str(self.right),
label=f'leaf={self.right.leaf_value:.10f}')
else:
dot.node(name=str(self.right),
label=f'{self.right.split_feature}<{self.right.split_value}')
dot.edge(str(self), str(self.right))
dot = add_nodes_edges(self.right, dot=dot)
return dot
dot = add_nodes_edges(self)
return dot
注意这里我们写了一个visualize_tree
方法,使用graphviz
来绘制建好的树的结构,在XGBoost
的开源实现中也是使用类似的方案。该方法不是必须的,但是却对我们了解建树过程有着很好的帮助。同时我们使用show
方法将节点的信息也字符形式显示出来。
下面来看一下损失函数和建树过程的实现:
<< 左右滑动查看更多 >>
# 树的结构参数
reg_lambda = 1 # 叶节点权重L2正则系数
min_samples_split = 1 # 分裂所需的最小样本个数
max_depth = 3 # 树的深度
# 建树过程参数
learning_rate = 0.1 # 学习率
n_estimators = 2 # 树的个数
# log损失函数
def log_loss_obj(preds, labels):
# preds是建该树之前模型的输出,对于二分类问题需要的是概率,因此将该值经过Sigmoid转换
probs = 1.0 / (1.0 + np.exp(-preds))
grad = probs - labels
hess = probs * (1.0 - probs)
return grad, hess
def build_tree(df, feature_names, depth=1):
df = df.copy()
df['g'], df['h'] = log_loss_obj(df.y_pred, df.y)
G, H = df[['g', 'h']].sum()
Gain_max = float('-inf')
# 终止条件 当前节点个数小于分裂所需最小样本个数,深度大于max_depth,叶节点只有一类样本无需再分
if df.shape[0] > min_samples_split and depth <= max_depth and df.y.nunique() > 1:
for feature in feature_names: # 遍历每个特征
thresholds = sorted(set(df[feature])) # 特征取值排序
for thresh_value in thresholds[1:]: # 遍历每个取值
left_instance = df[df[feature] < thresh_value] # 划分到左右节点的样本
right_instance = df[df[feature] >= thresh_value]
G_left, H_left = left_instance[['g', 'h']].sum()
G_right, H_right = right_instance[['g', 'h']].sum()
Gain = G_left**2/(H_left+reg_lambda)+G_right**2 / \
(H_right+reg_lambda)-G**2/(H+reg_lambda) # 评价划分的增益效果
if Gain >= Gain_max:
Gain_max = Gain
split_feature = feature # 最大增益对应的分裂特征
split_value = thresh_value # 最大增益对应的分裂值
left_data = left_instance # 最大增益对应的分裂后左节点样本
right_data = right_instance # 最大增益对应的分裂后右节点样本
left = build_tree(left_data, feature_names, depth+1) # 递归建左子树
right = build_tree(right_data, feature_names, depth+1)# 递归建右子树
return Node(split_feature=split_feature, split_value=split_value, left=left, right=right) # 返回分裂节点
return Node(leaf_value=-G/(H+reg_lambda)*learning_rate) # 分裂终止,返回叶节点权重
对于单棵树的预测,我们使用predict
函数,根据树的分裂条件以及当前数据的信息来确认每个样本被分到当前这棵树的哪个叶节点,得到对应叶节点的权重。
<< 左右滑动查看更多 >>
def predict(x, tree):
# 递归每个分裂点直到样本对应的叶节点
# 终止条件:叶节点
if tree.leaf_value is not None:
return tree.leaf_value
if x[tree.split_feature] < tree.split_value:
return predict(x, tree.left)
else:
return predict(x, tree.right)
根据已建好的树来更新预测结果,进而确定新建树的结构,不断迭代最终得到全部的树:
<< 左右滑动查看更多 >>
trees = []
y_pred = 0 # 初始预测值
for i in range(n_estimators):
df['y_pred'] = y_pred
tree = build_tree(df, feature_names=['x1', 'x2'])
data_weight = df[['x1', 'x2']].apply(
predict, tree=tree, axis=1)
y_pred += data_weight
trees.append(tree)
对于已建好的树,使用我们前面定义的visualize_tree
函数,可以方便的看到树的结构。
第一棵数的结构:
trees[0].visualize_tree()
第二棵树的结构:
trees[1].visualize_tree()
可以看到,我们自己实现的XGBoost
与开源的XGBoost
得到的树的结构是一致的。
03
封 装
<< 左右滑动查看更多 >>
import numpy as np
import pandas as pd
import xgboost as xgb
from graphviz import Digraph
class Node(object):
"""结点
leaf_value : 记录叶子结点值
split_feature :特征i
split_value : 特征i的值
left : 左子树
right : 右子树
"""
def __init__(self, leaf_value=None, split_feature=None, split_value=None, left=None, right=None):
self.leaf_value = leaf_value
self.split_feature = split_feature
self.split_value = split_value
self.left = left
self.right = right
def show(self):
print(
f'weight: {self.leaf_value}, split_feature: {self.split_feature}, split_value: {self.split_value}.')
def visualize_tree(self):
"""
递归查找绘制树
"""
def add_nodes_edges(self, dot=None):
if dot is None:
dot = Digraph()
dot.node(name=str(self),
label=f'{self.split_feature}<{self.split_value}')
# Add nodes
if self.left:
if self.left.leaf_value:
dot.node(name=str(self.left),
label=f'leaf={self.left.leaf_value:.10f}')
else:
dot.node(name=str(self.left),
label=f'{self.left.split_feature}<{self.left.split_value}')
dot.edge(str(self), str(self.left))
dot = add_nodes_edges(self.left, dot=dot)
if self.right:
if self.right.leaf_value:
dot.node(name=str(self.right),
label=f'leaf={self.right.leaf_value:.10f}')
else:
dot.node(name=str(self.right),
label=f'{self.right.split_feature}<{self.right.split_value}')
dot.edge(str(self), str(self.right))
dot = add_nodes_edges(self.right, dot=dot)
return dot
dot = add_nodes_edges(self)
return dot
def log_loss_obj(preds, labels):
preds = 1.0 / (1.0 + np.exp(-preds))
grad = preds - labels
hess = preds * (1.0 - preds)
return grad, hess
def mse_obj(preds, labels):
grad = labels-y_pred
hess = np.ones_like(labels)
return grad, hess
class XGB:
def __init__(self, n_estimators=2, learning_rate=0.1, max_depth=3, min_samples_split=0, reg_lambda=1, base_score=0.5, loss=log_loss_obj):
# 学习控制参数
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.base_score = base_score
# 树参数
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.loss = loss
self.reg_lambda = reg_lambda
self.trees = []
self.feature_names = None
def sigmoid_array(self, x):
return 1 / (1 + np.exp(-x))
def _predict(self, x, tree):
# 循环终止条件:叶节点
if tree.leaf_value is not None:
return tree.leaf_value
if x[tree.split_feature] < tree.split_value:
return self._predict(x, tree.left)
else:
return self._predict(x, tree.right)
def _build_tree(self, df, depth=1):
df = df.copy()
df['g'], df['h'] = self.loss(df.y_pred, df.y)
G, H = df[['g', 'h']].sum()
Gain_max = float('-inf')
if df.shape[0] > self.min_samples_split and depth <= self.max_depth and df.y.nunique() > 1:
for feature in self.feature_names:
thresholds = sorted(set(df[feature]))
for thresh_value in thresholds[1:]:
left_instance = df[df[feature] < thresh_value]
right_instance = df[df[feature] >= thresh_value]
G_left, H_left = left_instance[['g', 'h']].sum()
G_right, H_right = right_instance[['g', 'h']].sum()
Gain = G_left**2/(H_left+self.reg_lambda)+G_right**2 / \
(H_right+self.reg_lambda)-G**2/(H+self.reg_lambda)
if Gain >= Gain_max:
Gain_max = Gain
split_feature = feature
split_value = thresh_value
left_data = left_instance
right_data = right_instance
# print(feature,'Gain:',Gain,'G-Left',G_left,'H-left',H_left,'G-Right',G_right,'H-right',H_right,'----',thresh_value)
# print(Gain_max,split_feature,split_value)
left = self._build_tree(left_data, depth+1)
right = self._build_tree(right_data, depth+1)
return Node(split_feature=split_feature, split_value=split_value, left=left, right=right)
return Node(leaf_value=-G/(H+self.reg_lambda)*self.learning_rate)
def fit(self, X, y):
y_pred = -np.log((1/self.base_score)-1)
df = pd.DataFrame(X)
df['y'] = y
self.feature_names = df.columns.tolist()[:-1]
for i in range(self.n_estimators):
df['y_pred'] = y_pred
tree = self._build_tree(df)
data_weight = df[['x1', 'x2']].apply(
self._predict, tree=tree, axis=1)
y_pred += data_weight
self.trees.append(tree)
def predict(self, X):
df = pd.DataFrame(X)
y_pred = -np.log((1/self.base_score)-1)
for tree in self.trees:
df['y_pred'] = y_pred
data_weight = df[['x1', 'x2']].apply(
self._predict, tree=tree, axis=1)
y_pred += data_weight
return self.sigmoid_array(y_pred)
def __repr__(self):
return 'XGBClassifier('+', '.join(f'{k}={v}' for k, v in self.__dict__.items() if not k.startswith('_'))+')'
使用前面同样的方法,调用我们自己封装好的代码,可以看到很简洁的实现了跟官方开源实现类似的效果。
df = pd.read_csv('1.csv',index_col=0)
model = XGB()
model.fit(df[['x1', 'x2']], df.y)
model.predict(df[['x1', 'x2']])
model.trees[0].visualize_tree()
当然这里只是一个DEMO,来帮助我们更好的理解论文中的公式以及XGBoost
的实现过程,更多的细节优化可以参考官方实现:https://github.com/dmlc/xgboost
04
参 考
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
https://blog.csdn.net/qq_22238533/article/details/79477547
https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py
https://github.com/lxmly/machine-learning/blob/master/decision_tree.py
声明:本文首发于华为NAIE《网络人工智能园地》微信公众号,如有转载,请注明出处。微信公众号二维码为:
- 点赞
- 收藏
- 关注作者
评论(0)