第九章集成学习 Boosting案例：信用卡欺诈分类

哈伦2019

535人浏览 · 2026-05-26 20:52:05

哈伦2019 · 2026-05-26 20:52:05 发布

案例：信用卡欺诈分类

在这里插入图片描述

案例背景

数据集包含2013年9月欧洲持卡人的信用卡交易。
该数据集显示了两天内发生的交易，其中284,807宗交易中只有492个欺诈。
数据集高度不平衡，正类(欺诈交易）仅占所有交易的0.172%。

它只包含数值输入变量，这是一个PCA变换的结果。

出于保密问题，没有提供原始特征和更多关于数据的背景信息。

特征V1, V2，…V28为主成分分析(PCA)得到的主成分;
唯一没有使用PCA转换的特征是时间和数量。
Feature Time包含每个事务与数据集中的第一个事务之间所经过的秒数。
特征Amount是指交易金额，此特征可用于示例依赖的成本敏感学习。

Feature Class是标签变量，如果发生欺诈，它的值为1，否则为0。

数据读取与划分

import pandas as pd 
import numpy as np
import matplotlib
from IPython.display import Image
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , f1_score ,roc_auc_score
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from xgboost import XGBClassifier

pd.set_option('display.max_columns', 100)

#TRAIN/TEST SPLIT
TEST_SIZE = 0.20 # test size using_train_test_split

RANDOM_STATE = 42

接着读取数据，并且输出数据的信息

data = pd.read_csv("./creditcard.csv")

print("Credit Card Fraud Detection data -  rows:",data.shape[0]," columns:", data.shape[1])

Credit Card Fraud Detection data -  rows: 284807  columns: 31

接着使用head观察数据

data.head()

	Time	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	Amount
0	0.0	-1.359807	-0.072781	2.536347	1.378155	-0.338321	0.462388	0.239599	0.098698	0.363787	0.090794	-0.551600	-0.617801	-0.991390	-0.311169	1.468177	-0.470401	0.207971	0.025791	0.403993	0.251412	-0.018307	0.277838	-0.110474	0.066928	0.128539	-0.189115	0.133558	-0.021053	149.62
1	0.0	1.191857	0.266151	0.166480	0.448154	0.060018	-0.082361	-0.078803	0.085102	-0.255425	-0.166974	1.612727	1.065235	0.489095	-0.143772	0.635558	0.463917	-0.114805	-0.183361	-0.145783	-0.069083	-0.225775	-0.638672	0.101288	-0.339846	0.167170	0.125895	-0.008983	0.014724	2.69
2	1.0	-1.358354	-1.340163	1.773209	0.379780	-0.503198	1.800499	0.791461	0.247676	-1.514654	0.207643	0.624501	0.066084	0.717293	-0.165946	2.345865	-2.890083	1.109969	-0.121359	-2.261857	0.524980	0.247998	0.771679	0.909412	-0.689281	-0.327642	-0.139097	-0.055353	-0.059752	378.66
3	1.0	-0.966272	-0.185226	1.792993	-0.863291	-0.010309	1.247203	0.237609	0.377436	-1.387024	-0.054952	-0.226487	0.178228	0.507757	-0.287924	-0.631418	-1.059647	-0.684093	1.965775	-1.232622	-0.208038	-0.108300	0.005274	-0.190321	-1.175575	0.647376	-0.221929	0.062723	0.061458	123.50
4	2.0	-1.158233	0.877737	1.548718	0.403034	-0.407193	0.095921	0.592941	-0.270533	0.817739	0.753074	-0.822843	0.538196	1.345852	-1.119670	0.175121	-0.451449	-0.237033	-0.038195	0.803487	0.408542	-0.009431	0.798278	-0.137458	0.141267	-0.206010	0.502292	0.219422	0.215153	69.99

从数据结果中看到数据集中共284,807条记录

查看数据集中的数据缺失情况

total = data.isnull().sum().sort_values(ascending = False)
percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()

	Time	V16	Amount	V28	V27	V26	V25	V24	V23	V22	V21	V20	V19	V18	V17	V15	V1	V14	V13	V12	V11	V10	V9	V8	V7	V6	V5	V4	V3	V2	Class
Total	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Percent	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

从上面的结果来看，数据集中不存在缺失数据

按照是否被欺诈进行分类可视化

temp = data["Class"].value_counts()
df = pd.DataFrame({'Class': temp.index,'values': temp.values})

trace = go.Bar(
    x = df['Class'],y = df['values'],
    name="Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)",
    marker=dict(color="Blue"),
    text=df['values']
)
temp_data = [trace]
layout = dict(title = 'Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)',
          xaxis = dict(title = 'Class', showticklabels=True), 
          yaxis = dict(title = 'Number of transactions'),
          hovermode = 'closest',width=600
         )
fig = dict(data=temp_data, layout=layout)
iplot(fig, filename='class')

在这里插入图片描述

从可视化结果来看，数据集中是存在数据不平衡性的，只有492条诈骗记录(0.172%)

接下来对欺诈案例按照时间维度进行可视化

class_0 = data.loc[data['Class'] == 0]["Time"]
class_1 = data.loc[data['Class'] == 1]["Time"]

hist_data = [class_0, class_1]
group_labels = ['Not Fraud', 'Fraud']

fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')

在这里插入图片描述

从结果来看，欺诈交易分布的较为均匀，且容易在夜间持续发生。

接下来将将数据分为训练集和测试集

train_df, test_df = train_test_split(data, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True)

由于数据集中存在数据不平衡的现象，因此需要对训练数据进行平衡，通过对数量较多的非欺诈样本进行欠采样，将欺诈样本的比例提升到1%

# 获得欺诈样本的数量
train_fraud_df  = train_df[train_df['Class'] ==1]
no_of_fraud = train_fraud_df.shape[0]

# 对非欺诈样本进行欠采样
no_of_non_fraud = no_of_fraud * 99
train_non_fraud_df = train_df[train_df['Class'] ==0].sample( no_of_non_fraud , random_state =RANDOM_STATE)
no_of_non_fraud = train_non_fraud_df.shape[0]

# 将欠采样后的数据进行整合，并且对数据的顺序进行打乱

train_df = pd.concat([train_fraud_df, train_non_fraud_df] , axis =0 )
train_df = train_df.sample(frac = 1,random_state =RANDOM_STATE)

Total Fraud in Train Data : 394
Total non Fraud in Train Data : 39006

target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',\
       'Amount']

AdaBoost模型搭建与训练

ada_clf = AdaBoostClassifier(random_state=RANDOM_STATE)

ada_clf.fit(train_df[predictors], train_df[target].values)

test_df['prediction'] = ada_clf.predict(test_df[predictors])

cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7,7))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues" , fmt='d')
plt.title('Confusion Matrix', fontsize=16)
plt.show()

在这里插入图片描述

metric_data = pd.DataFrame(columns =['Model Name','Detection Rate' ,'AUC','F1 Score','Accuracy','Fraud Loss Saved'])

# we will use original data as Amount is transformed for modelling 
def fraud_loss_saved ( dataset , key) :

    df = dataset.copy()
    total_fraud_amt = df[df['Class'] ==1]['Amount'].sum()
    print("Total Fraud Amount in Test Data : " +  str(round(total_fraud_amt,2)))
    total_fraud_amt_detected = df.loc[(df['prediction'] ==1) & (df['Class']==1) ]['Amount'].sum()
    print("Total Fraud Amount Detected in Test Data : " +  str(round(total_fraud_amt_detected,2)))
    print("Fraud Loss Saved (%): " + str(round(100*total_fraud_amt_detected/total_fraud_amt ,2)))
    detection_rate  = 100 * (df[df['prediction']==1]['Class'].sum())/df['Class'].sum()
    print("Detection Rate (%) : " + str(round(detection_rate , 2)))
    accuracy = 100*accuracy_score(df['Class'] ,df['prediction'])
    print("Accuracy : " + str(round(accuracy ,2)))
    f1 = f1_score(df['Class'] ,df['prediction'])
    print("F1 Score : " + str(round(f1 ,4)))   
    auc_score = roc_auc_score(df['Class'],df['prediction'])
    print("AUC Score : " + str(round(auc_score,4)))
    values = []
    values.append(key)
    values.append(detection_rate)
    values.append(auc_score)
    values.append(f1)
    values.append(accuracy)
    values.append(round(100*total_fraud_amt_detected/total_fraud_amt ,2))
    
    final_values =[]
    final_values.append(values)
    temp_df = pd.DataFrame(final_values ,columns =['Model Name','Detection Rate' ,'AUC','F1 Score','Accuracy','Fraud Loss Saved'])
    
    global metric_data
    
    metric_data = pd.concat([metric_data,temp_df ] , axis = 0 )

fraud_loss_saved(test_df ,'AdaBoost - Test Data')

Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12019.13
Fraud Loss Saved (%): 74.75
Detection Rate (%) : 79.59
Accuracy : 99.91
F1 Score : 0.7573
AUC Score : 0.8977

GBDT模型搭建与训练

gbdf_clf = GradientBoostingClassifier(random_state=RANDOM_STATE)

gbdf_clf.fit(train_df[predictors], train_df[target].values)

test_df['prediction'] = gbdf_clf.predict(test_df[predictors])

cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7,7))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues" , fmt='d')
plt.title('Confusion Matrix', fontsize=16)
plt.show()

在这里插入图片描述

fraud_loss_saved(test_df ,'GBDT - Test Data')

Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12252.02
Fraud Loss Saved (%): 76.2
Detection Rate (%) : 82.65
Accuracy : 99.79
F1 Score : 0.5724
AUC Score : 0.9124

XGBoost模型搭建与训练

xgb_clf = XGBClassifier(random_state=RANDOM_STATE)

xgb_clf.fit(train_df[predictors], train_df[target].values)

test_df['prediction'] = xgb_clf.predict(test_df[predictors])

cm = pd.crosstab(test_df[target].values, test_df['prediction'], rownames=['Actual'], colnames=['Predicted'])
fig, ax1 = plt.subplots(ncols=1, figsize=(7,7))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues" , fmt='d')
plt.show()

在这里插入图片描述

fraud_loss_saved(test_df ,'XGBoost - Test Data')

Total Fraud Amount in Test Data : 16078.4
Total Fraud Amount Detected in Test Data : 12346.92
Fraud Loss Saved (%): 76.79
Detection Rate (%) : 83.67
Accuracy : 99.94
F1 Score : 0.8241
AUC Score : 0.9182

模型对比

metric_data

Model Name	Detection Rate	AUC	F1 Score	Accuracy	Fraud Loss Saved
AdaBoost - Test Data	79.591837	0.897695	0.757282	99.912222	74.75
GBDT - Test Data	82.653061	0.912351	0.572438	99.787578	76.20
XGBoost - Test Data	83.673469	0.918200	0.824121	99.938556	76.79