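Lu Wanrong, Xu Jiangchun, Li Yuhui. Unbalanced data processing method considering boundary sparse samples[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2020, 32(3): 495-502.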
Unbalanced data processing method considering boundary sparse samples
Received: 2019-01-03  Revised: 2020-04-27
DOI: 10.3979/j.issn.1673-825X.2020.03.020
Keywords: unbalanced data  mixed sampling  boundary factor  SMOTE algorithm
Funding: National Natural Science Foundation of China (61363043)
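Author  Affiliation  E-mail
Lu Wanrong  Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500  lwr6608@163.com
Xu Jiangchun  Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500  jx19631018@163.com
Li Yuhui  Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500  1484946977@qq.com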
Abstract:
      To address the limitations of existing methods for processing unbalanced data, a hybrid sampling method that considers boundary sparse samples (considering boundary sparse samples-hybrid sampling, CBSS-HS) is proposed. A boundary factor is computed for each sample to identify boundary points, which divides the sample space into a boundary domain and a non-boundary domain. Negative-class samples in the non-boundary domain are undersampled. Because samples in the boundary domain are sparse, the positive-class samples there are oversampled with a max-distance-based synthetic minority oversampling technique (max distance-synthetic minority oversampling technique, MD-SMOTE), which preserves as much of the positive-class information as possible, so that the two classes end up roughly balanced. Recall, F1-value, G-mean and AUC (area under the curve) are used as evaluation metrics. CBSS-HS combined with a support vector machine (SVM) is validated on five datasets with different degrees of imbalance and compared against four other combined models. The results show that CBSS-HS performs well on every evaluation metric across the datasets, with an average improvement of 4.6%. The method can therefore serve as an effective means of processing unbalanced data.
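The abstract names the stages of the method but does not give the formulas, so the following is only a minimal sketch of a CBSS-HS-style pipeline under stated assumptions, not the authors' implementation. Assumed here: the boundary factor is taken to be the share of a sample's k nearest neighbours that carry the other label; a sample counts as a boundary point when that share reaches a hypothetical `threshold`; undersampling is random; and the "max distance" oversampling step interpolates between a boundary positive and the farthest of its k nearest positive neighbours. The names `nearest`, `boundary_factor`, and `hybrid_sample`, and all parameter defaults, are illustrative.

```python
import math
import random


def nearest(points, i, candidates, k):
    """Return up to k indices from candidates closest to points[i] (Euclidean)."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def boundary_factor(points, labels, i, k):
    """ASSUMED definition: share of the k nearest neighbours with the other label."""
    nbrs = nearest(points, i, range(len(points)), k)
    if not nbrs:
        return 0.0
    return sum(labels[j] != labels[i] for j in nbrs) / len(nbrs)


def hybrid_sample(points, labels, k=3, threshold=0.3, seed=0):
    """Hypothetical CBSS-HS-style resampling; labels are 1 (positive) or 0 (negative)."""
    rng = random.Random(seed)
    n = len(points)
    on_boundary = [boundary_factor(points, labels, i, k) >= threshold for i in range(n)]
    pos = [i for i in range(n) if labels[i] == 1]
    neg = [i for i in range(n) if labels[i] == 0]
    neg_boundary = [i for i in neg if on_boundary[i]]
    neg_interior = [i for i in neg if not on_boundary[i]]

    # Undersample only non-boundary negatives; boundary negatives are all kept.
    target = max(n // 2, len(neg_boundary))
    n_keep = min(len(neg_interior), target - len(neg_boundary))
    kept_neg = neg_boundary + rng.sample(neg_interior, n_keep)

    # Oversample boundary positives (fall back to all positives if none qualify).
    seeds = [i for i in pos if on_boundary[i]] or pos
    n_new = max(0, len(kept_neg) - len(pos)) if pos else 0
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(seeds)
        nbrs = nearest(points, s, pos, k)
        if not nbrs:
            # Only one positive exists: nothing to interpolate with, so duplicate it.
            synthetic.append(tuple(points[s]))
            continue
        far = nbrs[-1]  # ASSUMED "max distance": farthest of the k nearest positives
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(points[s], points[far]))
        )

    new_points = [tuple(points[i]) for i in kept_neg + pos] + synthetic
    new_labels = [0] * len(kept_neg) + [1] * (len(pos) + len(synthetic))
    return new_points, new_labels
```

Calling `hybrid_sample(points, labels)` on a list of coordinate tuples and 0/1 labels returns a resampled dataset in which every original positive sample is retained and the two classes have equal counts whenever positives are the minority. The resampled data could then be passed to any classifier, such as the SVM used in the paper.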