Chinese Whispers face clustering: algorithm and implementation
Chinese whispers clustering
Contents
1. Introduction to Chinese Whispers
2. Algorithm
3. Strengths and Weaknesses
4. Calling the dlib library
5. Clustering time statistics
6. Python implementation
Introduction to Chinese Whispers
Chinese Whispers is a clustering method used in network science, named after the famous whispering game.[1] Clustering methods are used to identify communities of nodes or links in a given network. This algorithm was designed by Chris Biemann and Sven Teresniak in 2005.[1] The name comes from the fact that the process can be modeled as a separation of communities in which the nodes send the same type of information to each other.[1]
Chinese Whispers is a hard partitioning, randomized, flat clustering method (there are no hierarchical relations between clusters).[1] Randomized means that running the process on the same network several times can lead to different results; hard partitioning means that a node can belong to only one cluster at any given moment. The original algorithm is applicable to undirected graphs, both weighted and unweighted. Chinese Whispers runs in linear time, which makes it extremely fast even when the number of nodes and links in the network is very high.[1]
Algorithm
The algorithm works as follows on an undirected, unweighted graph:
- Each node is assigned its own class, so the number of initial classes equals the number of nodes.
- The nodes are then visited one by one in random order. Each node moves to the class to which it is connected by the most links. In case of a tie, the class is chosen at random among the tied classes.
- Step two is repeated for a predetermined number of iterations or until the process converges. The classes that emerge at the end are the clusters of the network.
- The cap on the number of iterations is needed because the process may not converge. On the other hand, in a network with approximately 10000 nodes the clusters do not change significantly after 40-50 iterations, even without convergence.
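The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function name `chinese_whispers_unweighted`, the adjacency-dict input format and the `seed` parameter are assumptions made for the example.

```python
import random
from collections import Counter

def chinese_whispers_unweighted(adjacency, iterations=20, seed=0):
    """adjacency: dict mapping each node to a list of its neighbours (undirected).

    Hypothetical helper for illustration; node ids double as initial class ids.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}    # step 1: one class per node
    nodes = list(adjacency)
    for _ in range(iterations):                    # step 3: cap on iterations
        rng.shuffle(nodes)                         # step 2: random visiting order
        changed = False
        for node in nodes:
            counts = Counter(labels[nb] for nb in adjacency[node])
            if not counts:
                continue                           # isolated node keeps its class
            top = max(counts.values())
            tied = sorted(c for c, k in counts.items() if k == top)
            new_label = rng.choice(tied)           # random choice among ties
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:                            # converged early
            break
    return labels

# Two triangles plus one isolated node.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1],
         3: [4, 5], 4: [3, 5], 5: [3, 4],
         6: []}
print(chinese_whispers_unweighted(graph))
```

Because labels only travel along edges, the two triangles can never end up sharing a class, and the isolated node keeps its own. Which class id each triangle ends up with depends on the random order.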
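Core of the algorithm:
- Build an undirected graph in which each face is a node and the similarity between two faces is the weight of the edge between them. If the similarity between two faces is below the chosen threshold, the two corresponding nodes are not connected.
- Before iterating, give every face an id and use that id as the face's class, so that initially every face is its own class.
- In each iteration, visit the nodes in random order and update each node from its neighbors as follows:
- On the first pass every node still has its own class, so the node takes the class of the neighbor with the largest edge weight.
- From the second pass on, two or more neighbors may belong to the same class. The weights of neighbors in the same class are summed, and the node takes the class with the largest summed weight.
- Once every node has been updated, one iteration is complete. Repeat the update step until the iteration limit is reached.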
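This method clusters on a graph: each node corresponds to a face, and each edge carries the similarity between its two nodes, that is, between two faces. Classes are found by iteratively summing, per class, the similarity weights of each node's neighbors. Clustering the ms-celeb dataset with features from facenet embeddings showed that, with a suitable model and threshold, 10 iterations are enough to get fairly good results. The outcome depends mainly on the quality of the model and on the choice of threshold; during iteration the similarity is used as the edge weight.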
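As a rough illustration of the graph construction described above, the sketch below connects two embeddings only when their similarity reaches a threshold. Cosine similarity, the helper names `cosine_similarity` and `build_edges`, and the toy vectors are assumptions for this example; the text does not fix a particular similarity measure at this point.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_edges(embeddings, threshold):
    """Return (i, j, similarity) for every pair whose similarity >= threshold."""
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim >= threshold:          # below the threshold: no edge
                edges.append((i, j, sim))
    return edges

# Toy 2-D "embeddings": the first two are identical, the third is orthogonal.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(build_edges(toy, threshold=0.5))
```

The double loop is quadratic in the number of faces, so for large collections the pairwise comparison, not the clustering itself, tends to dominate the running time.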
Strengths and Weaknesses
The main strength of Chinese Whispers lies in its linear running time. Because the processing time increases linearly with the number of nodes, the algorithm can identify communities in a network very quickly. For this reason Chinese Whispers is a good tool for analyzing community structure in graphs with a very large number of nodes. The effectiveness of the method increases further if the network has the small-world property.[1]
On the other hand, because the algorithm is not deterministic, the resulting clusters often differ significantly between runs when the number of nodes is small. In a small network it matters more at which node the iteration starts, while in large networks the influence of the starting point disappears.[1] For small graphs, other clustering methods are therefore recommended.
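Calling the dlib library
The Chinese Whispers clustering algorithm is useful when you do not know how many classes there are. Its basic steps are:
- Assign every node v_i an initial class: class(v_i) = i
- Pick a node v_t at random, find all of its adjacent nodes, and score the classes those neighbors belong to. For example, if node 1 has neighbors 2, 3, 4, 5 belonging to classes a, b, c, b, and the edges 1-2, 1-3, 1-4, 1-5 all have weight 1, then class a scores 1, class b scores 2, and class c scores 1
- Assign the highest-scoring class to v_t
- Go back to step 2
The dlib code is as follows: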
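The scoring example in step 2 can be checked with a few lines of Python. The helper name `score_classes` and the dictionary layout are made up for this illustration.

```python
from collections import defaultdict

def score_classes(neighbor_classes, edge_weights):
    """Sum the edge weights of the neighbours per class (hypothetical helper)."""
    scores = defaultdict(float)
    for neighbor, cls in neighbor_classes.items():
        scores[cls] += edge_weights[neighbor]
    return dict(scores)

# Node 1 has neighbours 2, 3, 4, 5 in classes a, b, c, b; all weights are 1.
neighbor_classes = {2: "a", 3: "b", 4: "c", 5: "b"}
edge_weights = {2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}

scores = score_classes(neighbor_classes, edge_weights)
best = max(scores, key=scores.get)
print(scores, best)   # class b has the highest score, so node 1 joins class b
```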
std::vector<sample_pair> edges;
for (size_t i = 0; i < face_descriptors.size(); ++i)
{
    for (size_t j = i+1; j < face_descriptors.size(); ++j)
    {
        // Faces are connected in the graph if they are close enough.  Here we check if
        // the distance between two face descriptors is less than randis.  The dlib
        // example uses 0.6, which is the decision threshold the network was trained
        // to use, although you can certainly use any other threshold you find useful.
        if (length(face_descriptors[i]-face_descriptors[j]) < randis)
            edges.push_back(sample_pair(i,j));
    }
}
std::vector<unsigned long> labels;
const auto num_clusters = chinese_whispers(edges, labels);
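face_descriptors: the feature vectors of all faces
randis: the distance threshold
edges: the input to the algorithm, a graph given as a list of connected pairs
labels: the output, giving the cluster index of each sample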
// Function implementation
inline unsigned long chinese_whispers (
    const std::vector<ordered_sample_pair>& edges,
    std::vector<unsigned long>& labels,
    const unsigned long num_iterations,
    dlib::rand& rnd
)
{
    // make sure requires clause is not broken: the incoming edges must be sorted
    DLIB_ASSERT(is_ordered_by_index(edges),
                "\t unsigned long chinese_whispers()"
                << "\n\t Invalid inputs were given to this function"
    );
    labels.clear();
    if (edges.size() == 0)
        return 0;
    std::vector<std::pair<unsigned long, unsigned long> > neighbors;
    find_neighbor_ranges(edges, neighbors);
    // Initialize the labels, each node gets a different label.
    labels.resize(neighbors.size());
    for (unsigned long i = 0; i < labels.size(); ++i)
        labels[i] = i;
    for (unsigned long iter = 0; iter < neighbors.size()*num_iterations; ++iter)
    {
        // Pick a random node.
        const unsigned long idx = rnd.get_random_64bit_number()%neighbors.size();
        // Count how many times each label happens amongst our neighbors,
        // i.e. score the classes of the adjacent nodes by summed edge weight.
        std::map<unsigned long, double> labels_to_counts;
        const unsigned long end = neighbors[idx].second;
        for (unsigned long i = neighbors[idx].first; i != end; ++i)
        {
            labels_to_counts[labels[edges[i].index2()]] += edges[i].distance();
        }
        // find the most common label (the highest-scoring class) and assign it to the node.
        std::map<unsigned long, double>::iterator i;
        double best_score = -std::numeric_limits<double>::infinity();
        unsigned long best_label = labels[idx];
        for (i = labels_to_counts.begin(); i != labels_to_counts.end(); ++i)
        {
            if (i->second > best_score)
            {
                best_score = i->second;
                best_label = i->first;
            }
        }
        labels[idx] = best_label;
    }
    // Remap the labels into a contiguous range.  First we find the
    // mapping.  The labels found above need not be the consecutive values
    // 0,1,2,3..., so they are renumbered consecutively.
    std::map<unsigned long,unsigned long> label_remap;
    for (unsigned long i = 0; i < labels.size(); ++i)
    {
        const unsigned long next_id = label_remap.size();
        if (label_remap.count(labels[i]) == 0)
            label_remap[labels[i]] = next_id;
    }
    // now apply the mapping to all the labels.
    for (unsigned long i = 0; i < labels.size(); ++i)
    {
        labels[i] = label_remap[labels[i]];
    }
    return label_remap.size();
}
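For readers more comfortable with Python, the control flow of the function above can be paraphrased roughly as follows. This is a loose sketch rather than a port: the name `chinese_whispers_sampled`, the `(i, j, weight)` edge tuples and the `seed` argument are assumptions, and ties are broken by dictionary insertion order instead of dlib's sorted `std::map` order. The point it illustrates is that nodes are drawn at random with replacement, `num_nodes * num_iterations` times in total, and that the final labels are renumbered in order of first appearance.

```python
import random
from collections import defaultdict

def chinese_whispers_sampled(edges, num_nodes, num_iterations=100, seed=0):
    """edges: list of (i, j, weight) for an undirected graph (assumed format)."""
    rng = random.Random(seed)
    neighbors = defaultdict(list)
    for i, j, w in edges:
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))

    labels = list(range(num_nodes))            # each node starts in its own class
    for _ in range(num_nodes * num_iterations):
        idx = rng.randrange(num_nodes)         # random node, with replacement
        scores = defaultdict(float)
        for nb, w in neighbors[idx]:
            scores[labels[nb]] += w            # summed weight per neighbour class
        if scores:                             # isolated nodes keep their label
            labels[idx] = max(scores, key=scores.get)

    remap = {}                                 # renumber labels as 0, 1, 2, ...
    for label in labels:
        remap.setdefault(label, len(remap))
    return [remap[label] for label in labels], len(remap)

# Two connected pairs and one isolated node.
result = chinese_whispers_sampled([(0, 1, 1.0), (2, 3, 1.0)], num_nodes=5)
print(result)
```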
Clustering time statistics
Python implementation
import networkx as nx
from random import shuffle

# build nodes and edge lists
nodes = [
    (1, {'attr1': 1}),
    (2, {'attr1': 1}),
    # ...
]
edges = [
    (1, 2, {'weight': 0.732}),
    # ....
]
# initialize the graph
G = nx.Graph()
# Add nodes
G.add_nodes_from(nodes)
# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, _ in nodes:
    G.nodes[n]['class'] = n
# add edges
G.add_edges_from(edges)
# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(0, iterations):
    # G.nodes() is a view, so copy it into a list before shuffling
    gn = list(G.nodes())
    # I randomize the nodes to give me an arbitrary start point
    shuffle(gn)
    for node in gn:
        neighs = G[node]
        classes = {}
        # do an inventory of the given nodes neighbours and edge weights
        for ne in neighs:
            if isinstance(ne, int):
                if G.nodes[ne]['class'] in classes:
                    classes[G.nodes[ne]['class']] += G[node][ne]['weight']
                else:
                    classes[G.nodes[ne]['class']] = G[node][ne]['weight']
        # find the class with the highest edge weight sum
        # (a node without neighbours keeps its current class)
        max_weight = 0
        maxclass = G.nodes[node]['class']
        for c in classes:
            if classes[c] > max_weight:
                max_weight = classes[c]
                maxclass = c
        # set the class of target node to the winning local class
        G.nodes[node]['class'] = maxclass
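Preface
More and more methods use neural networks to extract face features, and face similarity matching keeps getting more accurate. Even so, I had not found a method that partitions the data automatically when the number of classes is unknown: clustering methods such as k-means all require the number of clusters to be fixed before clustering starts.
Through a blog post I learned about a fairly simple unsupervised method, Chinese Whispers.
The experiments are summarized below.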
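The CW algorithm
[Use case]
The number of classes is unknown; the algorithm finds the number of classes automatically and clusters quickly.
[Core of the algorithm]
Initialization
Build an undirected graph with each node as its own class. Compute the similarity between every pair of nodes; when the similarity exceeds threshold, connect the two nodes with an edge whose weight is the similarity.
Iteration
1. Pick a random node i, select the neighbor j with the largest edge weight, and assign node i to the class of node j (if several neighbors (j, k, l) belong to the same class, their weights are summed before the comparison).
2. After all nodes have been visited, repeat until the iteration limit is reached.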
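[Analysis]
1. High demands on the feature vectors
As the description shows, the algorithm is essentially an upgrade of pairwise matching, so the choice of threshold has a large influence on it.
Accuracy therefore comes back to the core requirement on the neural network: increase the inter-class distance and reduce the intra-class distance. In addition, the algorithm may give poor results when there are many classes: the more classes there are, the less separable the feature vectors become in the given space.
2. High randomness
A major weakness of the algorithm is its randomness. Each iteration starts from a randomly chosen node, so for an ambiguous node different visiting orders can put it in different classes.
For the graph in the figure above, the correct clustering is {1,2},{3,4,5}. However, because the feature vectors are not expressive enough, the assignment of node 3 is ambiguous.
If the visiting order is 1→2→3, nodes {1,2} are merged first, which makes it more likely that node 3 ends up in {1,2,3}, because {4} and {5} are still separate classes at that point.
If the visiting order is 4→5→3, nodes {4,5} are merged first, which makes it more likely that node 3 ends up in {3,4,5}, because {1} and {2} are still separate classes at that point.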
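The order dependence can be reproduced on a small hand-made graph. The edge weights below are invented for the illustration (the weights in the original figure are not given in the text): nodes 1-2 and 4-5 are strongly linked, and node 3 is linked equally weakly to all four. The helper `one_pass` performs a single deterministic pass in a given visiting order.

```python
from collections import defaultdict

def one_pass(weights, order):
    """One Chinese Whispers pass over `order`; weights maps (u, v) -> weight."""
    neighbors = defaultdict(dict)
    for (u, v), w in weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    labels = {node: node for node in neighbors}   # every node is its own class
    for node in order:
        scores = defaultdict(float)
        for nb, w in neighbors[node].items():
            scores[labels[nb]] += w               # summed weight per class
        labels[node] = max(scores, key=scores.get)
    return labels

# Assumed weights: node 3 sits exactly between the two groups.
weights = {(1, 2): 0.9, (4, 5): 0.9,
           (3, 1): 0.5, (3, 2): 0.5, (3, 4): 0.5, (3, 5): 0.5}

first = one_pass(weights, [1, 2, 3, 4, 5])    # {1,2} merge before 3 is visited
second = one_pass(weights, [4, 5, 3, 1, 2])   # {4,5} merge before 3 is visited
print(first)
print(second)
```

In the first order node 3 sees two neighbors that already share a class (summed weight 1.0) against two singleton classes (0.5 each) and joins {1,2}; in the second order the same argument pulls it into {4,5}.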
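Algorithm test
[Test data]
The dataset consists of the 19 LFW identities with the most face images; the network model is mtcnn+resnet11.
[Test procedure]
Subsets of 2, 3, 4 and 5 classes are used to test the clustering. To account for randomness, each setting is run 5 times.
[Detailed results]
The results are shown in the figures below, with misclassifications circled in red.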
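2-class
With 2 classes there are 119 samples in total: 0-76 belong to class 1 and 77-118 to class 2.
Accuracy: 100%
Note: all 5 runs gave the same result, so the repeated results are not shown.
3-class
With 3 classes there are 355 samples in total: 0-76 belong to class 1, 77-118 to class 2, and 119-354 to class 3.
Accuracy: 2 errors
Note: all 5 runs gave the same result, so the repeated results are not shown.
4-class
With 4 classes there are 476 samples in total: 0-76 belong to class 1, 77-118 to class 2, 119-354 to class 3, and 355-475 to class 4.
Accuracy: not stable. Two different outcomes occurred, see the figures below.
In the first outcome only 3 classes were found; the fourth class was wrongly merged into the second class in the figure.
In the second outcome all 4 classes were separated. Accuracy: 6 errors
5-class
With 5 classes there are 1006 samples in total: 0-76 belong to class 1, 77-118 to class 2, 119-354 to class 3, 355-475 to class 4, and 476-1005 to class 5.
Accuracy: very poor. See the figure below.
Apart from 119-354, which was clustered correctly, everything else was merged into class 0 in the figure.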
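5-class (40 images per class)
Using 5 classes with 40 images each, the outcome varies between runs:
Classes separated correctly, with a low error rate
Only 3 classes found
4 classes found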
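The error counts above were read off from the result figures. One possible way to count such errors programmatically, when the ground truth is known as index ranges, is sketched below. The counting rule is an assumption, not necessarily the one used for the numbers above: within every predicted cluster, samples that do not belong to the cluster's majority true class count as errors. The helper name `count_errors` is made up for the example.

```python
from collections import Counter

def count_errors(true_labels, predicted_labels):
    """Count samples that disagree with the majority true class of their cluster."""
    clusters = {}
    for truth, pred in zip(true_labels, predicted_labels):
        clusters.setdefault(pred, []).append(truth)
    errors = 0
    for members in clusters.values():
        majority_size = Counter(members).most_common(1)[0][1]
        errors += len(members) - majority_size
    return errors

# Ground truth as in the 2-class test: indices 0-76 are class 1, 77-118 class 2.
truth = [1] * 77 + [2] * 42
# A made-up prediction that puts the last sample of class 1 into the other cluster.
prediction = [10] * 76 + [20] * 43
print(count_errors(truth, prediction))
```

Note that this rule only penalizes mixed clusters. A run that splits one true class into two clean clusters would score zero errors, so the number of clusters found should be reported alongside it.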
[Full code]
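# -*- coding: utf-8 -*-
def face_distance(face_encodings, face_to_compare):
    """
    Compute the similarity between a group of feature vectors and one vector to compare.
    Despite the name, the active code returns cosine similarity rescaled to [0, 1];
    the Euclidean-distance variant is kept below as a comment.
    Parameters
    face_encodings: a group of feature vectors (several)
    face_to_compare: the feature vector to compare against (exactly one)
    return: list of similarities, one per vector in face_encodings
    """
    import numpy as np
    if len(face_encodings) == 0:
        return np.empty((0))
    # Distance measures with numpy:
    # http://blog.csdn.net/u013749540/article/details/51813922
    dist = []
    # Euclidean distance (disabled). Because the neighbour with the larger edge
    # weight is preferred later on, cosine similarity is used instead.
    # for i in range(0, len(face_encodings)):
    #     #sim = 1.0/(1.0+np.linalg.norm(face_encodings[i]-face_to_compare))
    #     sim = np.linalg.norm(face_encodings[i]-face_to_compare)
    #     dist.append(sim)
    # Cosine similarity
    for i in range(0, len(face_encodings)):
        num = np.dot(face_encodings[i], face_to_compare)
        cos = num / (np.linalg.norm(face_encodings[i]) * np.linalg.norm(face_to_compare))
        sim = 0.5 + 0.5 * cos  # rescale from [-1, 1] to [0, 1]
        dist.append(sim)
    return dist

def find_all_index(arr, item):
    '''Return the indices of all elements of a list equal to item.
    Input:
    arr: the list to search
    item: the element to look for
    Output:
    list of indices of the matching elements'''
    return [i for i, a in enumerate(arr) if a == item]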
def _chinese_whispers(threshold=0.675, iterations=10):
    """ Chinese Whisper Algorithm
    Outline
    1. Initialize every node as its own class
    2. Start iterating from an arbitrary node:
       pick the neighbor with the largest edge weight and put the two in the same class;
       if two or more neighbors share a class, sum their weights before comparing
    Input:
    threshold: decision threshold for whether two vectors are related
    iterations: number of iterations
    (the feature vectors are read from a hard-coded file, see feature_matrix below)
    Output:
    prints the number of clusters and the node ids in each cluster
    """
    from random import shuffle
    import networkx as nx
    import numpy as np
    # Create graph
    nodes = []
    edges = []
    # encoding_list would have the form
    # [(path1,encode1),(path2,encode2),(path3,encode3)]
    #image_paths, encodings = zip(*encoding_list)
    feature_matrix = np.loadtxt(r'F:\5.txt')
    encodings = []
    #image_paths=[]
    for i in range(0, len(feature_matrix)):
        encodings.append(feature_matrix[i, :])
        #image_paths.append(r'F:\outCluster\%d\\' %i)
    if len(encodings) <= 1:
        print("Not enough encodings to cluster!")
        return []
    # Node initialization:
    # 1. make every feature vector its own class
    # 2. compute the similarity between feature vectors and use the threshold
    #    to decide whether two nodes are joined by an edge
    for idx, face_encoding_to_check in enumerate(encodings):
        # Adding node of facial encoding
        node_id = idx
        # Node attributes:
        # node_id: node id, in (0, n-1)
        # label: node class; initially every node is its own class
        # path: export path of the node, used when exporting the sorted images
        node = (node_id, {'label': idx})
        #node = (node_id, {'label': idx, 'path': image_paths[idx]})
        nodes.append(node)
        # Facial encodings to compare
        if (idx+1) >= len(encodings):
            # Node is last element, don't create edge
            break
        # Build the group of vectors to compare against:
        # if the current vector is i, the group is [i+1:n]
        compare_encodings = encodings[idx+1:]
        distances = face_distance(compare_encodings, face_encoding_to_check)
        encoding_edges = []
        for i, distance in enumerate(distances):
            # If the face features match, add an edge between the two nodes
            if distance >= threshold:
                # edge_id: node_id of the node connected to node_id
                edge_id = idx+i+1
                encoding_edges.append((node_id, edge_id, {'weight': distance}))
        edges = edges + encoding_edges
    G = nx.Graph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges)
    # Iteration
    for _ in range(0, iterations):
        cluster_nodes = list(G.nodes())  # node ids
        shuffle(cluster_nodes)  # random visiting order
        for node in cluster_nodes:
            # All edges of the current node, e.g. node 4 with edges (4,5,weight=8)(4,8,weight=10)
            # gives G[4] == AtlasView({5:{'weight':8}, 8:{'weight':10}})
            neighbors = G[node]
            # labels has the form {label: summed_weight}
            labels = {}
            for ne in neighbors:  # ne is the id of a node adjacent to the current node
                if isinstance(ne, int):
                    # If this neighbor's class already occurred among the other
                    # neighbors, add the weight to the sum for that class.
                    if G.nodes[ne]['label'] in labels:  # G.nodes[ne]['label'] is the label of node ne
                        labels[G.nodes[ne]['label']] += G[node][ne]['weight']
                    else:
                        labels[G.nodes[ne]['label']] = G[node][ne]['weight']
            # find the class with the highest edge weight sum
            # Start from the node's current label so that a node without neighbors
            # keeps its class (starting from 0 would move every isolated node into class 0).
            edge_weight_sum = 0
            max_cluster = G.nodes[node]['label']
            for id in labels:
                if labels[id] > edge_weight_sum:
                    edge_weight_sum = labels[id]
                    max_cluster = id
            # set the class of target node to the winning local class
            #print('node %s was clustered in %s' %(node, max_cluster))
            G.nodes[node]['label'] = max_cluster
    list_label_out = []
    for i in range(len(encodings)):
        list_label_out.append(G.nodes[i]['label'])
    #print(list_label_out)
    # Summarize the result, e.g. for list_label_out=[1,3,4,2,2,4,3,1]:
    # group_all: the final class labels            group_all=[1,2,3,4]
    # group_num: the final number of classes       group_num=4
    # group_cluster: node ids sharing each label   group_cluster=[[0,7],[3,4],[1,6],[2,5]]
    group_all = set(list_label_out)
    group_num = len(group_all)
    group_cluster = []
    for item in group_all:
        group_cluster.append(find_all_index(list_label_out, item))
    print('Final number of clusters: %s' % group_num)
    for i in range(0, group_num):
        print('Cluster %d: %s' % (i, group_cluster[i]))

if __name__ == '__main__':
    _chinese_whispers()