Emotion Recognition Models in Python: From Principles to Practice
Abstract
Emotion recognition, an important branch of natural language processing (NLP), is widely used in human-computer interaction, social media analysis, customer service, and other scenarios. This article systematically introduces how to build an emotion recognition model in Python, covering the key stages of data preprocessing, feature extraction, model selection, training and evaluation, and deployment. Complete code examples help readers quickly master the development workflow.
1. Introduction
Emotion recognition aims to automatically identify human emotional states from text, speech, images, and other modalities. Text-based emotion recognition has attracted particular attention because its data is easy to obtain and its applications are rich. Python, with its extensive NLP ecosystem (NLTK, Transformers, scikit-learn, and more), is the language of choice for implementing it.
This article focuses on text emotion recognition and builds a model that can identify the basic emotions of anger, joy, sadness, fear, surprise, and disgust.
2. Technical Pipeline Overview
text
Data collection → Text cleaning → Feature extraction → Model training → Evaluation & tuning → Deployment
3. Data Preparation
3.1 Common Datasets
| Dataset | Language | Classes | Size |
|---|---|---|---|
| ISEAR | English | 7 | ~7,600 samples |
| GoEmotions | English | 27 | 58,000 samples |
| Chinese emotion corpus | Chinese | 6 | ~12,000 samples |
| ChnSentiCorp | Chinese | 2-4 | 10,000 samples |
3.2 Data Loading Example
python
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the dataset (expected columns: text, label)
data = pd.read_csv('emotion_dataset.csv')
# Label mapping: 0-anger, 1-joy, 2-sadness, 3-fear, 4-surprise, 5-disgust
print(data['label'].value_counts())
# Split into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data['text'], data['label'], test_size=0.2, random_state=42, stratify=data['label']
)
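The `stratify=data['label']` argument keeps the class proportions of the training and test splits aligned, which matters for imbalanced emotion data. A minimal stdlib-only sketch of the idea behind stratified splitting (hypothetical `stratified_split` helper; no shuffling, unlike the real `train_test_split`):

```python
from collections import Counter, defaultdict

def stratified_split(labels, test_ratio=0.2):
    """Group indices by label, then take the last test_ratio share of each group as the test set."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train_idx, test_idx = [], []
    for y, idxs in by_label.items():
        cut = int(len(idxs) * (1 - test_ratio))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Toy data: 10 samples, labels balanced 5/5
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))  # Counter({0: 4, 1: 4})
print(Counter(labels[i] for i in test_idx))   # Counter({0: 1, 1: 1})
```

Each class contributes the same 80/20 share, so the label distribution of both splits matches the full dataset.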
4. Text Preprocessing
python
import re
import jieba  # Chinese word segmentation
from nltk.corpus import stopwords  # run nltk.download('stopwords') once beforehand

def preprocess_chinese(text):
    # 1. Keep Chinese characters only (adjust the pattern if letters/digits should be kept)
    text = re.sub(r'[^\u4e00-\u9fa5]', '', text)
    # 2. Word segmentation
    words = jieba.lcut(text)
    # 3. Remove stopwords and single characters
    stop_words = set(stopwords.words('chinese'))
    words = [w for w in words if w not in stop_words and len(w) > 1]
    return ' '.join(words)

# Apply preprocessing to both splits (the cleaned test set is reused in later sections)
train_texts_clean = [preprocess_chinese(t) for t in train_texts]
test_texts_clean = [preprocess_chinese(t) for t in test_texts]
5. Feature Extraction Methods
5.1 TF-IDF (Traditional Approach)
python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(train_texts_clean)
X_test_tfidf = tfidf.transform(test_texts_clean)
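For reference, `TfidfVectorizer` weights each term by term frequency times a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1 by default (rows are then L2-normalized). A quick sanity check of that formula:

```python
import math

# scikit-learn's default (smooth_idf=True) IDF: ln((1 + n_docs) / (1 + df)) + 1
def smooth_idf(n_docs, df):
    return math.log((1 + n_docs) / (1 + df)) + 1

# A term appearing in 1 of 2 documents is up-weighted; one in every document is not:
print(round(smooth_idf(2, 1), 4))  # 1.4055
print(round(smooth_idf(2, 2), 4))  # 1.0
```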
5.2 Word2Vec + Mean Pooling
python
from gensim.models import Word2Vec
import numpy as np
# Train Word2Vec from scratch (or load a pretrained model)
sentences = [text.split() for text in train_texts_clean]
w2v_model = Word2Vec(sentences, vector_size=128, window=5, min_count=2, workers=4)
def text_to_vector(text, model, vector_size):
    words = text.split()
    vectors = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)
X_train_w2v = np.array([text_to_vector(t, w2v_model, 128) for t in train_texts_clean])
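The averaging step reduces a variable-length sentence to one fixed-size vector and silently skips out-of-vocabulary words. A toy stdlib-only sketch of the same arithmetic, where `toy_wv` stands in for `w2v_model.wv`:

```python
# Toy "embedding table": word -> 3-dim vector (stand-in for a trained w2v_model.wv)
toy_wv = {
    '今天': [3.0, 2.0, 0.0],
    '开心': [1.0, 0.0, 2.0],
}

def average_vector(words, wv, dim=3):
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return [0.0] * dim  # sentence of only OOV words -> zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(average_vector(['今天', '开心', '未登录词'], toy_wv))  # [2.0, 1.0, 1.0]
```

The OOV word contributes nothing; the result is the element-wise mean of the two known vectors. Mean pooling discards word order, which is one reason sequence models (Section 6.2) and BERT (Section 6.3) tend to do better.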
5.3 Pretrained Models (BERT, etc.)
python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
bert_model = AutoModel.from_pretrained('bert-base-chinese')
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128, padding=True)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # [CLS] token embedding
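Taking the [CLS] vector is one common choice; mean pooling over the non-padding token states is a popular alternative sentence embedding. A toy sketch of the pooling arithmetic, with plain nested lists standing in for `last_hidden_state` and `attention_mask` tensors:

```python
# hidden: (seq_len, dim) token vectors; mask: 1 for real tokens, 0 for padding
def masked_mean_pool(hidden, mask):
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    return [sum(col) / len(kept) for col in zip(*kept)]

hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is a padding token
print(masked_mean_pool(hidden, [1, 1, 0]))      # [2.0, 3.0]
```

Excluding padding positions matters: averaging over all rows here would drag both coordinates toward zero.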
6. Model Building and Training
6.1 Traditional Machine Learning (Logistic Regression)
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
# Use the TF-IDF features
lr_model = LogisticRegression(max_iter=1000)  # note: the multi_class argument is deprecated in recent scikit-learn
lr_model.fit(X_train_tfidf, train_labels)
y_pred = lr_model.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(test_labels, y_pred):.4f}")
print(classification_report(test_labels, y_pred, target_names=['anger', 'joy', 'sadness', 'fear', 'surprise', 'disgust']))
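`classification_report` derives precision and recall per class from the true/predicted label pairs. The underlying arithmetic on a toy two-class example (hypothetical `precision_recall` helper):

```python
# Precision: of the samples predicted as cls, how many truly are cls?
# Recall: of the samples truly cls, how many were predicted as cls?
def precision_recall(trues, preds, cls):
    tp = sum(1 for t, p in zip(trues, preds) if t == cls and p == cls)
    precision = tp / sum(1 for p in preds if p == cls)
    recall = tp / sum(1 for t in trues if t == cls)
    return precision, recall

toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]
print(precision_recall(toy_true, toy_pred, 1))  # both 2/3: 2 TPs, 3 predicted, 3 actual
```

For skewed emotion distributions these per-class numbers are far more informative than accuracy alone.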
6.2 Deep Learning (TextCNN)
python
import torch
import torch.nn as nn
import torch.nn.functional as F
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, (k, embed_dim)) for k in filter_sizes
        ])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(len(filter_sizes) * num_filters, num_classes)

    def forward(self, x):
        x = self.embedding(x)  # (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)     # (batch, 1, seq_len, embed_dim)
        conv_outs = []
        for conv in self.convs:
            conv_out = F.relu(conv(x)).squeeze(3)  # (batch, num_filters, seq_len - k + 1)
            pool_out = F.max_pool1d(conv_out, conv_out.size(2)).squeeze(2)
            conv_outs.append(pool_out)
        x = torch.cat(conv_outs, dim=1)
        x = self.dropout(x)
        return self.fc(x)
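The shapes annotated in `forward` follow from the convolution arithmetic: a kernel of height k sliding over a length-`seq_len` sequence yields `seq_len - k + 1` positions, and max-pooling then keeps one value per filter, so the classifier input is `len(filter_sizes) * num_filters` regardless of sequence length:

```python
# Output length of a height-k convolution over a length-seq_len sequence (stride 1, no padding)
def conv_out_len(seq_len, k):
    return seq_len - k + 1

filter_sizes, num_filters, seq_len = [2, 3, 4], 100, 50
print([conv_out_len(seq_len, k) for k in filter_sizes])  # [49, 48, 47]
print(len(filter_sizes) * num_filters)                   # 300 -> input dim of self.fc
```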
# Training loop omitted; iterate over a DataLoader of (token_ids, label) batches
6.3 Fine-tuning BERT (Recommended)
python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=6)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
)
# Convert the texts into BERT input format
def tokenize_function(texts):
    return tokenizer(texts, padding=True, truncation=True, max_length=128)
train_encodings = tokenize_function(train_texts_clean)
test_encodings = tokenize_function(test_texts_clean)
# Create a Dataset class (code omitted)
# trainer = Trainer(...)
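The omitted Dataset class is typically a thin wrapper that pairs each encoding with its label. A sketch of the pattern with plain Python values; the real version would subclass `torch.utils.data.Dataset` and wrap each value in `torch.tensor` before handing it to `Trainer`:

```python
class EmotionDataset:
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists: input_ids, attention_mask, ...
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One training example: every encoding field at idx, plus its label
        item = {k: v[idx] for k, v in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

enc = {'input_ids': [[101, 102], [101, 103]], 'attention_mask': [[1, 1], [1, 1]]}
ds = EmotionDataset(enc, [0, 1])
print(len(ds), ds[1]['labels'])  # 2 1
```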
7. Model Evaluation
7.1 Evaluation Metrics
python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Compute and plot the confusion matrix
cm = confusion_matrix(test_labels, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['anger', 'joy', 'sadness', 'fear', 'surprise', 'disgust'])
disp.plot(cmap='Blues')
plt.show()
7.2 Per-Class Performance Analysis
Typical performance of an emotion recognition model (after BERT fine-tuning):
| Emotion | Precision | Recall | F1 |
|---|---|---|---|
| Anger | 0.85 | 0.83 | 0.84 |
| Joy | 0.90 | 0.92 | 0.91 |
| Sadness | 0.82 | 0.84 | 0.83 |
| Fear | 0.78 | 0.75 | 0.76 |
| Surprise | 0.81 | 0.79 | 0.80 |
| Disgust | 0.79 | 0.77 | 0.78 |
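The macro-averaged F1 implied by this table is simply the unweighted mean of the six per-class scores:

```python
# Per-class F1 scores from the table above, in row order
f1_scores = [0.84, 0.91, 0.83, 0.76, 0.80, 0.78]
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 2))  # 0.82
```

Macro averaging treats every emotion equally, so the weaker classes (fear, disgust) pull the score down even if they are rare in the test set.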
8. Model Deployment
8.1 Deployment with FastAPI
python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
app = FastAPI()
# Load the trained model and vectorizer
model = joblib.load('emotion_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')
label_map = {0: 'anger', 1: 'joy', 2: 'sadness', 3: 'fear', 4: 'surprise', 5: 'disgust'}

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict_emotion(request: TextRequest):
    # Preprocess
    cleaned = preprocess_chinese(request.text)
    # Extract features
    features = vectorizer.transform([cleaned])
    # Predict
    pred_label = model.predict(features)[0]
    pred_proba = model.predict_proba(features)[0].max()
    return {
        "text": request.text,
        "emotion": label_map[pred_label],
        "confidence": float(pred_proba)
    }

@app.get("/health")
async def health_check():
    return {"status": "ok"}

# Start with: uvicorn main:app --reload
8.2 Testing the API
bash
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text":"今天天气真好,心情特别愉快!"}'
Example response:
json
{
  "text": "今天天气真好,心情特别愉快!",
  "emotion": "joy",
  "confidence": 0.96
}
9. Optimization Tips
| Direction | Method | Expected gain |
|---|---|---|
| Data augmentation | Back-translation, synonym replacement, random deletion | +2~5% F1 |
| Feature fusion | TF-IDF + BERT embeddings | +1~3% accuracy |
| Model ensembling | Voting / weighted fusion of multiple models | +3~6% accuracy |
| Post-processing | Threshold tuning, sequence smoothing | More stable outputs |
| Class imbalance | Focal Loss, over-/under-sampling | Better minority-class performance |
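As an illustration of the class-imbalance row: focal loss down-weights confidently classified examples so training focuses on hard and minority-class samples. A single-sample sketch of the formula from Lin et al. (`p_t` is the predicted probability of the true class):

```python
import math

# Focal loss: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
# gamma > 0 shrinks the loss of easy (high-p_t) examples toward zero
def focal_loss(p_t, gamma=2.0, alpha=1.0):
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction is nearly ignored; an uncertain one still contributes:
print(round(focal_loss(0.9), 4))  # 0.0011
print(round(focal_loss(0.3), 4))  # 0.5899
```

With gamma=0 this reduces to ordinary cross-entropy, so gamma directly controls how aggressively easy examples are discounted.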
10. Complete Project Structure
text
emotion_recognition/
├── data/
│   ├── raw/
│   └── processed/
├── models/
│   ├── bert_emotion/
│   └── saved_models/
├── notebooks/
│   └── exploration.ipynb
├── src/
│   ├── preprocess.py
│   ├── features.py
│   ├── train.py
│   ├── evaluate.py
│   └── predict.py
├── api/
│   └── main.py
├── requirements.txt
└── README.md
11. Summary and Outlook
This article walked through the full workflow of building an emotion recognition model in Python, from data preprocessing to deployment, covering both traditional machine learning and deep learning methods. In practice, fine-tuning a pretrained model (BERT, RoBERTa, ERNIE, etc.) is the recommended first choice, as it usually yields the best results.
Future directions:
- Multimodal emotion recognition: fusing text, speech, and facial expressions
- Fine-grained emotion analysis: recognizing more complex, compound emotions
- Conversational emotion tracking: modeling how emotions evolve with dialogue context
- Low-resource settings: few-shot learning and cross-lingual transfer
References
- Devlin J, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
- Kim Y. Convolutional Neural Networks for Sentence Classification. EMNLP 2014.
- Demszky D, et al. GoEmotions: A Dataset of Fine-Grained Emotions. ACL 2020.