[sklearn] Training an LDA Topic Model with sklearn, with a Detailed Guide to Tuning

冰糖少女 · Published 2017-07-31 15:50:22

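Life is short, I love Python, and I love sklearn most of all. Besides the basic machine-learning interfaces for preprocessing, feature extraction and selection, classification, and clustering, sklearn also provides interfaces to several commonly used language models, and sklearn.decomposition.LatentDirichletAllocation is one of them. This post covers the basic parameters of the LDA model and how to train it, and then offers a few workable strategies for tuning LDA, for your reference and discussion. To keep the post short, I skip the derivation of how LDA works; if you want to study it, head over to LDA数学八卦 ("LDA Math Gossip") for a deep dive. You will definitely get a lot out of it!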

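Training and Tuning the LDA Topic Model

(1) Loading the corpus and preprocessing

The corpus used in this post is 20newsgroups, which ships with sklearn's dataset API. It contains newsgroup posts from many areas, such as commerce, technology, sports, and aerospace, and it suits NLP beginners well. sklearn_20newsgroups gives a very detailed introduction.
For preprocessing, I call NLTK directly to lowercase, tokenize, remove stop words, filter by POS tag, and stem. Which of these steps to apply depends entirely on your needs and your data; I often skip stemming or POS filtering myself (usually because the results get worse ==)... The code below loads the 20newsgroups data and preprocesses the text.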

# Load the data
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000  # number of posts to keep
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]

# Text preprocessing (every step is optional)
# Requires the NLTK tokenizer, stop-word, and POS-tagger data (see nltk.download()).
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build these once instead of once per word or per document
STOP_WORDS = set(stopwords.words('english'))
ps = PorterStemmer()

def textPrecessing(text):
    # Lowercase
    text = text.lower()
    # Replace punctuation with spaces
    for c in string.punctuation:
        text = text.replace(c, ' ')
    # Tokenize
    wordLst = nltk.word_tokenize(text)
    # Remove stop words
    filtered = [w for w in wordLst if w not in STOP_WORDS]
    # Keep only nouns (or whichever POS tags you need)
    refiltered = nltk.pos_tag(filtered)
    filtered = [w for w, pos in refiltered if pos.startswith('NN')]
    # Stem
    filtered = [ps.stem(w) for w in filtered]

    return " ".join(filtered)

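The code above does not take long to run, but only because I shuffled the data (shuffle=True) and kept just n_samples=2000 posts. With a larger corpus, preprocessing usually takes noticeably longer. So if the text data does not change, it is best to save the preprocessed result, so that later runs only need to read it back from a file.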

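# Run this part only the first time to preprocess the text; comment it out afterwards.
# textPre_FilePath is the path of the cache file; set it yourself.
docLst = []
for desc in data_samples:
    docLst.append(textPrecessing(desc))
with open(textPre_FilePath, 'w', encoding='utf-8') as f:
    for line in docLst:
        f.write(line + '\n')

#==============================================================================
# From the second run on, read the preprocessed docLst straight from the file
# and comment out the loading and preprocessing code above.
#docLst = []
#with open(textPre_FilePath, 'r', encoding='utf-8') as f:
#    for line in f:
#        # keep one entry per line so documents stay aligned with data_samples
#        docLst.append(line.strip())
#==============================================================================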

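I printed two arbitrary 20newsgroups posts along with their preprocessed versions. POS filtering and stemming were skipped for this run so that the result is easier to read.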

Output:
Original 20Newsgroups Articles: [u"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
 u'\nJames Hogan writes:\n\ntimmbake@mcl.ucsb.edu (Bake Timmons) writes:\n>>Jim Hogan quips:\n\n>>... (summary of Jim\'s stuff)\n\n>>Jim, I\'m afraid _you\'ve_ missed the point.\n\n>>>Thus, I think you\'ll have to admit that  atheists have a lot\n>>more up their sleeve than you might have suspected.\n\n>>Nah.  I will encourage people to learn about atheism to see how little atheists\n>>have up their sleeves.  Whatever I might have suspected is actually quite\n>>meager.  If you want I\'ll send them your address to learn less about your\n>>faith.\n\n>Faith?\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n>>>Fine, but why do these people shoot themselves in the foot and mock\n>>>the idea of a God?  ....\n\n>>>I hope you understand now.\n\n>>Yes, Jim.  I do understand now.  Thank you for providing some healthy sarcasm\n>>that would have dispelled any sympathies I would have had for your faith.\n\n>Bake,\n\n>Real glad you detected the sarcasm angle, but am really bummin\' that\n>I won\'t be getting any of your sympathy.  Still, if your inclined\n>to have sympathy for somebody\'s *faith*, you might try one of the\n>religion newsgroups.\n\n>Just be careful over there, though. (make believe I\'m\n>whispering in your ear here)  They\'re all delusional!\n\nJim,\n\nSorry I can\'t pity you, Jim.  And I\'m sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won\'t be bummin\' so much?\n\n>Good job, Jim.\n>.\n\n>Bye, Bake.\n\n\n>>[more slim-Jim (tm) deleted]\n\n>Bye, Bake!\n>Bye, Bye!\n\nBye-Bye, Big Jim.  Don\'t forget your Flintstone\'s Chewables!  :) \n--\nBake Timmons, III\n\n-- "...there\'s nothing higher, stronger, more wholesome and more useful in life\nthan some good memory..." 
-- Alyosha in Brothers Karamazov (Dostoevsky)\n']

Articles After Preprocessing: [u'well sure story nad seem biased disagree statement u media ruin israels reputation rediculous u media pro israeli media world lived europe realize incidences one described letter occured u media whole seem try ignore u subsidizing israels existance europeans least degree think might reason report clearly atrocities shame austria daily reports inhuman acts commited israeli soldiers blessing received government makes holocaust guilt go away look jews treating races got power unfortunate',
 u'james hogan writes timmbake mcl ucsb edu bake timmons writes jim hogan quips summary jim stuff jim afraid missed point thus think admit atheists lot sleeve might suspected nah encourage people learn atheism see little atheists sleeves whatever might suspected actually quite meager want send address learn less faith faith yeah expect people read faq etc actually accept hard atheism need little leap faith jimmy logic runs steam fine people shoot foot mock idea god hope understand yes jim understand thank providing healthy sarcasm would dispelled sympathies would faith bake real glad detected sarcasm angle really bummin getting sympathy still inclined sympathy somebody faith might try one religion newsgroups careful though make believe whispering ear delusional jim sorry pity jim sorry feelings denial faith need get oh well pretend end happily ever anyway maybe start new newsgroup alt atheist hard bummin much good job jim bye bake slim jim tm deleted bye bake bye bye bye bye big jim forget flintstone chewables bake timmons iii nothing higher stronger wholesome useful life good memory alyosha brothers karamazov dostoevsky']

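(2) Counting term frequencies with CountVectorizer

LDA is not trained on raw documents but on a document-word matrix, which can be either a dense array or a sparse matrix of shape n_samples × n_features, where n_features is the number of terms. So before training the LDA topic model, we first use CountVectorizer to count term frequencies and save the fitted vectorizer: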

from sklearn.feature_extraction.text import CountVectorizer
import joblib  # older scikit-learn: from sklearn.externals import joblib; pickle works too

n_features = 2500  # size of the vocabulary to keep

# Build the term-count vectorizer and save it; run only the first time.
# tf_ModelPath is the path of the saved vectorizer; set it yourself.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docLst)
joblib.dump(tf_vectorizer, tf_ModelPath)
#==============================================================================
# # Load the saved tf_vectorizer to skip refitting
# tf_vectorizer = joblib.load(tf_ModelPath)
# # transform, not fit_transform: the vocabulary is already fitted
# tf = tf_vectorizer.transform(docLst)
#==============================================================================

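See the sklearn documentation for the CountVectorizer API. The code above requires each term to appear in at least 2 documents (min_df=2) and in no more than 95% of them (max_df=0.95), and keeps the n_features=2500 most frequent terms as features. The fitted tf_vectorizer is saved to a file with joblib, so later runs can load it instead of fitting it again. The resulting tf is a sparse document-term matrix; tf_vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later) returns the term behind each feature dimension.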
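To make the min_df, max_df, and max_features filters concrete, here is a minimal pure-Python sketch of the counting step. It is not sklearn's implementation: the function name build_doc_term_matrix and the four toy documents are made up for illustration, tokenization is a plain whitespace split, and the matrix is dense instead of sparse.

```python
from collections import Counter

def build_doc_term_matrix(docs, min_df=2, max_df=0.95, max_features=None):
    """Toy counting step: returns (vocabulary, dense count matrix)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency (number of documents containing the term)
    # and total count of each term across the corpus.
    doc_freq = Counter()
    total_count = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
        total_count.update(tokens)

    # Keep terms found in at least min_df documents and in at most
    # a max_df proportion of all documents.
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]

    # Optionally keep only the most frequent terms across the corpus.
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-total_count[t], t))[:max_features]

    vocabulary = sorted(kept)
    index = {term: j for j, term in enumerate(vocabulary)}

    # One row per document, one column per kept term.
    matrix = [[0] * len(vocabulary) for _ in range(n_docs)]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            if t in index:
                matrix[i][index[t]] += 1
    return vocabulary, matrix

# Hypothetical toy corpus
docs = ["space nasa launch space",
        "nasa orbit launch",
        "car engine car",
        "engine launch orbit"]
vocabulary, matrix = build_doc_term_matrix(docs, min_df=2, max_df=0.95)
print(vocabulary)
for row in matrix:
    print(row)
```

"space" and "car" each occur twice in total but in only one document, so min_df=2 drops them: the filter works on document frequency, not on raw counts.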

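(3) Training the LDA topic model

At last, the key stage: training the LDA topic model. It may be the most important step, but if the data quality is high and no corners were cut earlier, it mostly falls into place; otherwise, problems from the earlier steps pile up and surface here all at once. Two prerequisites for a good topic model are data quality and text preprocessing. A quick plug for two preprocessing packages that are pleasant to use: jieba for Chinese, spaCy for English. I used nltk above only because spaCy refused to install on this machine, sigh.
OK, back on topic. The LDA training code follows; for the parameters, see the appendix on the sklearn LDA API at the end of this post.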

from sklearn.decomposition import LatentDirichletAllocation
n_topics = 30
# The keyword is n_components in scikit-learn 0.19+ (it was n_topics in older releases)
lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=50,
                                learning_method='batch')
lda.fit(tf)  # tf is the sparse document-word matrix

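(4) Results

How long LDA takes to train varies a lot with max_iter and with how quickly the data converges. With max_iter set to a few dozen, as in this test, training usually finishes quickly; for real use, I suggest at least a thousand or so iterations.

Top words per topic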

def print_top_words(model, feature_names, n_top_words):
    # Print the highest-weighted terms of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so slice from the end to get the largest weights first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print("")
    # Print the topic-word matrix
    print(model.components_)

n_top_words = 20
# get_feature_names() in scikit-learn < 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Output:
# Highest-weighted words of each topic
Topic #0:
mail edu thanks new send email 00 com internet interested info uk price ac know sale fax copy data following
Topic #1:
gm win rochester edu michael new fred vs adams tommy gov nick gb main hudson issue alaska nasa space people
Topic #2:
55 10 11 18 21 17 13 19 16 period 22 23 14 20 25 15 24 12 93 26
Topic #3:
color server motif software input output edu support clock 256 bits linux vga shots default mode level using image xterm
Topic #4:
edu writes article com know like uiuc cc news cs people cso opinions think david really way right heard sure
Topic #5:
section military shall dangerous firearm weapon law person state license use means following women designed islamic japanese division men issued
Topic #6:
like know time good bike com really writes course year ride going think got read live years better big high
Topic #7:
com edu writes article list andrew apple cmu cs sandvik points toronto ca kent vancouver sphere power point portal cup
Topic #8:
know ca black use white edu think writes light like signal right old used dave bnr want mouse led let
Topic #9:
drive disk drives hard controller rom card bios floppy flyers 16 feature supports board speed bus interface power mb data
Topic #10:
people government think president american weapons country clinton mr support time billion make new say like going state states jobs
Topic #11:
edu insurance hp writes article like offer cable best turbo use port power se speed hd good 25 swap year
Topic #12:
food edu msg writes article standard frank use objective red blues people bear cs area values begin like wings rick
Topic #13:
earth probe moon lunar orbit mission surface mars space spacecraft venus solar jupiter science atmosphere planet planetary images data pioneer
Topic #14:
edu com want good dog writes buy dod sold question dealer article water nec large make used chris audio hp
Topic #15:
israel jews israeli arab jewish attacks state peace people land policy lebanese arabs right say nazi writes men fact soldiers
Topic #16:
com gun writes guns article crime 000 self edu likely isc stratus make texas fbi government way br steve defense
Topic #17:
scsi bit mac 32 tv fast ide cards ibm chip 16 set difference better bytes fpu faster computer use piece
Topic #18:
edu ftp version pc contact machines available type pub au comments mit anonymous sun mac program unix math looking written
Topic #19:
car cars turkish engine greek oil tires speed turks brake miles greeks 000 better new brakes good dot tire wheel
Topic #20:
god people think jesus edu believe say bible way good know christian point life like church law time faith says
Topic #21:
use using key number time like want used problem idea need know serial example code data traffic application keys case
Topic #22:
university april science 1993 research disease program health information new study medicine power energy computer papers time process development conference
Topic #23:
space years nasa gov new year launch 10 sci pitt gay shuttle km 15 article medical titan soon high 1990
Topic #24:
people said went know going time children think like came home killed happened took armenians come got told away dead
Topic #25:
graphics image mail pub edu aids ray 128 files package mil images 3d send sgi computer systems archive gov format
Topic #26:
windows file problem use edu window thanks files help card know dos like monitor using memory work video program need
Topic #27:
game team play year players season think games hockey player win cubs teams better good baseball ca fan leafs league
Topic #28:
writes com edu article atheism bob jim tek word rights used people news case keith alt said term time given
Topic #29:
government key encryption chip clipper public use keys law people enforcement private nsa security like secure phone com think care

# Topic-word matrix
array([[  1.00377390e+02,   3.33333333e-02,   3.33333333e-02, ...,
          3.33333333e-02,   3.33333333e-02,   3.33333333e-02],
       [  3.33333333e-02,   3.33333333e-02,   3.33333333e-02, ...,
          3.33333333e-02,   3.33333333e-02,   3.33333333e-02],
       [  1.13445534e+01,   3.33333333e-02,   1.31402890e+01, ...,
          3.33333333e-02,   3.33333333e-02,   3.33333333e-02],
       ...,
       [  3.33333333e-02,   3.33333333e-02,   3.33333333e-02, ...,
          3.33333333e-02,   9.23349606e+00,   3.33333333e-02],
       [  3.33333333e-02,   3.33333333e-02,   3.33333333e-02, ...,
          3.33333333e-02,   3.33333333e-02,   3.33333333e-02],
       [  3.33333333e-02,   3.33333333e-02,   3.33333333e-02, ...,
          3.33333333e-02,   3.33333333e-02,   3.33333333e-02]])

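A quick look at the top words of each topic shows they are mostly sensible: education-related words cluster together, machinery-related words cluster together, and so on. There are still problems, of course. Training has not run long enough, and because I skipped stemming, both "car" and "cars" show up in Topic #19. Avoid this in your own training.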
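The entries of the topic-word matrix printed above are pseudo-counts, not probabilities: most of them sit at the prior value 1/30 ≈ 0.0333, and only terms actually assigned to a topic get large values. The sketch below, in plain Python, shows the two things done with such a matrix: ranking terms by weight, and dividing each row by its sum to get a word distribution per topic. The helper names and the tiny 2-topic, 5-term matrix are made up for illustration.

```python
def top_words(topic_weights, feature_names, n_top_words):
    """Return the n_top_words terms with the largest weights, highest first."""
    order = sorted(range(len(topic_weights)),
                   key=lambda i: topic_weights[i], reverse=True)
    return [feature_names[i] for i in order[:n_top_words]]

def normalize(row):
    """Scale a row of non-negative pseudo-counts so that it sums to 1."""
    total = float(sum(row))
    return [v / total for v in row]

# Hypothetical vocabulary and topic-word pseudo-counts (2 topics, 5 terms)
feature_names = ["car", "engine", "god", "jesus", "tire"]
components = [[9.5, 4.5, 0.5, 0.5, 5.0],
              [0.5, 1.0, 12.0, 6.5, 0.5]]

for k, row in enumerate(components):
    print("Topic #%d: %s" % (k, " ".join(top_words(row, feature_names, 3))))

# Each normalized row is a probability distribution over the vocabulary.
topic_word_dist = [normalize(row) for row in components]
```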

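Document-topic results

A major goal of training LDA is to analyze the topic distribution of a document; that is where the model delivers most of its value. The call that uses the trained model to convert documents into topic distributions, and its result, are shown below: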

doc_topic_dist = lda.transform(tf)

Output:
array([[  0.03333333,   0.03333333,   0.03333333, ...,   0.03333333,
          0.03333333,   0.03333333],
       [  0.03333333,   0.03333333,   0.03333333, ...,   1.9426311 ,
         26.11962169,   0.03333333],
       [  0.03333333,   0.03333333,   0.03333333, ...,   0.03333333,
          0.03333333,   0.03333333],
       ...,
       [  0.03333333,   0.03333333,  15.99360499, ...,   0.03333333,
          0.03333333,   0.03333333],
       [  0.03333333,   0.03333333,   0.03333333, ...,   0.03333333,
          0.03333333,   0.03333333],
       [  0.03333333,   0.03333333,   0.03333333, ...,  13.36262244,
          0.03333333,   0.03333333]])

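Earlier I showed two sample posts; their main topics come out as topic #12 and topic #20. Judge for yourself how well that works. OK, the result may not be great. There are many possible reasons: the parameters have not been tuned yet, and to save time I left stemming and POS filtering out of preprocessing. Just add them back in.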
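The rows of the array printed above are not normalized (they contain values such as 26.1 next to the prior value 0.0333), so reading off a document's main topics takes one extra step: divide each row by its sum, then rank the topics. A minimal sketch in plain Python follows; the function name dominant_topics and the 5-topic example row are invented for illustration and are not taken from the actual run.

```python
def dominant_topics(doc_topic_row, n_top=3):
    """Normalize one document's topic weights and return the n_top
    (topic index, probability) pairs, most probable first."""
    total = float(sum(doc_topic_row))
    probs = [v / total for v in doc_topic_row]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_top]

# Hypothetical unnormalized weights of one document over 5 topics
row = [0.0333, 0.0333, 1.9426, 26.1196, 0.0333]
for topic_idx, prob in dominant_topics(row, n_top=2):
    print("topic #%d: %.3f" % (topic_idx, prob))
```

If the installed sklearn version already returns rows that sum to 1, the normalization is a no-op and only the ranking matters.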

Convergence (perplexity)

Calling lda.perplexity(X) gives the perplexity of the current model. sklearn defines perplexity as exp(-1. * log-likelihood per word).

lda.perplexity(tf)

Output: 
1270.5358245980792

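This run used few iterations and the model has not converged yet, so the perplexity is clearly high. Tuning the parameters should give a more reliable model.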
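To get a feel for the number, here is the definition exp(-1. * log-likelihood per word) worked through in plain Python for a deliberately dumb baseline: a model that gives every one of the 2500 vocabulary terms the same probability. The corpus size of 10000 tokens is an arbitrary assumption, and sklearn's own value is an approximation based on a variational bound, so this is only an illustration of the formula.

```python
import math

def perplexity(log_likelihood, n_words):
    """exp(-1 * log-likelihood per word)."""
    return math.exp(-1.0 * log_likelihood / n_words)

vocab_size = 2500   # n_features used in this post
n_words = 10000     # assumed number of tokens in the corpus

# A uniform model gives each token probability 1 / vocab_size,
# so the corpus log-likelihood is n_words * log(1 / vocab_size).
uniform_log_likelihood = n_words * math.log(1.0 / vocab_size)

baseline = perplexity(uniform_log_likelihood, n_words)
print("uniform baseline perplexity: %.1f" % baseline)
```

The uniform baseline comes out at the vocabulary size, about 2500, whatever the corpus length. Against that yardstick the 1270.5 measured above beats a uniform guess, but it still leaves plenty of room for improvement.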

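(5) (Optional) Tuning

Tunable parameters

n_topics: the number of topics (named n_components in scikit-learn 0.19 and later)
n_features: the number of features, i.e. how many of the most frequent terms to keep
doc_topic_prior: the parameter α of the Dirichlet prior on the document-topic distribution θd
topic_word_prior: the parameter η of the Dirichlet prior on the topic-word distribution βk
learning_method: the algorithm used to fit LDA, either 'batch' or 'online'
Other parameters provided by sklearn: depending on the fitting algorithm, a few more parameters can be tuned; see the appendix on the sklearn LDA API at the end of this post.

Two workable tuning approaches

1. Taking n_topics as an example, pick the best model by perplexity. Of course, perplexity values computed with different numbers of topics are not strictly comparable, so perplexity can only serve as a reference; the number of topics still has to be chosen by judgment according to your actual needs. The n_topics tuning code is as follows: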

import os
from time import time
import matplotlib.pyplot as plt

n_topics = list(range(20, 75, 5))
perplexityLst = [1.0]*len(n_topics)

# Train one LDA model per topic count and print the training time
lda_models = []
for idx, n_topic in enumerate(n_topics):
    lda = LatentDirichletAllocation(n_components=n_topic,  # n_topics before scikit-learn 0.19
                                    max_iter=20,
                                    learning_method='batch',
                                    evaluate_every=200,
#                                    perp_tol=0.1, #default
#                                    doc_topic_prior=1.0/n_topic, #default
#                                    topic_word_prior=1.0/n_topic, #default
                                    verbose=0)
    t0 = time()
    lda.fit(tf)
    perplexityLst[idx] = lda.perplexity(tf)
    lda_models.append(lda)
    print("# of Topic: %d, done in %0.3fs, N_iter %d, Perplexity Score %0.3f"
          % (n_topic, time() - t0, lda.n_iter_, perplexityLst[idx]))

# Pick the model with the lowest perplexity
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic:  %d" % best_n_topic)

# Plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(n_topics, perplexityLst)
ax.set_xlabel("# of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
# CODE (a string tag used in the file name) is not defined in this post; set it yourself.
# The 'lda_result' directory must already exist.
plt.savefig(os.path.join('lda_result', 'perplexityTrend'+CODE+'.png'))
plt.show()

Output:
Best # of Topic:  25
![Perplexity trend for different numbers of topics](http://img.blog.csdn.net/20170731171742934?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvVGlmZmFueVJhYmJpdA==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)

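2. If you want to tune all the parameters in one go, you can also run cross-validation directly with sklearn, but that is bound to take a very long time. The code below is only a reference; add or remove parameters as you need.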

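from sklearn.model_selection import GridSearchCV
# Every value must be a list or tuple of candidates, including max_iter
parameters = {'learning_method': ('batch', 'online'),
              'n_components': list(range(20, 75, 5)),  # n_topics before scikit-learn 0.19
              'perp_tol': (0.001, 0.01, 0.1),
              'doc_topic_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'topic_word_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'max_iter': (1000,)}
lda = LatentDirichletAllocation()
# With no scoring argument, GridSearchCV ranks candidates by lda.score (approximate log-likelihood)
model = GridSearchCV(lda, parameters)
model.fit(tf)

sorted(model.cv_results_.keys())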
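A quick count shows why this search takes so long. The sketch below enumerates the grid with itertools.product, using only the standard library. The number of cross-validation folds (5) is an assumption that depends on the sklearn version and the cv argument, and the key n_components stands for the topic-count parameter (n_topics in older releases).

```python
from itertools import product

# Same candidate values as the grid in this post
param_grid = {
    'learning_method': ['batch', 'online'],
    'n_components': list(range(20, 75, 5)),
    'perp_tol': [0.001, 0.01, 0.1],
    'doc_topic_prior': [0.001, 0.01, 0.05, 0.1, 0.2],
    'topic_word_prior': [0.001, 0.01, 0.05, 0.1, 0.2],
    'max_iter': [1000],
}

# One candidate model per element of the Cartesian product
names = sorted(param_grid)
combinations = list(product(*(param_grid[name] for name in names)))

n_folds = 5  # assumed number of cross-validation folds
n_fits = len(combinations) * n_folds

print("parameter combinations: %d" % len(combinations))
print("model fits with %d folds: %d" % (n_folds, n_fits))
```

Every one of those fits runs up to 1000 EM iterations, so trimming the grid, or tuning one parameter at a time as in the first approach, is usually the practical choice.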

Appendix: the sklearn LDA API explained

Class sklearn.decomposition.LatentDirichletAllocation(n_topics=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None)

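Parameters:
1) n_topics: the number of latent topics K, which needs tuning. (In scikit-learn 0.19 and later this parameter is named n_components.) The right K depends on how fine-grained the topics need to be. If we only need a coarse split, such as telling animals, plants, and non-living things apart, K can be very small, a single digit. If the goal is fine-grained, such as distinguishing different animals, different plants, and different non-living things, K has to be large, for example in the thousands or tens of thousands, and that in turn requires a very large number of training documents.
2) doc_topic_prior: the parameter α of the Dirichlet prior on the document-topic distribution θd. Without prior knowledge about the topic distribution, the default 1/K is usually fine.
3) topic_word_prior: the parameter η of the Dirichlet prior on the topic-word distribution βk. Without prior knowledge about the topic distribution, the default 1/K is usually fine.
4) learning_method: the algorithm used to fit LDA, either 'batch' or 'online'. 'batch' is the variational-inference EM algorithm covered in the theory write-up; 'online' is the online variational-inference EM algorithm, which builds on 'batch' by splitting the training samples into mini-batches and updating the topic-word distribution one batch at a time. In the version documented here the default is 'online', and with 'online' we can train incrementally with the partial_fit function. Starting with scikit-learn 0.20, however, the default is 'batch'. If the sample is small and you are only learning, 'batch' is the better choice, because it leaves far fewer parameters to tune. If the sample is very large, 'online' is the first choice.
5) learning_decay: only meaningful with 'online'. The value should lie in (0.5, 1.0] to guarantee that the 'online' algorithm converges asymptotically. It controls the learning rate of the 'online' algorithm; the default is 0.7. Usually there is no need to change it.
6) learning_offset: only meaningful with 'online'. The value should be greater than 1. It reduces the influence of the early training batches on the final model.
7) max_iter: the maximum number of EM iterations.
8) total_samples: only meaningful with 'online'. It is the total number of documents in the corpus, not the size of a single batch, and is needed when training with partial_fit.
9) batch_size: only meaningful with 'online'. It is the number of documents used in each EM iteration.
10) mean_change_tol: the stopping threshold for updating the variational parameters in the E-step. Once the updates of all variational parameters fall below it, the E-step ends and the M-step begins. Usually the default does not need changing.
11) max_doc_update_iter: the maximum number of iterations for updating the variational parameters in the E-step. Once the E-step reaches this limit, the algorithm moves on to the M-step.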
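The roles of learning_decay and learning_offset are easier to see with numbers. The sketch below assumes the step-size schedule from the online variational Bayes literature, weight = (learning_offset + t) ** (-learning_decay), where t counts the mini-batches seen so far. That sklearn follows exactly this formula is an assumption here, and the function name online_weight is made up for illustration.

```python
def online_weight(t, learning_offset=10.0, learning_decay=0.7):
    """Weight given to the t-th mini-batch update (assumed schedule)."""
    return (learning_offset + t) ** (-learning_decay)

# The weight shrinks as more batches are seen, so later updates
# change the topic-word estimate less and less.
for t in (1, 10, 100, 1000):
    print("t=%4d  default: %.4f  offset=100: %.4f  decay=1.0: %.4f" % (
        t,
        online_weight(t),
        online_weight(t, learning_offset=100.0),
        online_weight(t, learning_decay=1.0)))
```

Under this schedule a larger learning_offset lowers the weight of the first few batches, which matches the description of item 6, and a larger learning_decay makes the weights fall off faster.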

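Methods:
1) fit(X[, y]): train the model on the training data; X is the document-term count matrix.
2) fit_transform(X[, y]): train the model on the training data and return the topic distribution of the training data.
3) get_params([deep]): get the parameters.
4) partial_fit(X[, y]): train the model in online fashion on a mini-batch of data.
5) perplexity(X[, doc_topic_distr, sub_sampling]): compute the approximate perplexity of the data X.
6) score(X[, y]): compute the approximate log-likelihood.
7) set_params(**params): set the parameters.
8) transform(X): use the trained model to get the topic distribution of each document in the corpus X.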