Generating Sentence Vectors from Text with BERT

shlhhy · Published 2020-09-04 16:56:38

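My previous method for generating sentence vectors was to train a word-vector model (w2v) and average the vectors of the words in each sentence. Now I want to try generating sentence vectors with a BERT model.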
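For comparison, the averaging baseline can be sketched in a few lines. This is only an illustration: the toy `word_vectors` dictionary stands in for a trained w2v model, and the helper name `average_sentence_vector` is hypothetical, not taken from the original post.

```python
import numpy as np

# Toy word vectors standing in for a trained w2v model (hypothetical values).
word_vectors = {
    "I": np.array([1.0, 0.0, 2.0]),
    "like": np.array([0.0, 2.0, 0.0]),
    "BERT": np.array([2.0, 1.0, 1.0]),
}


def average_sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.

    Tokens missing from the vocabulary are skipped; if none are found,
    a zero vector of length `dim` is returned.
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


sentence_vec = average_sentence_vector(
    ["I", "like", "BERT", "unknown"], word_vectors, dim=3
)
print(sentence_vec)
```

Averaging ignores word order and context, which is one reason to try contextual token vectors from BERT instead.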

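1. BERT model structure

all_encoder_layers: the output of every block returned by the transformer_model function, corresponding to BERT's 12 Transformer layers.
sequence_output: the output of BERT's last layer. I was unsure how it differs from the last entry of all_encoder_layers; in modeling.py it is assigned as all_encoder_layers[-1], so the two are the same tensor.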

2. Loading the BERT model

Loading the BERT model mainly relies on functions from the modeling file (modeling.py).

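import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.train.init_from_checkpoint)

import modeling       # modeling.py from the BERT repository
import tokenization   # tokenization.py from the BERT repository

# Load the configuration file of the pre-trained BERT model
data_root = '../../chinese_L-12_H-768_A-12/'
bert_config_file = data_root + 'bert_config.json'
bert_vocab_file = data_root + 'vocab.txt'
bert_config = modeling.BertConfig.from_json_file(bert_config_file)
# Fine-tuned model checkpoint; for how this file was produced, see: https://blog.csdn.net/shlhhy/article/details/107382079
init_checkpoint = '../output/model.ckpt-455'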

input_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_ids")
input_mask = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_mask")
segment_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="segment_ids")

model = modeling.BertModel(
        config=bert_config,
        is_training=False,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False
)
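# All variables in the graph that were created with trainable=True
tvars = tf.trainable_variables()
# Map the variables stored in model.ckpt-455 onto the variables of the current graph
assignment, initialized_variable_names = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
# Override the variable initializers so that they load their values from the checkpoint
tf.train.init_from_checkpoint(init_checkpoint, assignment)

# Get the last layer and the second-to-last layer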
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

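3. Reading the data

Tokenization is done with the tokenization module.

# tokenization.FullTokenizer handles tokenization, punctuation, unknown tokens, Unicode conversion, etc.
# Chinese text is split into single characters only; there is no word-level segmentation
token = tokenization.FullTokenizer(vocab_file=bert_vocab_file)

# The model was trained with a maximum sequence length of 128, which is clearly not enough
# to vectorize long texts, so long texts are split on the full stop and a vector is generated per sentence
# read_input is the author's own helper (not shown in this post)
file_input_x_c_test = '../data/test_x.txt'
input_test_data = read_input(file_dir=file_input_x_c_test)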
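The post does not show `read_input` or how a record becomes model input, so the following is only a sketch of one way the splitting and padding could work. The names `split_sentences` and `encode_sentence` and the toy `vocab` are hypothetical, and `list(sentence)` is a stand-in for `FullTokenizer`'s character-level handling of Chinese. The `[CLS]` ... `[SEP]` layout, zero padding, and all-zero segment ids follow the usual single-sentence BERT input convention.

```python
MAX_SEQ_LEN = 128  # matches the maximum sequence length used during fine-tuning


def split_sentences(text, sep="。"):
    """Split a record on the Chinese full stop and drop empty pieces."""
    return [piece.strip() for piece in text.split(sep) if piece.strip()]


def encode_sentence(sentence, vocab, max_len=MAX_SEQ_LEN):
    """Build input_ids, input_mask and segment_ids for one sentence."""
    chars = list(sentence)[: max_len - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)                  # 1 marks a real token
    pad = max_len - len(ids)
    # Padding positions get id 0 and mask 0; a single sentence uses segment 0 throughout.
    return ids + [0] * pad, mask + [0] * pad, [0] * max_len


# Toy vocabulary (hypothetical ids; the real ones come from vocab.txt).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "今": 4, "天": 5, "气": 6, "好": 7}

record = "今天天气好。我们去公园。"
sentences = split_sentences(record)
ids, mask, segment = encode_sentence(sentences[0], vocab)
print(len(sentences), len(ids), sum(mask))
```

Stacking the three lists for every sentence of a record gives the `[num_sentences, 128]` arrays that the placeholders expect.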

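Each line in the dataset is one record. As shown in the figure, this record contains 6 sentences and produces one vector made up of 7 vector arrays, each of length 128.

[Figure: image not available]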

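4. Generating the vectors

Each record is treated as one batch. The data is arranged in the format the model expects and run through BERT, and the output of the second-to-last layer is taken, which gives 7 arrays of 128 rows by 768 columns.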

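# get_batch_data is the author's own helper (not shown); sample is one record from input_test_data
word_id, mask, segment = get_batch_data(sample)
feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
# sess is a tf.Session in which tf.global_variables_initializer() has already been run
last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)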
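The result `last2` holds one 768-dimensional vector per token, not per sentence. The post does not say how these are reduced to a single sentence vector, so the following is an assumption: a masked mean over the real (non-padding) tokens, which is one common choice. The helper name `masked_mean_pool` is hypothetical, and random data stands in for `last2`.

```python
import numpy as np


def masked_mean_pool(token_vectors, input_mask):
    """Average token vectors over real tokens only.

    token_vectors: [batch, seq_len, hidden] array, e.g. `last2`
    input_mask:    [batch, seq_len] array with 1 for real tokens, 0 for padding
    """
    mask = np.asarray(input_mask, dtype=token_vectors.dtype)[:, :, None]
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # guard against all-padding rows
    return summed / counts


# Random stand-in for `last2` with the shapes from the text:
# 7 sentences, 128 token positions, 768 hidden dimensions.
rng = np.random.default_rng(0)
fake_last2 = rng.normal(size=(7, 128, 768)).astype(np.float32)
fake_mask = np.zeros((7, 128), dtype=np.int32)
fake_mask[:, :10] = 1  # pretend each sentence has 10 real tokens

sentence_vectors = masked_mean_pool(fake_last2, fake_mask)
print(sentence_vectors.shape)  # (7, 768)
```

Using the mask keeps padding positions from diluting the average; taking the vector at the `[CLS]` position would be another option.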
