Text classification with an RNN in practice: movie reviews (translation of the official TensorFlow 2.0 tutorial)

This tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment classification.


from __future__ import absolute_import, division, print_function, unicode_literals

# !pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow_datasets as tfds
import tensorflow as tf

Import matplotlib and create a helper function to plot graphs.


import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

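1. Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review carries either a positive or a negative sentiment label.

Download the dataset using TFDS. The dataset comes with a built-in subword tokenizer.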


dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

Because this is a subword tokenizer, it can be passed any string and it will tokenize it.


tokenizer = info.features['text'].encoder

print('Vocabulary size: {}'.format(tokenizer.vocab_size))


    Vocabulary size: 8185


sample_string = 'TensorFlow is cool.'

tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string


    Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
    The original string: TensorFlow is cool.

If a word is not in its dictionary, the tokenizer encodes it by breaking it into subwords.


for ts in tokenized_string:
  print('{} ----> {}'.format(ts, tokenizer.decode([ts])))


    6307 ----> Ten
    2327 ----> sor
    4043 ----> Fl
    4265 ----> ow
    9 ----> is
    2724 ----> cool
    7975 ----> .
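To make the idea of breaking an unknown word into known pieces concrete, here is a toy sketch of greedy longest-match splitting. It is only an illustration: the function `greedy_subword_split` and the tiny vocabulary are hypothetical, and this is not claimed to be the algorithm that the TFDS encoder uses internally.

```python
def greedy_subword_split(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right.

    Characters that match no vocabulary entry become single-character pieces.
    """
    pieces = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        pieces.append(match)
        i += len(match)
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the pieces printed above.
toy_vocab = {"Ten", "sor", "Fl", "ow", " is", " cool", "."}
pieces = greedy_subword_split("TensorFlow is cool.", toy_vocab)
print(pieces)
```

Joining the pieces back together reproduces the input string, which is the same round-trip property the `assert` in the tutorial code checks for the real tokenizer.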


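BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Shuffle the training examples, then group them into batches,
# zero-padding the reviews in each batch to a common length.
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)

# The test set is batched and padded the same way, without shuffling.
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)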
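Reviews have different lengths, so they cannot be stacked into one tensor until they are padded. The following pure-Python sketch shows the effect of padded batching on a few token lists. The helper `pad_batch` is hypothetical and assumes 0 is the padding value; it imitates the result, not the implementation, of `padded_batch`.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Three token-id sequences of different lengths (ids taken from the example above).
batch = pad_batch([[6307, 2327], [9], [4043, 4265, 9, 2724]])
for row in batch:
    print(row)
```

Each batch is padded only to its own longest sequence, so different batches may end up with different widths.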

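2. Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts a sequence of word indices into a sequence of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.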

This index lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
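The equivalence between the two operations can be checked numerically. The NumPy sketch below uses a small random matrix as a stand-in for the embedding weights (the sizes and names are assumptions for illustration) and compares a row lookup with a one-hot matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4

# Stand-in for trained embedding weights: one row per vocabulary index.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
word_ids = np.array([3, 7, 3])

# Embedding lookup: pick rows by index.
looked_up = embedding_matrix[word_ids]

# Equivalent dense computation: one-hot vectors times the weight matrix.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding_matrix

print(np.allclose(looked_up, via_matmul))
```

The lookup touches only the selected rows, while the matrix product multiplies every one-hot vector by the full weight matrix, which is why the lookup is cheaper for large vocabularies.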

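A recurrent neural network (RNN) processes sequence input by iterating through its elements. At each timestep the RNN feeds its output back in as part of its input for the next timestep.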
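As a minimal sketch of this iteration, the NumPy code below runs a plain tanh recurrent cell over one sequence. It is a simplification: the tutorial's model uses an LSTM, and the function name, sizes, and random weights here are assumptions chosen for illustration.

```python
import numpy as np


def simple_rnn(inputs, w_x, w_h, b):
    """Run a tanh RNN over one sequence of shape (timesteps, input_dim)."""
    h = np.zeros(w_h.shape[0])  # initial state
    states = []
    for x_t in inputs:
        # The previous state h is fed back in together with the current input.
        h = np.tanh(x_t @ w_x + h @ w_h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 4, 3
inputs = rng.normal(size=(timesteps, input_dim))
w_x = rng.normal(size=(input_dim, units))
w_h = rng.normal(size=(units, units))
b = np.zeros(units)

states = simple_rnn(inputs, w_x, w_h, b)
print(states.shape)
```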

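The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. It propagates the input forward and backward through the RNN layer and then concatenates the outputs. This helps the RNN learn long-range dependencies.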
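The forward-and-backward idea can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the Keras implementation: it uses a plain tanh cell instead of an LSTM, the helper names are hypothetical, and each direction is given its own random weights.

```python
import numpy as np


def last_state(inputs, w_x, w_h):
    """Return the final state of a tanh RNN run over (timesteps, input_dim)."""
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h


def bidirectional_last_state(inputs, forward_weights, backward_weights):
    """Run one RNN forward, another over the reversed sequence, and concatenate."""
    forward = last_state(inputs, *forward_weights)
    backward = last_state(inputs[::-1], *backward_weights)
    return np.concatenate([forward, backward])


rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 4, 3
inputs = rng.normal(size=(timesteps, input_dim))
fwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))

output = bidirectional_last_state(inputs, fwd, bwd)
print(output.shape)
```

Because the two final states are concatenated, the output has twice as many features as a single direction.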


model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the Keras model to configure the training process:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

3. Train the model


history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset)


    ...
    Epoch 10/10
    391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873


test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))


    391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873
    Test Loss: 0.553319326714
    Test Accuracy: 0.787320017815

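The model above does not mask the padding applied to the sequences. This can lead to skew if we train on padded sequences and test on unpadded sequences. Ideally the model would learn to ignore the padding, but as you can see below, padding does have a small effect on the output.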
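For comparison, masking means skipping the padded timesteps so that the recurrent state is carried forward unchanged. The NumPy sketch below illustrates the difference on a toy model. It assumes a plain tanh cell in place of the LSTM, padding id 0, and random weights; the function name and sizes are hypothetical.

```python
import numpy as np


def rnn_last_state(token_ids, emb, w_x, w_h, mask_padding):
    """Final state of a tanh RNN over token ids, optionally skipping id 0."""
    h = np.zeros(w_h.shape[0])
    for token in token_ids:
        if mask_padding and token == 0:
            continue  # masked step: keep the previous state unchanged
        h = np.tanh(emb[token] @ w_x + h @ w_h)
    return h


rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 4))   # toy embedding table, row 0 is the padding id
w_x = rng.normal(size=(4, 3))
w_h = rng.normal(size=(3, 3))

tokens = [5, 12, 7]
padded = tokens + [0] * 5

# Largest difference between the padded and unpadded final states.
unmasked_diff = np.abs(rnn_last_state(tokens, emb, w_x, w_h, False)
                       - rnn_last_state(padded, emb, w_x, w_h, False)).max()
masked_diff = np.abs(rnn_last_state(tokens, emb, w_x, w_h, True)
                     - rnn_last_state(padded, emb, w_x, w_h, True)).max()
print(unmasked_diff, masked_diff)
```

Without masking, the trailing zeros keep updating the state, so the padded and unpadded results differ. With masking, both inputs produce the same final state.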

If the prediction is >= 0.5, it is positive; otherwise it is negative.
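A minimal sketch of this decision rule is shown below. The helper name `to_label` and the two example scores are made up for illustration and are not part of the tutorial.

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return 'positive' if score >= threshold else 'negative'


# Illustrative scores only; real values come from model.predict.
labels = [to_label(0.69), to_label(0.004), to_label(0.5)]
print(labels)
```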


def pad_to_size(vec, size):
  # Append zeros in place until vec has `size` elements.
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sentence, pad):
  # Encode the sentence passed in, not a global variable.
  tokenized_sample_pred_text = tokenizer.encode(sentence)

  if pad:
    tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)

  # Add a batch dimension before predicting.
  predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))

  return predictions


# Predict on the sample text without padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)


    [[ 0.68914342]]


# Predict on the sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)


    [[ 0.68634349]]


plot_graphs(history, 'accuracy')

[Figure: training and validation accuracy by epoch]

plot_graphs(history, 'loss')

[Figure: training and validation loss by epoch]

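4. Stack two or more LSTM layers

Keras recurrent layers have two available modes, controlled by the return_sequences constructor argument:

Return the full sequence of successive outputs for each timestep (a 3D tensor of shape (batch_size, timesteps, output_features)).
Return only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)).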
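The two output shapes can be illustrated with a small NumPy sketch. It again assumes a plain tanh cell rather than an LSTM, and the function `simple_rnn` with its `return_sequences` flag is a hypothetical stand-in that only mimics the shape behavior described above.

```python
import numpy as np


def simple_rnn(batch, w_x, w_h, return_sequences=False):
    """Tanh RNN over a batch of shape (batch_size, timesteps, input_dim)."""
    batch_size, timesteps, _ = batch.shape
    h = np.zeros((batch_size, w_h.shape[0]))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(batch[:, t, :] @ w_x + h @ w_h)
        outputs.append(h)
    if return_sequences:
        # One output per timestep: (batch_size, timesteps, units)
        return np.stack(outputs, axis=1)
    # Only the last output: (batch_size, units)
    return h


rng = np.random.default_rng(0)
batch = rng.normal(size=(2, 5, 4))
w_x = rng.normal(size=(4, 3))
w_h = rng.normal(size=(3, 3))

full = simple_rnn(batch, w_x, w_h, return_sequences=True)
last = simple_rnn(batch, w_x, w_h, return_sequences=False)
print(full.shape, last.shape)
```

A second recurrent layer needs a sequence as input, which is why the first LSTM in a stack is built with return_sequences=True.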


model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset)


    ...
    Epoch 10/10
    391/391 [==============================] - 154s 394ms/step - loss: 0.1120 - accuracy: 0.9643 - val_loss: 0.5646 - val_accuracy: 0.8070


test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))


    391/Unknown - 45s 115ms/step - loss: 0.5646 - accuracy: 0.8070
    Test Loss: 0.564571284348
    Test Accuracy: 0.80703997612


# Predict on the sample text without padding

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)


    [[ 0.00393916]]


# Predict on the sample text with padding

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)


    [[ 0.01098633]]


plot_graphs(history, 'accuracy')

[Figure: training and validation accuracy by epoch]

plot_graphs(history, 'loss')

[Figure: training and validation loss by epoch]

Check out other existing recurrent layers, such as GRU layers.

Latest version: https://www.mashangxue123.com/tensorflow/tf2-tutorials-text-text_classification_rnn.html
English version: https://tensorflow.google.cn/beta/tutorials/text/text_classification_rnn