This tutorial trains a sequence-to-sequence (seq2seq) model for Spanish-to-English translation. It is an advanced example that assumes some familiarity with sequence-to-sequence models.
After training the model, entering a Spanish sentence returns the corresponding English translation; for example, "¿todavia estan en casa?" returns "are you still at home?"
The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence the model attends to while translating:
Note: this example takes approximately 10 minutes to run on a single P100 GPU.
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time
We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the following format:
May I borrow this book? ¿Puedo tomar prestado este libro?
There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we host a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:
1. Add a start and end token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word to id and id to word).
4. Pad each sentence to a maximum length.
# Download the file
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"
Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2646016/2638744 [==============================] - 0s 0us/step
2654208/2638744 [==============================] - 0s 0us/step
# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference: https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.rstrip().strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))
<start> may i borrow this book ? <end>
<start> ¿ puedo tomar prestado este libro ? <end>
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
  lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

  word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines[:num_examples]]

  return zip(*word_pairs)
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])
<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>
def max_length(tensor):
  return max(len(t) for t in tensor)
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

  return tensor, lang_tokenizer
def load_dataset(path, num_examples=None):
  # creating cleaned input, output pairs
  targ_lang, inp_lang = create_dataset(path, num_examples)

  input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
  target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer
Training on the complete dataset of more than 100,000 sentences takes a long time. To train faster, we can limit the dataset to 30,000 sentences (of course, translation quality degrades with less data):
# Try experimenting with the size of that dataset
num_examples = 30000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)
(24000, 24000, 6000, 6000)
def convert(lang, tensor):
  for t in tensor:
    if t != 0:
      print("%d ----> %s" % (t, lang.index_word[t]))
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])
Input Language; index to word mapping
1 ----> <start>
8 ----> no
38 ----> puedo
804 ----> confiar
20 ----> en
1000 ----> vosotras
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
4 ----> i
25 ----> can
12 ----> t
345 ----> trust
6 ----> you
3 ----> .
2 ----> <end>
Create a tf.data dataset
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
(TensorShape([64, 16]), TensorShape([64, 11]))
We'll implement an encoder-decoder model with attention, which you can read about in the TensorFlow Neural Machine Translation (seq2seq) tutorial. This example uses a more recent set of APIs and implements the attention equations from that seq2seq tutorial. The diagram below shows that each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence.
The input is passed through the encoder model, which gives us the encoder output of shape (batch_size, max_length, hidden_size) and the encoder hidden state of shape (batch_size, hidden_size).
Here are the equations that are implemented:
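The equation images from the original tutorial are not reproduced in this text; as a sketch in standard notation (where W_1, W_2 and v_a are the learned weights of the FC layers in the pseudo-code below), the Bahdanau-style additive attention computed by the code in this section can be written as:

$$\mathrm{score}_{ts} = v_a^{\top} \tanh\left(W_1 \bar{h}_s + W_2 h_t\right)$$
$$\alpha_{ts} = \frac{\exp(\mathrm{score}_{ts})}{\sum_{s'} \exp(\mathrm{score}_{ts'})}$$
$$c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s$$

Here $\bar{h}_s$ is the encoder output at source position $s$ (EO below), $h_t$ is the decoder hidden state at step $t$ (H below), $\alpha_{ts}$ are the attention weights, and $c_t$ is the context vector.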
This tutorial uses Bahdanau attention. Let's decide on notation before writing the simplified form:
FC = fully connected (dense) layer
EO = encoder output
H = decoder hidden state
X = input to the decoder
And the pseudo-code:
score = FC(tanh(FC(EO) + FC(H)))
attention weights = softmax(score, axis=1). Softmax is applied on the last axis by default, but here we want to apply it on the first axis, since the shape of score is (batch_size, max_length, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input word, softmax should be applied on that axis (see the short sketch after this list).
context vector = sum(attention weights * EO, axis=1). Same reason as above for choosing axis 1.
embedding output = the input to the decoder X passed through an embedding layer.
merged vector = concat(embedding output, context vector). This merged vector is then given to the GRU.
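As a minimal, self-contained sketch of that axis choice (toy shapes only, not the model built below): with a score tensor of shape (batch_size, max_length, 1), softmax over axis=1 normalizes across the input positions of each sentence, whereas the default last axis has size 1 and would just produce all ones.

import tensorflow as tf

# Toy score tensor: batch of 2 sentences, 5 input positions, 1 score per position.
score = tf.random.normal((2, 5, 1))

# Softmax over the max_length axis: one weight per input word, summing to 1 per sentence.
attention_weights = tf.nn.softmax(score, axis=1)
print(tf.reduce_sum(attention_weights, axis=1))  # ~[[1.], [1.]]

# Softmax over the default (last) axis of size 1 makes every weight exactly 1, which is useless here.
print(tf.nn.softmax(score, axis=-1)[0, :3, 0])   # [1., 1., 1.]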
The shapes of all the vectors at each step are specified in the comments in the code:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query (hidden) shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we apply self.V to the tanh output,
    # whose shape is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((64, 1)),
                                      sample_hidden, sample_output)

print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
Decoder output shape: (batch_size, vocab size) (64, 4935)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  # mask out the loss contributed by padding tokens (id 0)
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
......
Epoch 10 Batch 0 Loss 0.1219
Epoch 10 Batch 100 Loss 0.1374
Epoch 10 Batch 200 Loss 0.1084
Epoch 10 Batch 300 Loss 0.0994
Epoch 10 Loss 0.1088
Time taken for 1 epoch 29.2324090004 sec
The evaluate function below is similar to the training loop, except we don't use teacher forcing here: the input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output. We stop predicting when the model predicts the end token, and we store the attention weights for every time step.
Note: the encoder output is calculated only once for one input.
def evaluate(sentence):
  attention_plot = np.zeros((max_length_targ, max_length_inp))

  sentence = preprocess_sentence(sentence)

  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                         dec_hidden,
                                                         enc_out)

    # storing the attention weights to plot later on
    attention_weights = tf.reshape(attention_weights, (-1, ))
    attention_plot[t] = attention_weights.numpy()

    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence, attention_plot

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence, attention_plot
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
  fig = plt.figure(figsize=(10, 10))
  ax = fig.add_subplot(1, 1, 1)
  ax.matshow(attention, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
  ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

  plt.show()
def translate(sentence):
  result, sentence, attention_plot = evaluate(sentence)

  print('Input: %s' % (sentence))
  print('Predicted translation: {}'.format(result))

  attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
  plot_attention(attention_plot, sentence.split(' '), result.split(' '))
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
translate(u'hace mucho frio aqui.')
Input: <start> hace mucho frio aqui . <end>
Predicted translation: it s very cold here . <end>
translate(u'esta es mi vida.')
Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end>
translate(u'¿todavia estan en casa?')
Input: <start> ¿ todavia estan en casa ? <end>
Predicted translation: are you still at home ? <end>
# wrong translation
translate(u'trata de averiguarlo.')
Input: <start> trata de averiguarlo . <end>
Predicted translation: try to figure it out . <end>
Latest version: https://www.mashangxue123.com/tensorflow/tf2-tutorials-text-nmt_with_attention.html English version: https://tensorflow.google.cn/beta/tutorials/text/nmt_with_attention Translation suggestions (PR): https://github.com/mashangxue/tensorflow2-zh/edit/master/r2/tutorials/text/nmt_with_attention.md