thumt package

Submodules

thumt.binmt module

class thumt.binmt.BiRNNsearch(config)

Bases: thumt.nmt.model

The bidirectional RNNsearch model used for semi-supervised training.

build()

Build the computational graph.

get_addition_grads(cost0, cost1, now_s)

Update the total cost of the bidirectional NMT model.

get_inputs_batch(xp, yp, xm, ym)

Get a batch for semi-supervised training.

Parameters:
  • xp (numpy array) – the indexed source sentences in parallel corpus
  • yp (numpy array) – the indexed target sentences in parallel corpus
  • xm (numpy array) – the indexed source sentences in monolingual corpus
  • ym (numpy array) – the indexed target sentences in monolingual corpus
is_valid(sents, eos, unk)

Validate the sentences.

Parameters:
  • sents (theano variable) – the indexed sentences
  • eos (int) – the index of end-of-sentence symbol
  • unk (int) – the index of unknown word symbol
sample(x, length, n_samples=1)

Sample with the source-to-target network.

Parameters:
  • x (numpy array) – the indexed source sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

sample_inv(x, length, n_samples=1)

Sample with the target-to-source network.

Parameters:
  • x (numpy array) – the indexed target sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

translate(x, beam_size=10)

Beam search with the source-to-target network.

Parameters:
  • x (numpy array) – the indexed source sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result

translate_inv(x, beam_size=10)

Beam search with the target-to-source network.

Parameters:
  • x (numpy array) – the indexed target sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result
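
A minimal usage sketch, assuming a valid THUMT configuration dict config and that trained parameters have been loaded via load(); the input layout (a column of word indices ending with the end-of-sentence index) is illustrative:

    import numpy
    from thumt.binmt import BiRNNsearch

    model = BiRNNsearch(config)   # config: assumed THUMT configuration dict
    model.build()

    # x: indexed source sentence, shape (length, 1), ending with the EOS index
    x = numpy.asarray([[3], [25], [7], [0]], dtype='int64')

    y = model.translate(x, beam_size=10)               # beam search, source-to-target
    samples = model.sample(x, length=50, n_samples=5)  # sampling, source-to-target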

thumt.data module

class thumt.data.DataCollection(config, train=True)

Bases: object

The data manager. It also records the training status.

Parameters:
  • config (dict) – the configuration
  • train (bool) – set to True only for training. If True, the vocabulary and corpus will be loaded, and the training status will be recorded.
encode_vocab(encoding='utf-8')

Change the encoding of the vocabulary.

index_word_target(index)

Get the target word given its index.

Parameters:index (int) – the word index
Returns:string, the corresponding word.
last_improved(last=False)
Parameters:last (bool) – if True, getting the same result as before counts as an improvement; if False, it does not.
Returns:int, the number of iterations since the last improvement.
load_data()

Load the training corpus.

load_data_mono()

Load the monolingual training corpus. Only used in semi-supervised training.

load_status(path)

Load the training status from file.

Parameters:path (string) – the path to a file
load_vocab()

Load the vocabulary.

next()

Get the next batch of the training corpus.

Returns:x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
next_mono()

Get the next batch of monolingual training corpus. Only used in semi-supervised training.

Returns:x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
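
A sketch of how these batch methods are typically driven in a training loop, assuming config is a valid configuration dict (getbatch is documented below):

    from thumt.data import DataCollection, getbatch

    data = DataCollection(config, train=True)
    data.load_vocab()
    data.load_data()

    for step in range(max_steps):          # max_steps: assumed training budget
        lx, ly = data.next()               # indexed batch, one sentence per row
        batch = getbatch(lx, ly, config)   # padded batch (see getbatch below)
        # ... pass the batch to the model's update function
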
print_sentence(sentence, vocab, index_eos)

Get the text form of a sentence represented by an index vector.

Parameters:
  • sentence (numpy array) – indexed sentence. size:(length, 1)
  • vocab (list) – vocabulary
  • index_eos (int) – the index of the end-of-sentence symbol
Returns:

string, the text form of the sentence

print_source(sentence)

Print a source sentence represented by an index vector.

Parameters:sentence (numpy array) – indexed sentence. size:(length, 1)
Returns:string, the text form of the source sentence
print_target(sentence)

Print a target sentence represented by an index vector.

Parameters:sentence (numpy array) – indexed sentence. size:(length, 1)
Returns:string, the text form of the target sentence
save_status(path)

Save the training status to file.

Parameters:path (string) – the path to a file
toindex(sentence, ivocab, index_unk, index_eos)

Transform a sentence text to indexed sentence.

Parameters:
  • sentence (string) – sentence text
  • ivocab (dict) – the vocabulary to index
  • index_unk (int) – the index of unknown word symbol
  • index_eos (int) – the index of end-of-sentence symbol
Returns:

numpy array, the indexed sentence

toindex_source(sentence)

Transform a source language word list to index list.

Parameters:sentence (string) – sentence text
Returns:numpy array, the indexed source sentence
toindex_target(sentence)

Transform a target language word list to index list.

Parameters:sentence (string) – sentence text
Returns:numpy array, the indexed target sentence
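
For illustration, a round trip between text and index space with the methods above, assuming the vocabulary has been loaded:

    src = data.toindex_source('the cat sat .')   # text -> indexed numpy array
    text = data.print_source(src)                # indexed array -> text
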
thumt.data.getbatch(lx, ly, config)

Get a batch for training.

Parameters:
  • lx (numpy array) – a 2-D numpy array, each row contains an indexed source sentence
  • ly (numpy array) – a 2-D numpy array, each row contains an indexed target sentence
  • config (dict) – the configuration

thumt.layer module

class thumt.layer.FeedForwardLayer(name, dim_in, dim_out, active=<theano.tensor.elemwise.Elemwise object>, offset=False)

Bases: thumt.layer.Layer

A single-layer feed-forward neural network.

Parameters:
  • dim_in (int) – the dimension of input vectors
  • dim_out (int) – the dimension of output vectors
  • active (function) – the activation function
  • offset (bool) – if true, this layer will contain bias
forward(state_in)

Build the computational graph.

Parameters:state_in (theano variable) – the input state
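
A sketch of the call shape, assuming Theano is installed. Inside THUMT, layers are normally created through thumt.layer.LayerFactory, so direct construction here is only illustrative:

    import theano
    import theano.tensor as T
    from thumt.layer import FeedForwardLayer

    x = T.matrix('x')                       # input states, one row per example
    ff = FeedForwardLayer('ff', dim_in=512, dim_out=256,
                          active=T.tanh, offset=True)
    y = ff.forward(x)                       # symbolic output state
    f = theano.function([x], y)             # compiled forward pass
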
class thumt.layer.GatedRecurrentLayer(name, dim_in, dim, active=<theano.tensor.elemwise.Elemwise object>, verbose=False)

Bases: thumt.layer.Layer

The gated recurrent layer used to encode the source sentences.

Parameters:
  • dim_in (int) – the number of input units
  • dim (int) – the number of hidden state units
  • active (function) – the activation function
  • verbose (bool) – only set to True on visualization
forward(emb_in, length, state_init=None, batch_size=1, mask=None)

Build the computational graph which computes the hidden states.

Parameters:
  • emb_in (theano variable) – the input word embeddings
  • length (theano variable) – the length of the input
  • batch_size (int) – the batch size
  • mask (theano variable) – indicate the length of each sequence in one batch
forward_step(state_before, state_in, gate_in, reset_in, mask=None)

Build the one-step computational graph which computes the next hidden state.

Parameters:
  • state_before (theano variable) – The previous hidden state
  • state_in (theano variable) – the input state
  • gate_in (theano variable) – the input to update gate
  • reset_in (theano variable) – the input to reset gate
  • mask (theano variable) – indicate the length of each sequence in one batch
class thumt.layer.GatedRecurrentLayer_attention(name, dim_in, dim_c, dim, dim_class, active=<theano.tensor.elemwise.Elemwise object>, maxout=2, verbose=False)

Bases: thumt.layer.Layer

The gated recurrent layer with attention mechanism, used as decoder.

Parameters:
  • dim_in (int) – the number of input units
  • dim_c (int) – the number of context units
  • dim (int) – the number of hidden units
  • dim_class (int) – the number of target vocabulary
  • active (function) – the activation function
  • maxout (int) – the number of maxout parts
  • verbose (bool) – only set to True on visualization
decode_next(c, state, emb)

Get the next hidden state. Used in beam search and sampling.

Parameters:
  • c (theano variable) – the current context reading
  • state (theano variable) – the last hidden state
  • emb (theano variable) – the embedding of the last generated word
decode_probs(context, state, emb)

Get the probability of the next word. Used in beam search and sampling.

Parameters:
  • context (theano variable) – the context vectors
  • state (theano variable) – the last hidden state
  • emb (theano variable) – the embedding of the last generated word
forward(emb_in, length, context, state_init, batch_size=1, mask=None, cmask=None)

Build the computational graph which computes the hidden states.

Parameters:
  • emb_in (theano variable) – the input word embeddings
  • length (theano variable) – the length of the input
  • context (theano variable) – the context vectors
  • state_init (theano variable) – the initial states
  • batch_size (int) – the batch size
  • mask (theano variable) – indicate the length of each sequence in one batch
  • cmask (theano variable) – indicate the length of each context sequence in one batch
forward_step(state_before, state_in, gate_in, reset_in, context, att_c, mask=None, cmask=None)

Build the one-step computational graph which calculates the next hidden state.

Parameters:
  • state_before (theano variable) – The previous hidden state
  • state_in (theano variable) – the input state
  • gate_in (theano variable) – the input to update gate
  • reset_in (theano variable) – the input to reset gate
  • mask (theano variable) – indicate the length of each sequence in one batch
  • cmask (theano variable) – indicate the length of each context sequence in one batch
  • context (theano variable) – the context vectors
  • att_c (theano variable) – the attention vector from context
class thumt.layer.Layer

Bases: object

The parent class of neural network layers.

class thumt.layer.LayerFactory

Bases: object

The factory to build and monitor all neural network layers.

class thumt.layer.LookupTable(name, num, dim_embed, offset=True)

Bases: thumt.layer.Layer

A lookup table layer which stores the word embeddings.

Parameters:
  • num (int) – the number of words
  • dim_embed (int) – the dimension of the word embedding
  • offset (bool) – if true, this layer will contain bias
forward(index)

Build the computational graph.

Parameters:index (theano variable) – the input word indices
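
A sketch of an embedding lookup with this layer; dimensions are illustrative, and as above, layers are normally built through LayerFactory:

    import theano.tensor as T
    from thumt.layer import LookupTable

    idx = T.imatrix('idx')                        # word indices, (length, batch)
    table = LookupTable('src_emb', num=30000, dim_embed=620)
    emb = table.forward(idx)                      # symbolic word embeddings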

thumt.mrt_utils module

thumt.mrt_utils.calBleu(x, ref_dict, lens, ngram)

Calculate the BLEU score with a single reference.

Parameters:
  • x (list) – the indexed hypothesis sentence
  • ref_dict (dict) – the n-gram count generated by getRefDict()
  • lens (int) – the length of the reference
  • ngram (int) – maximum length of counted n-grams
thumt.mrt_utils.cutSen(x, config)

Cut the part after the end-of-sentence symbol.

Parameters:
  • x (list) – indexed sentence
  • config (dict) – the configuration
thumt.mrt_utils.getMRTBatch(x, xmask, y, ymask, config, model, data)

Get a batch for MRT training.

Parameters:
  • x (numpy array) – the indexed source sentence
  • xmask (numpy array) – indicate the length of each sequence in source sequence
  • y (numpy array) – the indexed target sentence
  • ymask (numpy array) – indicate the length of each sequence in target sequence
  • config (dict) – the configuration
  • model (Model) – the NMT model
  • data (DataCollection) – the data manager
thumt.mrt_utils.getRefDict(words, ngram)

Get the count of n-grams in the reference.

Parameters:
  • words (list) – indexed sentence
  • ngram (int) – maximum length of counted n-grams
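
getRefDict() and calBleu() (above) are used together: build the reference n-gram counts once, then score each hypothesis against them. A sketch with illustrative index lists:

    from thumt.mrt_utils import getRefDict, calBleu

    ref = [5, 17, 9, 42, 3]                      # indexed reference sentence
    hyp = [5, 17, 8, 42, 3]                      # indexed hypothesis

    ref_dict = getRefDict(ref, 4)                # counts of 1- to 4-grams
    score = calBleu(hyp, ref_dict, len(ref), 4)  # sentence-level BLEU
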
thumt.mrt_utils.getUnique(samples, y, config, model, data)

Remove repeated sentences from sampling results. Then calculate BLEU score for each sentence.

Parameters:
  • samples (numpy array) – the sampling results
  • y (numpy array) – the indexed target sentence
  • config (dict) – the configuration
  • model (Model) – the NMT model
  • data (DataCollection) – the data manager
thumt.mrt_utils.getYM(y, config)

Get masks which indicate the lengths of the target sentences.

Parameters:
  • y (list) – the indexed sentences
  • config (dict) – the configuration

thumt.nmt module

class thumt.nmt.RNNsearch(config, name='')

Bases: thumt.nmt.model

The attention-based NMT model.

build(verbose=False)

Build the computational graph.

Parameters:verbose (bool) – only set to True on visualization
decode_sample(state_init, c, length, n_samples)

Build the decoder graph for sampling.

Parameters:
  • state_init (theano variable) – the initial state of the decoder
  • c (theano variable) – the context vectors
  • length (int) – the length limit of samples
  • n_samples (int) – the number of samples
encode(x)

Encode source sentence to context vector.

get_attention(x, xmask, y, ymask)

Get the attention weight of parallel sentences.

get_context_and_init(x)

Encode source sentence to context vectors and get the initial decoder hidden state.

get_cost(x, xmask, y, ymask)

Get the negative log-likelihood of parallel sentences.

get_init(c)

Get the initial decoder hidden state with context vector.

get_layer(x, xmask, y, ymask)

Get the hidden states essential for visualization.

get_next(c, state, emb)

Get the next hidden state.

get_probs(c, state, emb)

Get the probability of the next target word.

get_sample(x, length, n_samples)

Get sampling results.

get_trg_embedding(y)

Get the embedding of target sentence.

sampling_step(state, prev, context)

Build the computational graph which samples the next word.

Parameters:
  • state (theano variable) – the previous hidden state
  • prev (theano variable) – the last generated word
  • context (theano variable) – the context vectors
class thumt.nmt.model

Bases: object

The parent class of NMT models.

load(path, decode=False)

Load the model in npz format. Loading starts from the checkpoint model; if the checkpoint model does not exist, it will initialize a new model (MLE) or load from the given model (MRT or semi-supervised training).

Parameters:
  • path (string) – the path to a file
  • decode (bool) – Set to True only on decoding
sample(x, length, n_samples=1)

Sample according to the model's probability distribution.

Parameters:
  • x (numpy array) – the indexed source sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

save(path, data=None, mapping=None)

Save the model in npz format.

Parameters:
  • path (string) – the path to a file
  • data (DataCollection) – the data manager, will save the vocabulary into the model if set.
  • mapping (dict) – the mapping file used in UNKreplace, will save it to the model if set
translate(x, beam_size=10, return_array=False)

Decode with beam search.

Parameters:
  • x (numpy array) – the indexed source sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result
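
Putting the pieces together, a sketch of decoding with a trained model. The model path is hypothetical and the ordering of load() and build() is only indicative of what the toolkit's own scripts do:

    from thumt.nmt import RNNsearch
    from thumt.data import DataCollection

    data = DataCollection(config, train=False)
    data.load_vocab()

    model = RNNsearch(config)
    model.load('model.npz', decode=True)   # 'model.npz': hypothetical path
    model.build()

    x = data.toindex_source('the cat sat .')
    y = model.translate(x, beam_size=10)
    print(data.print_target(y))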

thumt.preprocess module

thumt.preprocess.preprocess(num, vocab_f, ivocab_f, input_f, output_f, data_vocab='cPickle', data_corpus='json', withdict=False, fromgr=False)

Count the most frequent words in the training corpus. Then, generate the vocabulary file and index the corpus.

Parameters:
  • vocab_f (string) – the path to vocabulary file
  • ivocab_f (string) – the path to vocabulary to index file
  • input_f (string) – the path to corpus (text) file
  • output_f (string) – the path to indexed corpus file
  • withdict (bool) – if set to True, the vocabulary will be loaded from an existing file instead.
thumt.preprocess.shuffle(src, trg, src_shuf, trg_shuf, data_corpus='json')

Randomly shuffle the parallel corpus.

Parameters:
  • src (string) – the path to indexed source corpus file
  • trg (string) – the path to indexed target corpus file
  • src_shuf (string) – the path to shuffled indexed source corpus file
  • trg_shuf (string) – the path to shuffled indexed target corpus file
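
A sketch of the usual preprocessing sequence: index both sides of a parallel corpus with a 30,000-word vocabulary, then shuffle it. All file names are hypothetical:

    from thumt.preprocess import preprocess, shuffle

    preprocess(30000, 'vocab.zh.pkl', 'ivocab.zh.pkl',
               'train.zh', 'train.zh.json')
    preprocess(30000, 'vocab.en.pkl', 'ivocab.en.pkl',
               'train.en', 'train.en.json')
    shuffle('train.zh.json', 'train.en.json',
            'train.zh.shuf.json', 'train.en.shuf.json')
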
thumt.preprocess.shuffle_mono(src, src_shuf, data_corpus='json')

Randomly shuffle the monolingual corpus.

Parameters:
  • src (string) – the path to indexed corpus file
  • src_shuf (string) – the path to the shuffled indexed corpus file

thumt.tools module

thumt.tools.bleu(hypo_c, refs_c, n)

Calculate BLEU score given translation and references.

Parameters:
  • hypo_c (string) – the translations
  • refs_c (list) – the list of references
  • n (int) – maximum length of counted n-grams
thumt.tools.bleu_file(hypo, refs)

Calculate the BLEU score given translation files and reference files.

Parameters:
  • hypo (string) – the path to translation file
  • refs (list) – the list of path to reference files
thumt.tools.clip(grads, threshold, square=True, params=None)

Build the computational graph that clips the gradient if the norm of the gradient exceeds the threshold.

Parameters:
  • grads (theano variable) – the gradient to be clipped
  • threshold (float) – the threshold of the norm of the gradient
Returns:

theano variable. The clipped gradient.

thumt.tools.cut_sentence(sentence, index_eos)

Cut the sentence after the end-of-sentence symbol.

Parameters:
  • sentence (numpy array) – the indexed sentence
  • index_eos (int) – the index of end-of-sentence symbol
thumt.tools.dot3d(input, weight)

Build the computational graph of the 3-D matrix multiplication operation.

Parameters:
  • input (theano variable) – the input variable
  • weight (theano variable) – the weight parameter
thumt.tools.duplicate(input, times)

Broadcast a 2-D tensor a given number of times along axis 1.

Parameters:
  • input (theano variable) – the input variable
  • times (int) – the number of times to duplicate
thumt.tools.get_ref_files(ref)

Get the list of reference files by prefix. Suppose nist02.en0, nist02.en1, nist02.en2, nist02.en3 are references and nist02.en does not exist; then get_ref_files("nist02.en") = ["nist02.en0", "nist02.en1", "nist02.en2", "nist02.en3"]

Parameters:ref (string) – the prefix of reference files
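
A sketch of scoring a translation file against references located by prefix; file names are hypothetical:

    from thumt.tools import get_ref_files, bleu_file

    refs = get_ref_files('nist02.en')     # e.g. ['nist02.en0', ..., 'nist02.en3']
    score = bleu_file('trans.txt', refs)  # corpus-level BLEU
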
thumt.tools.init_bias(size, name, scale=0.01, shared=True)

Initialize a bias parameter in neural networks.

Parameters:
  • size (tuple or list) – the size of the parameter
  • name (string) – the name of the parameter
  • scale (float) – the scale of the parameter
thumt.tools.init_weight(size, name, scale=0.01, shared=True)

Randomly initialize weight parameter in neural networks.

Parameters:
  • size (tuple or list) – the size of the parameter
  • name (string) – the name of the parameter
  • scale (float) – the scale of the parameter
thumt.tools.init_zeros(size, shared=True)

Initialize a zero matrix.

Parameters:size (tuple or list) – the size of the matrix
thumt.tools.maxout(input, max_num=2)

Build the computational graph of maxout operation.

Parameters:
  • input (theano variable) – the input variable
  • max_num (int) – the number of maxout parts
thumt.tools.merge_dict(d1, d2)

Merge two dicts. The count of each item is the maximum of its counts in the two dicts.

thumt.tools.padzero(input)

Build the computational graph that pads zeros to the left of the input.

Parameters:input (theano variable) – the input variable
thumt.tools.print_time(time)
Parameters:time (float) – the number of seconds
Returns:string, the text format of time
thumt.tools.sentence2dict(sentence, n)

Count the number of n-grams in a sentence.

Parameters:
  • sentence (string) – sentence text
  • n (int) – maximum length of counted n-grams
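
sentence2dict() combines naturally with merge_dict() (above) for multi-reference BLEU: take the per-reference n-gram counts and keep the maximum count of each n-gram. A sketch:

    from thumt.tools import sentence2dict, merge_dict

    d1 = sentence2dict('the cat sat on the mat', 4)
    d2 = sentence2dict('a cat sat on a mat', 4)
    ref_counts = merge_dict(d1, d2)   # max count per n-gram over both references
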
thumt.tools.shift_one(input)

Add a zero vector to the left side of input and remove the rightmost vector.

Parameters:input (theano variable) – the input variable
thumt.tools.softmax(energy, axis=1)

The softmax operation.

Parameters:energy (theano variable) – the energy value for each class
thumt.tools.softmax3d(input)

Build the computational graph of the softmax operation.

Parameters:input (theano variable) – the input variable

thumt.lrp module

class thumt.lrp.BackEncoderVal(word_num, config)

Bases: object

The class which stores the intermediate variables produced by the backward encoder in the NMT model.

Parameters:
  • word_num (int) – the length of source sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the backward encoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.DecoderVal(src_word_num, trg_word_num, config)

Bases: object

The class which stores the intermediate variables produced by the decoder in the NMT model.

Parameters:
  • src_word_num (int) – the length of source sentence
  • trg_word_num (int) – the length of target sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the decoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.EncoderVal(word_num, config)

Bases: object

The class which stores the intermediate variables produced by the forward encoder in the NMT model.

Parameters:
  • word_num (int) – the length of source sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the forward encoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.Model(src_word_num, trg_word_num, config)

Bases: object

The class which calculates the relevance.

Parameters:
  • src_word_num (int) – the length of source sentence
  • trg_word_num (int) – the length of target sentence
  • config (dict) – the configuration file
cal_back_encoder()

Calculate the relevance in the backward encoder.

Returns:4-D numpy array, the relevance between backward encoder hidden states and x
cal_decoder(R_x)

Calculate the relevance in the decoder.

Parameters:R_x (theano sharedVariable) – the relevance between the inputs and the hidden states of the encoder.
Returns:R_c_x, R_h_x, R_o_x, R_h_y, R_o_y are numpy arrays: the relevance between context and x, between decoder hidden states and x, between readout and x, between decoder hidden states and y, and between readout and y.
cal_decoder_step(decoder_val)

Calculate the weight ratios in the decoder.

Parameters:decoder_val (class) – the class which stores the intermediate variables in the decoder
Returns:R_h_h, R_h_x, R_h_y, R_outenergy_2_h, R_outenergy_2_x, R_outenergy_2_y_before are theano variables, weight ratios in decoder.
cal_encoder()

Calculate the relevance in the forward encoder.

Returns:4-D numpy array, the relevance between forward encoder hidden states and x
cal_encoder_step(encoder_val)

Calculate the weight ratios in the encoder.

Parameters:encoder_val (class) – the class which stores the intermediate variables in the encoder
Returns:R_h_x, R_h_h are theano variables, weight ratios in encoder
readData(param_filename, val_filename)

Load the parameters of the NMT model and the intermediate variables.

Parameters:
  • param_filename (string) – the file which stores the parameters of NMT models
  • val_filename (string) – the file which stores intermediate variables
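
A sketch of the relevance workflow with the classes above; file names are hypothetical, and the files are assumed to come from a THUMT visualization run:

    from thumt.lrp import Model, normalize

    lrp = Model(src_word_num=10, trg_word_num=12, config=config)
    lrp.readData('params.npz', 'vals.npz')   # parameters + intermediate variables

    R_fwd = lrp.cal_encoder()        # relevance of forward encoder states w.r.t. x
    R_bwd = lrp.cal_back_encoder()   # relevance of backward encoder states w.r.t. x

    with open('relevance.txt', 'w') as f:
        normalize(R_fwd, f)          # normalize and write the result
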
thumt.lrp.init_idx(size, name, shared=True)

Initialize word indexes in a sentence.

thumt.lrp.init_weight(size, name, shared=True)

Initialize weight matrices.

thumt.lrp.normalize(R, f)

Normalize the relevance vector and write the result to the output file.

Parameters:
  • R (4-D numpy array) – the relevance vector to be normalized
  • f (file) – output file
thumt.lrp.save(att, f)

Write the attention weights to the output file.

Parameters:
  • att (2-D numpy array) – attention information
  • f (file) – output file

Module contents