thumt package

Submodules

thumt.binmt module

class thumt.binmt.BiRNNsearch(config)

Bases: thumt.nmt.model

The bidirectional RNNsearch model used for semi-supervised training.

build()

Build the computational graph.

get_addition_grads(cost0, cost1, now_s)

Update the total cost of the bidirectional NMT model.

get_inputs_batch(xp, yp, xm, ym)

Get a batch for semi-supervised training.

Parameters:
  • xp (numpy array) – the indexed source sentences in parallel corpus
  • yp (numpy array) – the indexed target sentences in parallel corpus
  • xm (numpy array) – the indexed source sentences in monolingual corpus
  • ym (numpy array) – the indexed target sentences in monolingual corpus
is_valid(sents, eos, unk)

Validate the sentences.

Parameters:
  • sents (theano variable) – the indexed sentences
  • eos (int) – the index of end-of-sentence symbol
  • unk (int) – the index of unknown word symbol
sample(x, length, n_samples=1)

Sample with the source-to-target network.

Parameters:
  • x (numpy array) – the indexed source sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

sample_inv(x, length, n_samples=1)

Sample with the target-to-source network.

Parameters:
  • x (numpy array) – the indexed target sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

translate(x, beam_size=10)

Beam search with the source-to-target network.

Parameters:
  • x (numpy array) – the indexed source sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result

translate_inv(x, beam_size=10)

Beam search with the target-to-source network.

Parameters:
  • x (numpy array) – the indexed target sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result
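
A minimal usage sketch, assuming a valid THUMT configuration dict config and that trained parameters have been loaded via load(); the input layout (a column of word indices ending with the end-of-sentence index) is illustrative:

    import numpy
    from thumt.binmt import BiRNNsearch

    model = BiRNNsearch(config)   # config: assumed THUMT configuration dict
    model.build()

    # x: indexed source sentence, shape (length, 1), ending with the EOS index
    x = numpy.asarray([[3], [25], [7], [0]], dtype='int64')

    y = model.translate(x, beam_size=10)               # beam search, source-to-target
    samples = model.sample(x, length=50, n_samples=5)  # sampling, source-to-target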

thumt.data module

class thumt.data.DataCollection(config, train=True)

Bases: object

The data manager. It also records the training status.

Parameters:
  • config (dict) – the configuration
  • train (bool) – set to True only for training. If True, the vocabulary and corpus will be loaded, and the training status will be recorded.
encode_vocab(encoding='utf-8')

Change the encoding of the vocabulary.

index_word_target(index)

Get the target word given its index.

Parameters:index (int) – the word index
Returns:string, the corresponding word.
last_improved(last=False)
Parameters:last (bool) – if True, getting the same result as before counts as an improvement; if False, it does not.
Returns:int, the number of iterations since the last improvement.
load_data()

Load the training corpus.

load_data_mono()

Load the monolingual training corpus. Only used in semi-supervised training.

load_status(path)

Load the training status from file.

Parameters:path (string) – the path to a file
load_vocab()

Load the vocabulary.

next()

Get the next batch of the training corpus.

Returns:x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
next_mono()

Get the next batch of monolingual training corpus. Only used in semi-supervised training.

Returns:x, y are 2-D numpy arrays, each row contains an indexed source/target sentence
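
A sketch of how these batch methods are typically driven in a training loop, assuming config is a valid configuration dict (getbatch is documented below):

    from thumt.data import DataCollection, getbatch

    data = DataCollection(config, train=True)
    data.load_vocab()
    data.load_data()

    for step in range(max_steps):          # max_steps: assumed training budget
        lx, ly = data.next()               # indexed batch, one sentence per row
        batch = getbatch(lx, ly, config)   # padded batch (see getbatch below)
        # ... pass the batch to the model's update function
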
print_sentence(sentence, vocab, index_eos)

Get the text form of a sentence represented by an index vector.

Parameters:
  • sentence (numpy array) – indexed sentence. size:(length, 1)
  • vocab (list) – vocabulary
  • index_eos (int) – the index of the end-of-sentence symbol
Returns:

string, the text form of the sentence

print_source(sentence)

Print a source sentence represented by an index vector.

Parameters:sentence (numpy array) – indexed sentence. size:(length, 1)
Returns:string, the text form of the source sentence
print_target(sentence)

Print a target sentence represented by an index vector.

Parameters:sentence (numpy array) – indexed sentence. size:(length, 1)
Returns:string, the text form of the target sentence
save_status(path)

Save the training status to file.

Parameters:path (string) – the path to a file
toindex(sentence, ivocab, index_unk, index_eos)

Transform a sentence text to indexed sentence.

Parameters:
  • sentence (string) – sentence text
  • ivocab (dict) – the vocabulary to index
  • index_unk (int) – the index of unknown word symbol
  • index_eos (int) – the index of end-of-sentence symbol
Returns:

numpy array, the indexed sentence

toindex_source(sentence)

Transform a source language word list to index list.

Parameters:sentence (string) – sentence text
Returns:numpy array, the indexed source sentence
toindex_target(sentence)

Transform a target language word list to index list.

Parameters:sentence (string) – sentence text
Returns:numpy array, the indexed target sentence
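
For illustration, a round trip between text and index space with the methods above, assuming the vocabulary has been loaded:

    src = data.toindex_source('the cat sat .')   # text -> indexed numpy array
    text = data.print_source(src)                # indexed array -> text
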
thumt.data.getbatch(lx, ly, config)

Get a batch for training.

Parameters:
  • lx (numpy array) – a 2-D numpy array, each row contains an indexed source sentence
  • ly (numpy array) – a 2-D numpy array, each row contains an indexed target sentence
  • config (dict) – the configuration

thumt.layer module

class thumt.layer.FeedForwardLayer(name, dim_in, dim_out, active=<theano.tensor.elemwise.Elemwise object>, offset=False)

Bases: thumt.layer.Layer

A single-layer feed-forward neural network.

Parameters:
  • dim_in (int) – the dimension of input vectors
  • dim_out (int) – the dimension of output vectors
  • active (function) – the activation function
  • offset (bool) – if true, this layer will contain bias
forward(state_in)

Build the computational graph.

Parameters:state_in (theano variable) – the input state
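
A sketch of the call shape, assuming Theano is installed. Inside THUMT, layers are normally created through thumt.layer.LayerFactory, so direct construction here is only illustrative:

    import theano
    import theano.tensor as T
    from thumt.layer import FeedForwardLayer

    x = T.matrix('x')                       # input states, one row per example
    ff = FeedForwardLayer('ff', dim_in=512, dim_out=256,
                          active=T.tanh, offset=True)
    y = ff.forward(x)                       # symbolic output state
    f = theano.function([x], y)             # compiled forward pass
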
class thumt.layer.GatedRecurrentLayer(name, dim_in, dim, active=<theano.tensor.elemwise.Elemwise object>, verbose=False)

Bases: thumt.layer.Layer

The gated recurrent layer used to encode the source sentences.

Parameters:
  • dim_in (int) – the number of input units
  • dim (int) – the number of hidden state units
  • active (function) – the activation function
  • verbose (bool) – only set to True on visualization
forward(emb_in, length, state_init=None, batch_size=1, mask=None)

Build the computational graph which computes the hidden states.

Parameters:
  • emb_in (theano variable) – the input word embeddings
  • length (theano variable) – the length of the input
  • batch_size (int) – the batch size
  • mask (theano variable) – indicate the length of each sequence in one batch
forward_step(state_before, state_in, gate_in, reset_in, mask=None)

Build the one-step computational graph which computes the next hidden state.

Parameters:
  • state_before (theano variable) – The previous hidden state
  • state_in (theano variable) – the input state
  • gate_in (theano variable) – the input to update gate
  • reset_in (theano variable) – the input to reset gate
  • mask (theano variable) – indicate the length of each sequence in one batch
class thumt.layer.GatedRecurrentLayer_attention(name, dim_in, dim_c, dim, dim_class, active=<theano.tensor.elemwise.Elemwise object>, maxout=2, verbose=False)

Bases: thumt.layer.Layer

The gated recurrent layer with attention mechanism, used as decoder.

Parameters:
  • dim_in (int) – the number of input units
  • dim_c (int) – the number of context units
  • dim (int) – the number of hidden units
  • dim_class (int) – the number of target vocabulary
  • active (function) – the activation function
  • maxout (int) – the number of maxout parts
  • verbose (bool) – only set to True on visualization
decode_next(c, state, emb)

Get the next hidden state. Used in beam search and sampling.

Parameters:
  • c (theano variable) – the current context reading
  • state (theano variable) – the last hidden state
  • emb (theano variable) – the embedding of the last generated word
decode_probs(context, state, emb)

Get the probability of the next word. Used in beam search and sampling.

Parameters:
  • context (theano variable) – the context vectors
  • state (theano variable) – the last hidden state
  • emb (theano variable) – the embedding of the last generated word
forward(emb_in, length, context, state_init, batch_size=1, mask=None, cmask=None)

Build the computational graph which computes the hidden states.

Parameters:
  • emb_in (theano variable) – the input word embeddings
  • length (theano variable) – the length of the input
  • context (theano variable) – the context vectors
  • state_init (theano variable) – the initial states
  • batch_size (int) – the batch size
  • mask (theano variable) – indicate the length of each sequence in one batch
  • cmask (theano variable) – indicate the length of each context sequence in one batch
forward_step(state_before, state_in, gate_in, reset_in, context, att_c, mask=None, cmask=None)

Build the one-step computational graph which calculates the next hidden state.

Parameters:
  • state_before (theano variable) – The previous hidden state
  • state_in (theano variable) – the input state
  • gate_in (theano variable) – the input to update gate
  • reset_in (theano variable) – the input to reset gate
  • mask (theano variable) – indicate the length of each sequence in one batch
  • cmask (theano variable) – indicate the length of each context sequence in one batch
  • context (theano variable) – the context vectors
  • att_c (theano variable) – the attention vector from context
class thumt.layer.Layer

Bases: object

The parent class of neural network layers.

class thumt.layer.LayerFactory

Bases: object

The factory to build and monitor all neural network layers.

class thumt.layer.LookupTable(name, num, dim_embed, offset=True)

Bases: thumt.layer.Layer

A lookup table layer which stores the word embeddings.

Parameters:
  • num (int) – the number of words
  • dim_embed (int) – the dimension of the word embedding
  • offset (bool) – if true, this layer will contain bias
forward(index)

Build the computational graph.

Parameters:index (theano variable) – the input word indices
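
A sketch of an embedding lookup with this layer; dimensions are illustrative, and as above, layers are normally built through LayerFactory:

    import theano.tensor as T
    from thumt.layer import LookupTable

    idx = T.imatrix('idx')                        # word indices, (length, batch)
    table = LookupTable('src_emb', num=30000, dim_embed=620)
    emb = table.forward(idx)                      # symbolic word embeddings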

thumt.mrt_utils module

thumt.mrt_utils.calBleu(x, ref_dict, lens, ngram)

Calculate the BLEU score with a single reference.

Parameters:
  • x (list) – the indexed hypothesis sentence
  • ref_dict (dict) – the n-gram count generated by getRefDict()
  • lens (int) – the length of the reference
  • ngram (int) – maximum length of counted n-grams
thumt.mrt_utils.cutSen(x, config)

Cut the part after the end-of-sentence symbol.

Parameters:
  • x (list) – indexed sentence
  • config (dict) – the configuration
thumt.mrt_utils.getMRTBatch(x, xmask, y, ymask, config, model, data)

Get a batch for MRT training.

Parameters:
  • x (numpy array) – the indexed source sentence
  • xmask (numpy array) – indicate the length of each sequence in source sequence
  • y (numpy array) – the indexed target sentence
  • ymask (numpy array) – indicate the length of each sequence in target sequence
  • config (dict) – the configuration
  • model (Model) – the NMT model
  • data (DataCollection) – the data manager
thumt.mrt_utils.getRefDict(words, ngram)

Get the count of n-grams in the reference.

Parameters:
  • words (list) – indexed sentence
  • ngram (int) – maximum length of counted n-grams
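
getRefDict() and calBleu() (above) are used together: build the reference n-gram counts once, then score each hypothesis against them. A sketch with illustrative index lists:

    from thumt.mrt_utils import getRefDict, calBleu

    ref = [5, 17, 9, 42, 3]                      # indexed reference sentence
    hyp = [5, 17, 8, 42, 3]                      # indexed hypothesis

    ref_dict = getRefDict(ref, 4)                # counts of 1- to 4-grams
    score = calBleu(hyp, ref_dict, len(ref), 4)  # sentence-level BLEU
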
thumt.mrt_utils.getUnique(samples, y, config, model, data)

Remove repeated sentences from sampling results. Then calculate BLEU score for each sentence.

Parameters:
  • samples (numpy array) – the sampling results
  • y (numpy array) – the indexed target sentence
  • config (dict) – the configuration
  • model (Model) – the NMT model
  • data (DataCollection) – the data manager
thumt.mrt_utils.getYM(y, config)

Get masks which indicate the lengths of the target sentences.

Parameters:
  • y (list) – the indexed sentences
  • config (dict) – the configuration

thumt.nmt module

class thumt.nmt.RNNsearch(config, name='')

Bases: thumt.nmt.model

The attention-based NMT model.

build(verbose=False)

Build the computational graph.

Parameters:verbose (bool) – only set to True on visualization
decode_sample(state_init, c, length, n_samples)

Build the decoder graph for sampling.

Parameters:
  • state_init (theano variable) – the initial state of the decoder
  • c (theano variable) – the context vectors
  • length (int) – the length limit of samples
  • n_samples (int) – the number of samples
encode(x)

Encode source sentence to context vector.

get_attention(x, xmask, y, ymask)

Get the attention weight of parallel sentences.

get_context_and_init(x)

Encode source sentence to context vectors and get the initial decoder hidden state.

get_cost(x, xmask, y, ymask)

Get the negative log-likelihood of parallel sentences.

get_init(c)

Get the initial decoder hidden state with context vector.

get_layer(x, xmask, y, ymask)

Get the hidden states essential for visualization.

get_next(c, state, emb)

Get the next hidden state.

get_probs(c, state, emb)

Get the probability of the next target word.

get_sample(x, length, n_samples)

Get sampling results.

get_trg_embedding(y)

Get the embedding of target sentence.

sampling_step(state, prev, context)

Build the computational graph which samples the next word.

Parameters:
  • state (theano variable) – the previous hidden state
  • prev (theano variable) – the last generated word
  • context (theano variable) – the context vectors
class thumt.nmt.model

Bases: object

The parent class of NMT models.

load(path, decode=False)

Load the model in npz format. Loading starts from the checkpoint model; if the checkpoint model does not exist, it will initialize a new model (MLE) or load from the given model (MRT or semi-supervised training).

Parameters:
  • path (string) – the path to a file
  • decode (bool) – Set to True only on decoding
sample(x, length, n_samples=1)

Sample according to the model's probability distribution.

Parameters:
  • x (numpy array) – the indexed source sentence
  • length (int) – the length limit of samples
  • n_samples (int) – number of samples
Returns:

a numpy array, the indexed sample results

save(path, data=None, mapping=None)

Save the model in npz format.

Parameters:
  • path (string) – the path to a file
  • data (DataCollection) – the data manager, will save the vocabulary into the model if set.
  • mapping (dict) – the mapping file used in UNKreplace, will save it to the model if set
translate(x, beam_size=10, return_array=False)

Decode with beam search.

Parameters:
  • x (numpy array) – the indexed source sentence
  • beam_size (int) – beam size
Returns:

a numpy array, the indexed translation result
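
Putting the pieces together, a sketch of decoding with a trained model. The model path is hypothetical and the ordering of load() and build() is only indicative of what the toolkit's own scripts do:

    from thumt.nmt import RNNsearch
    from thumt.data import DataCollection

    data = DataCollection(config, train=False)
    data.load_vocab()

    model = RNNsearch(config)
    model.load('model.npz', decode=True)   # 'model.npz': hypothetical path
    model.build()

    x = data.toindex_source('the cat sat .')
    y = model.translate(x, beam_size=10)
    print(data.print_target(y))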

thumt.preprocess module

thumt.preprocess.preprocess(num, vocab_f, ivocab_f, input_f, output_f, data_vocab='cPickle', data_corpus='json', withdict=False, fromgr=False)

Count the most frequent words in the training corpus. Then, generate the vocabulary file and index the corpus.

Parameters:
  • vocab_f (string) – the path to vocabulary file
  • ivocab_f (string) – the path to vocabulary to index file
  • input_f (string) – the path to corpus (text) file
  • output_f (string) – the path to indexed corpus file
  • withdict (bool) – if set to True, the vocabulary will be loaded from an existing file instead.
thumt.preprocess.shuffle(src, trg, src_shuf, trg_shuf, data_corpus='json')

Randomly shuffle the parallel corpus.

Parameters:
  • src (string) – the path to indexed source corpus file
  • trg (string) – the path to indexed target corpus file
  • src_shuf (string) – the path to shuffled indexed source corpus file
  • trg_shuf (string) – the path to shuffled indexed target corpus file
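
A sketch of the usual preprocessing sequence: index both sides of a parallel corpus with a 30,000-word vocabulary, then shuffle it. All file names are hypothetical:

    from thumt.preprocess import preprocess, shuffle

    preprocess(30000, 'vocab.zh.pkl', 'ivocab.zh.pkl',
               'train.zh', 'train.zh.json')
    preprocess(30000, 'vocab.en.pkl', 'ivocab.en.pkl',
               'train.en', 'train.en.json')
    shuffle('train.zh.json', 'train.en.json',
            'train.zh.shuf.json', 'train.en.shuf.json')
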
thumt.preprocess.shuffle_mono(src, src_shuf, data_corpus='json')

Randomly shuffle the monolingual corpus.

Parameters:
  • src (string) – the path to indexed corpus file
  • src_shuf (string) – the path to the shuffled indexed corpus file

thumt.tools module

thumt.tools.bleu(hypo_c, refs_c, n)

Calculate BLEU score given translation and references.

Parameters:
  • hypo_c (string) – the translations
  • refs_c (list) – the list of references
  • n (int) – maximum length of counted n-grams
thumt.tools.bleu_file(hypo, refs)

Calculate the BLEU score given translation files and reference files.

Parameters:
  • hypo (string) – the path to translation file
  • refs (list) – the list of path to reference files
thumt.tools.clip(grads, threshold, square=True, params=None)

Build the computational graph that clips the gradient if the norm of the gradient exceeds the threshold.

Parameters:
  • grads (theano variable) – the gradient to be clipped
  • threshold (float) – the threshold of the norm of the gradient
Returns:

theano variable. The clipped gradient.

thumt.tools.cut_sentence(sentence, index_eos)

Cut the sentence after the end-of-sentence symbol.

Parameters:
  • sentence (numpy array) – the indexed sentence
  • index_eos (int) – the index of end-of-sentence symbol
thumt.tools.dot3d(input, weight)

Build the computational graph of the 3-D matrix multiplication operation.

Parameters:
  • input (theano variable) – the input variable
  • weight (theano variable) – the weight parameter
thumt.tools.duplicate(input, times)

Broadcast a 2-D tensor a given number of times along axis 1.

Parameters:
  • input (theano variable) – the input variable
  • times (int) – the number of times to duplicate
thumt.tools.get_ref_files(ref)

Get the list of reference files by prefix. Suppose nist02.en0, nist02.en1, nist02.en2, nist02.en3 are references and nist02.en does not exist; then get_ref_files("nist02.en") = ["nist02.en0", "nist02.en1", "nist02.en2", "nist02.en3"]

Parameters:ref (string) – the prefix of reference files
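
A sketch of scoring a translation file against references located by prefix; file names are hypothetical:

    from thumt.tools import get_ref_files, bleu_file

    refs = get_ref_files('nist02.en')     # e.g. ['nist02.en0', ..., 'nist02.en3']
    score = bleu_file('trans.txt', refs)  # corpus-level BLEU
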
thumt.tools.init_bias(size, name, scale=0.01, shared=True)

Initialize a bias parameter in neural networks.

Parameters:
  • size (tuple or list) – the size of the parameter
  • name (string) – the name of the parameter
  • scale (float) – the scale of the parameter
thumt.tools.init_weight(size, name, scale=0.01, shared=True)

Randomly initialize weight parameter in neural networks.

Parameters:
  • size (tuple or list) – the size of the parameter
  • name (string) – the name of the parameter
  • scale (float) – the scale of the parameter
thumt.tools.init_zeros(size, shared=True)

Initialize a zero matrix.

Parameters:size (tuple or list) – the size of the matrix
thumt.tools.maxout(input, max_num=2)

Build the computational graph of maxout operation.

Parameters:
  • input (theano variable) – the input variable
  • max_num (int) – the number of maxout parts
thumt.tools.merge_dict(d1, d2)

Merge two dicts. The count of each item is the maximum of its counts in the two dicts.

thumt.tools.padzero(input)

Build the computational graph that pads zeros to the left of the input.

Parameters:input (theano variable) – the input variable
thumt.tools.print_time(time)
Parameters:time (float) – the number of seconds
Returns:string, the text format of time
thumt.tools.sentence2dict(sentence, n)

Count the number of n-grams in a sentence.

Parameters:
  • sentence (string) – sentence text
  • n (int) – maximum length of counted n-grams
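
sentence2dict() combines naturally with merge_dict() (above) for multi-reference BLEU: take the per-reference n-gram counts and keep the maximum count of each n-gram. A sketch:

    from thumt.tools import sentence2dict, merge_dict

    d1 = sentence2dict('the cat sat on the mat', 4)
    d2 = sentence2dict('a cat sat on a mat', 4)
    ref_counts = merge_dict(d1, d2)   # max count per n-gram over both references
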
thumt.tools.shift_one(input)

Add a zero vector to the left side of input and remove the rightmost vector.

Parameters:input (theano variable) – the input variable
thumt.tools.softmax(energy, axis=1)

The softmax operation.

Parameters:energy (theano variable) – the energy value for each class
thumt.tools.softmax3d(input)

Build the computational graph of the softmax operation.

Parameters:input (theano variable) – the input variable

thumt.lrp module

class thumt.lrp.BackEncoderVal(word_num, config)

Bases: object

The class which stores the intermediate variables produced by the backward encoder in the NMT model.

Parameters:
  • word_num (int) – the length of source sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the backward encoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.DecoderVal(src_word_num, trg_word_num, config)

Bases: object

The class which stores the intermediate variables produced by the decoder in the NMT model.

Parameters:
  • src_word_num (int) – the length of source sentence
  • trg_word_num (int) – the length of target sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the decoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.EncoderVal(word_num, config)

Bases: object

The class which stores the intermediate variables produced by the forward encoder in the NMT model.

Parameters:
  • word_num (int) – the length of source sentence
  • config (dict) – the configuration file
readData(filename)

Load the intermediate variables produced by the forward encoder in the NMT model.

Parameters:filename (string) – the file which stores intermediate variables
class thumt.lrp.Model(src_word_num, trg_word_num, config)

Bases: object

The class which calculates the relevance.

Parameters:
  • src_word_num (int) – the length of source sentence
  • trg_word_num (int) – the length of target sentence
  • config (dict) – the configuration file
cal_back_encoder()

Calculate the relevance in the backward encoder.

Returns:4-D numpy array, the relevance between backward encoder hidden states and x
cal_decoder(R_x)

Calculate the relevance in the decoder.

Parameters:R_x (theano sharedVariable) – the relevance between the inputs and the hidden states of the encoder.
Returns:R_c_x, R_h_x, R_o_x, R_h_y, R_o_y are numpy arrays: the relevance between context and x, between decoder hidden states and x, between readout and x, between decoder hidden states and y, and between readout and y.
cal_decoder_step(decoder_val)

Calculate the weight ratios in the decoder.

Parameters:decoder_val (class) – the class which stores the intermediate variables in the decoder
Returns:R_h_h, R_h_x, R_h_y, R_outenergy_2_h, R_outenergy_2_x, R_outenergy_2_y_before are theano variables, weight ratios in decoder.
cal_encoder()

Calculate the relevance in the forward encoder.

Returns:4-D numpy array, the relevance between forward encoder hidden states and x
cal_encoder_step(encoder_val)

Calculate the weight ratios in the encoder.

Parameters:encoder_val (class) – the class which stores the intermediate variables in the encoder
Returns:R_h_x, R_h_h are theano variables, weight ratios in encoder
readData(param_filename, val_filename)

Load the parameters of the NMT model and the intermediate variables.

Parameters:
  • param_filename (string) – the file which stores the parameters of NMT models
  • val_filename (string) – the file which stores intermediate variables
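
A sketch of the relevance workflow with the classes above; file names are hypothetical, and the files are assumed to come from a THUMT visualization run:

    from thumt.lrp import Model, normalize

    lrp = Model(src_word_num=10, trg_word_num=12, config=config)
    lrp.readData('params.npz', 'vals.npz')   # parameters + intermediate variables

    R_fwd = lrp.cal_encoder()        # relevance of forward encoder states w.r.t. x
    R_bwd = lrp.cal_back_encoder()   # relevance of backward encoder states w.r.t. x

    with open('relevance.txt', 'w') as f:
        normalize(R_fwd, f)          # normalize and write the result
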
thumt.lrp.init_idx(size, name, shared=True)

Initialize word indexes in a sentence.

thumt.lrp.init_weight(size, name, shared=True)

Initialize weight matrices.

thumt.lrp.normalize(R, f)

Normalize the relevance vector and write the result to the output file.

Parameters:
  • R (4-D numpy array) – the relevance vector to be normalized
  • f (file) – output file
thumt.lrp.save(att, f)

Write the attention weights to the output file.

Parameters:
  • att (2-D numpy array) – attention information
  • f (file) – output file

Module contents