Our earlier claim about Tribe Dynamics' baseline models was that they did not carry knowledge learned in one language over to another. The aligned word embeddings we saw earlier achieved this transfer by pre-training and aligning the embeddings, so that whatever the classifier learns in English is encoded for Italian as well.
This model attempts to transfer knowledge by building on a fundamental question: why not translate all the foreign text into English and use our base classifier's core competency in English?
Translation at the sentence level requires a neural or statistical machine translation system, which carries operational costs. Since our baseline classifier uses the bag-of-words assumption, which discards context within a sentence anyway, we resorted to word-level translations between Italian (our proof-of-concept language) and English. Word-level translations, however, sometimes suffer from ambiguity. For instance, the word "Frank" in English could be either a proper name or the adjective. Similarly, in our dataset for the brand "Dove", the Italian word "dove" could mean either "where" or the brand "Dove". So we first need a model that learns probability scores over these plausible word-level translation alternatives.
This model needs a good bilingual lexicon with multiple possible translations for each word in our vocabularies, and the quality of the lexicon and of the word tokenization is key to the performance of this model. There are two possible sources for this lexicon: directly downloadable files, where they exist, or Python's vocabulary package, whose translation module wraps a machine translation query and returns multiple translations of a given word. We construct lexicons translating both ways as two Python dictionaries.
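As a rough sketch of this construction step, the snippet below builds the two lexicon dictionaries from a tab-separated file of word pairs; the file name, format, and helper function are illustrative assumptions rather than our actual pipeline.

```python
from collections import defaultdict

def build_lexicons(pair_file):
    """Build Italian->English and English->Italian lexicons from a
    tab-separated file with one 'italian<TAB>english' pair per line.
    The file name and format here are assumptions for illustration."""
    it_to_en = defaultdict(list)
    en_to_it = defaultdict(list)
    with open(pair_file, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # skip malformed lines
            it_word, en_word = parts[0].lower(), parts[1].lower()
            if en_word not in it_to_en[it_word]:
                it_to_en[it_word].append(en_word)
            if it_word not in en_to_it[en_word]:
                en_to_it[en_word].append(it_word)
    return dict(it_to_en), dict(en_to_it)

# Example: the ambiguous word "dove" keeps multiple candidate translations.
# it_to_en, en_to_it = build_lexicons("it_en_pairs.tsv")  # hypothetical file
# it_to_en.get("dove")  # e.g. ["where", "dove"]
```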
Now consider two phrases in Italian: "Amo Dove" (Love Dove, the brand) and "Dove sei" (Where are you?). If the model has no additional information, in both cases it is logical to translate "dove" into "dove" and "where" with 50% chance each. But if we know whether or not the phrase is talking about the brand, the translation probabilities change drastically. So the class variable $C$ (brand or not) carries useful information that will help the model translate better. We want to capture this idea using a simple directed graphical model. In this approach, we use a generative model that generates a new document $d$ in our target language (Italian). The document $d$ can simply be viewed as a bag of target words $w$ under our naive assumptions. Similarly, let us call the corresponding source document $d'$, a bag of source words $w'$. The graphical model can be seen in the following figure:
We can write the following probabilities for inference directly from the graphical model:

$$P(w, w', C) = P(w \mid w', C)\, P(w', C), \qquad P(w) = \sum_{C} \sum_{w'} P(w \mid w', C)\, P(w', C)$$
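To make the generative story concrete, here is a tiny ancestral-sampling sketch; the three probability tables (class prior, class-conditional source word distribution, and translation distribution) are assumed to be plain Python dictionaries, and all names are illustrative.

```python
import random

def sample_target_word(p_class, p_src_given_class, p_tgt_given_src_class):
    """Ancestral sampling from the generative story:
    class C -> source word w' -> target word w.
    All three probability tables are illustrative placeholders."""
    classes = list(p_class)
    C = random.choices(classes, weights=[p_class[c] for c in classes])[0]
    src_dist = p_src_given_class[C]
    w_src = random.choices(list(src_dist), weights=list(src_dist.values()))[0]
    tgt_dist = p_tgt_given_src_class[(w_src, C)]
    w_tgt = random.choices(list(tgt_dist), weights=list(tgt_dist.values()))[0]
    return C, w_src, w_tgt

# e.g. p_tgt_given_src_class[("dove", "brand")] might put most of its mass
# on the Italian word "dove", while ("where", "not_brand") would as well.
```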
In this approach, we then use Expectation Maximization to perform latent variable inference, training an iterative algorithm that oscillates between two steps:

- Expectation:

$$P(w', C \mid w) = \frac{P(w', C, w)}{P(w)} = \frac{P(w \mid w', C)\, P(w', C)}{\sum_{C} \sum_{w'} P(w \mid w', C)\, P(w', C)}$$

- Maximization:

$$P(w \mid w', C) = \frac{f(w)\, P(w', C \mid w)}{\sum_{w \in K} f(w)\, P(w', C \mid w)}$$

where $f(w)$ is the count of target word $w$ in the unlabeled target corpus and $K$ is the target vocabulary.

Once this iterative algorithm reaches convergence, we have learned the much-needed conditional word translation probabilities $P(w \mid w', C)$, i.e. given a word in the source language and the class, the probability of that source word generating each word in the target language, over the entire target vocabulary.
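To make these two steps concrete, here is a minimal NumPy sketch of the EM loop under a few assumptions of our own: target-word counts come from the unlabeled Italian corpus, $P(w', C)$ is held fixed (e.g. estimated from the labeled English data), and the lexicon restricts which (w, w') pairs are allowed. Array names and shapes are illustrative.

```python
import numpy as np

def em_translation_probs(f_w, joint_src_class, lexicon_mask, n_iter=50, seed=0):
    """Learn P(target word w | source word w', class C) via EM.

    f_w:             (T,) counts of each target word in the unlabeled corpus
    joint_src_class: (S, 2) assumed fixed estimate of P(w', C), summing to 1
    lexicon_mask:    (T, S) 1 where the lexicon allows w' as a translation of w
    Returns trans:   (T, S, 2) with trans[:, s, c] summing to 1 over target words.
    """
    rng = np.random.default_rng(seed)
    # Random initialization restricted to lexicon-allowed pairs.
    trans = rng.random(lexicon_mask.shape + (2,)) * lexicon_mask[:, :, None]
    trans /= trans.sum(axis=0, keepdims=True) + 1e-12

    for _ in range(n_iter):
        # E-step: P(w', C | w) proportional to P(w | w', C) * P(w', C)
        post = trans * joint_src_class[None, :, :]
        post /= post.sum(axis=(1, 2), keepdims=True) + 1e-12
        # M-step: P(w | w', C) proportional to f(w) * P(w', C | w),
        # normalized over the target vocabulary
        trans = f_w[:, None, None] * post
        trans /= trans.sum(axis=0, keepdims=True) + 1e-12
    return trans
```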
Our next step is to transfer the model from one language to another. For the sake of simplicity, we use a Max Entropy Classifier/Softmax Regression, which degenerates to logistic regression for binary classification tasks such as ours. We can formally define the classifier over a source (English) document $d'$ as:

$$P(C \mid d') = \frac{\exp\big(\sum_{w'} f(w', d')\, \lambda_{w', C}\big)}{\sum_{C'} \exp\big(\sum_{w'} f(w', d')\, \lambda_{w', C'}\big)}$$

where $f(w', d')$ is the count of word $w'$ in $d'$ and $\lambda_{w', C}$ are the learned feature weights.
Under our bag-of-words assumption, our model transfer can be derived as the following equation (for a slightly more detailed derivation, check the paper "Cross Language Text Classification by Model Translation and Semi-Supervised Learning"):

$$C^{*} = \arg\max_{C} \prod_{w \in d} \sum_{w'} P(C \mid w')\, P(w \mid w', C)$$
We have a simple expression that is easy to maximize over our 2 classes. The term being maximized relies on the bag-of-words assumption (leading to the product over each word of the document) and has a clean interpretation: it is the classification score of each translation of a word, weighed by the conditional translation probability that we learned via Expectation Maximization. Thus, we learn a simple inference model that lets us perform model translation via a mixture of word translations.
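As a sketch of this transfer step, assume the English classifier can be queried for per-word class scores $P(C \mid w')$, packed here into an array `p_class_given_src` (an assumed representation rather than the actual model object), and reuse the translation table from the EM sketch above. An Italian bag-of-words document can then be scored as follows; the product over words is computed in log space for numerical stability.

```python
import numpy as np

def classify_translated(doc_counts, trans, p_class_given_src):
    """Score an Italian bag-of-words document with the English classifier
    through the mixture of word translations.

    doc_counts:        (T,) counts of target words in the document
    trans:             (T, S, 2) learned P(w | w', C)
    p_class_given_src: (S, 2) per-word class scores P(C | w') from the
                       English classifier (an assumed per-word view)
    Returns the argmax class and the per-class log-scores.
    """
    # For each target word w and class C: sum over translations w' of
    # P(C | w') weighted by the learned translation probability P(w | w', C).
    word_scores = np.einsum("tsc,sc->tc", trans, p_class_given_src)  # (T, 2)
    # Bag-of-words: product over the document's words -> sum of logs.
    log_scores = doc_counts @ np.log(word_scores + 1e-12)            # (2,)
    return int(np.argmax(log_scores)), log_scores
```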
In the results figure: B1, B2, B3 denote baselines trained with 25%, 50% and 100% of the labeled data in Italian; MM denotes the Mixture Model trained without labeled Italian data.
We can see that our Mixture of Word Translations model, trained with the full English data and very little labeled data in Italian (used only for supervision), performs competitively against the baselines, which struggle comparatively when given less labeled data. The model requires only unlabeled data in the target language to learn translation probabilities, and such data is cheap to scrape from social media. We note that our lexicon is still rudimentary and definitely has a lot of scope for improvement. Thus, we have been able to train a model that is data-efficient and transfers knowledge to leverage our baseline model's core capabilities in English.
Now let us discuss the merits and demerits of this model.
- It requires only a bilingual lexicon for training.
- It builds and maintains only one single classifier on English vocabulary, thus preventing the problem of vocabulary explosion when many languages are in play.
- It is interpretable and adaptable to any classifier.
- It saves data costs significantly, as you do not need labeled data in the target language except where you want further semi-supervised training.
The potential concerns (along with possible extensions) with this model include:
- Training the EM algorithm is time consuming: each iteration scales with the product of the source and target vocabulary sizes (and the number of classes), i.e. roughly quadratically in the vocabulary size. We can use sampling-based inference or approximate inference such as variational inference for quick and efficient inference in this setting.
- Generating a quality lexicon is not easy, because we do not have access to clean, high-quality bilingual data. However, a few common languages have lexicons from other sources that are reliable to an extent, and users can manually fine-tune the lexicon further. Having said that, the lexicon is a one-time cost for each new language, and the model works seamlessly once a good lexicon is in place.