On Disambiguation of Word Meaning in Translation

A Probabilistic Topic Model approach to Improve Text Translation Software

Abstract

The probabilistic topic model is a powerful tool with possible uses in a wide variety of applications, ranging from giving robots the ability to infer the meaning of new words in text or speech recognition, to categorizing arbitrary articles from the internet for a Wikipedia-style “Related topics” section.

In this project I demonstrated a method that uses the topic model to improve the performance of language translation software. In my experience with online translation software, it can produce some funny translations when it gets confused about the context in which a certain word is used. This problem arises because the machine has no understanding of the topic content of the document. I assume that, when in doubt, it simply picks the most popular meaning of a word, even though that meaning may not be appropriate in the given context.

If the machine were given the ability to infer the topics from which words originate, it could determine which topic is most common within a given text and then highlight the words that most likely originate from a different topic.

A Motivational Example

Here is a sentence in German, obtained from Wikipedia, whose topic is thermodynamics and which mentions the theoretical physicist Josiah Willard Gibbs. Gibbs’s middle name, Willard, is mistakenly translated as “wanting pool of broadcasting corporations”. The translation was obtained using the Babelfish translation service.

German:

“Ihre heutige mathematische Struktur erhielt die Thermodynamik durch die Arbeiten von Josiah Willard Gibbs, der als Erster die Bedeutung der Fundamentalgleichung erkannt und ihre Eigenschaften formuliert hat.”

English translation:

“Their current mathematical structure received thermodynamics by the work from Josiah wanting pool of broadcasting corporations Gibbs, which as the first recognized the meaning of the fundamental equation and formulated its characteristics.”

A probabilistic topic model might be able to recognize that the words “broadcasting”, “corporations” and maybe even “pool” are not frequently used within the topic of physics, and could therefore highlight them to warn the user that they might be the result of a mistranslation.

Probabilistic Topic Models

The general idea is that there is a generative process which produces all words in every document according to specific distributions. Each document has a distribution over topics, and when any word in that document is produced, a topic is first selected from the document’s topic distribution and then a word is drawn from that topic’s word distribution.
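
To make this generative process concrete, here is a minimal Python sketch, assuming LDA-style symmetric Dirichlet priors over both the document-topic and topic-word distributions; the corpus sizes, hyperparameter values, and function name are illustrative only and not taken from the project.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not project values).
n_topics, vocab_size, doc_length = 3, 1000, 200
alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters

# Each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * vocab_size, size=n_topics)

def generate_document():
    """Generate one document by the process described above."""
    # The document gets its own distribution over topics.
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(doc_length):
        z = rng.choice(n_topics, p=doc_topics)       # first pick a topic
        w = rng.choice(vocab_size, p=topic_word[z])  # then a word from that topic
        words.append(w)
    return words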

The posterior probability of assigning a given topic to a single word, given the assignment of topics to all other words, is given (up to a constant of proportionality) by

P(z_i = j \mid z_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},

where n^{(w_i)}_{-i,j} is the number of times word w_i has been assigned to topic j (excluding the current position), n^{(d_i)}_{-i,j} is the number of words in document d_i assigned to topic j, W is the vocabulary size, T is the number of topics, and \alpha and \beta are the Dirichlet hyperparameters.

Bayesian inference of hyperparameters
Given a corpus of documents, the problem is to infer the topic distribution for each document and the word distribution for each topic. Steyvers and Griffiths provide the posterior probability of a topic assignment to a word, given the assignments of all other words, up to a constant of proportionality. Using this posterior probability function and Gibbs sampling, one can infer the topic and word distributions for fairly large corpora in a reasonable amount of time.
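
The following Python sketch shows what such a collapsed Gibbs sampler could look like, assuming symmetric hyperparameters alpha and beta and documents represented as lists of integer word ids; the function name and default values are my own illustrative choices, not part of the original project.

import numpy as np

def gibbs_sample(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=500, seed=0):
    """Collapsed Gibbs sampling for a topic model.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns the final topic assignment for every word position.
    """
    rng = np.random.default_rng(seed)

    # Count matrices: word-topic counts, document-topic counts, topic totals.
    n_wt = np.zeros((vocab_size, n_topics))
    n_dt = np.zeros((len(docs), n_topics))
    n_t = np.zeros(n_topics)

    # Random initial assignment of a topic to every word token.
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_wt[w, t] += 1
            n_dt[d, t] += 1
            n_t[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts.
                n_wt[w, t] -= 1
                n_dt[d, t] -= 1
                n_t[t] -= 1

                # Posterior over topics for this word, up to a constant
                # of proportionality (the constant document-length term is dropped).
                p = (n_wt[w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                p /= p.sum()

                # Resample the topic and restore the counts.
                t = rng.choice(n_topics, p=p)
                z[d][i] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
                n_t[t] += 1
    return z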

Implementation

A probabilistic topic model was trained on a corpus of 60 Wikipedia articles: 20 from the topic of Business, 20 from Biology, and 20 from Machine Learning. When a test document is introduced (this would be the text produced by a translation machine), the topic model is used to estimate the topic distribution for every word in the document. The topic that is most prominent across the words is chosen as the document’s true topic. The algorithm then searches for the words in the document that deviate most from the chosen topic and underlines them for the user to investigate further. A sketch of this step is given below.
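
As one possible realization of this step, here is a small Python sketch. It assumes a mapping word_topic_probs from each known word to its estimated probability distribution over topics, obtained from the trained model; the function name, the voting scheme, and the threshold value are illustrative rather than the exact implementation used in the project.

from collections import Counter

def flag_off_topic_words(doc_words, word_topic_probs, threshold=0.2):
    """Flag words that seem to come from a different topic than the document.

    doc_words: list of word ids for the test document.
    word_topic_probs: dict mapping word id -> probability vector over topics.
    threshold: minimum probability under the dominant topic for a word to pass.
    """
    # Estimate the document's main topic: the topic most words favour.
    votes = Counter(max(range(len(word_topic_probs[w])),
                        key=lambda t: word_topic_probs[w][t])
                    for w in doc_words if w in word_topic_probs)
    main_topic = votes.most_common(1)[0][0]

    # Words whose probability under the main topic is low are flagged
    # for the user to investigate further.
    flagged = [w for w in doc_words
               if w in word_topic_probs
               and word_topic_probs[w][main_topic] < threshold]
    return main_topic, flagged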

Results

To show the results of the algorithm, a few sentences from each topic were slightly modified (words from other topics were inserted into them); the algorithm was then run on each modified sentence to check whether it would find the inserted words. Inserted words were put in boldface, and the words the algorithm detected as outliers from the sentence’s topic were underlined.

The topic of Business:

“Finance is one of the most learning important aspects of business algorithm management. Without proper financial planning a heart new enterprise is unlikely to blood be successful. Managing money ( a liquid asset ) is essential kidney to ensure a secure future…”

In this case, the algorithm detected the majority of the inserted words. It also made one false-positive error, marking “liquid” (from “liquid asset”) as an outlier, and one false-negative error, failing to flag the inserted word “kidney”.

The topic of Biology:

“Some proteins that are enterprise made in the cytoplasm contain structural features that target them for transport into mitochondria or the robot nucleus. Some mitochondrial proteins are made inside mitochondria and are finance coded for by mitochondrial DNA. In plants, chloroplasts corporation also make some cell proteins…”

We can see that the algorithm found all inserted words in this sentence, but it also marked some words that do belong to this topic as outliers. If we look at the words that were incorrectly marked, we can see that they are mostly general-purpose words that are not particularly associated with any one topic, which is why the algorithm had a hard time classifying them.

The topic of Machine Learning:

“Support vector machines map economy input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the financial data. The separating hyperplane is the hyperplane that maximizes profit the distance between the two parallel virus hyperplanes. An assumption is made that the bacteria larger the margin or distance between these parallel stomach hyperplanes the better the generalisation error of the classifier will be.”

For the most part, the comments from the biology section apply here too, except that in this case the algorithm missed the inserted word “stomach”.
