This page describes the Naive Bayesian Learner assignment of the course COMP 4360 Machine Learning. The assignment is based on an idea presented by Peter Norvig in Thrun and Norvig's online AI class at Stanford.
The goal of the assignment is to implement an automatic machine translation system similar to Google Translate. Implementing a machine learning system to learn how to translate between two languages is extremely difficult and still an open problem. Just try some translations with Google Translate - a human can often understand the meaning of some text, even though the translation includes many syntactic and semantic errors.
Therefore, we are limiting ourselves to a specific domain - restaurant menus - and implementing a machine learning system that learns to translate various dishes between English and Chinese. For example, after training you should be able to query the system on some unknown new dish (e.g., 牛肉面 => Beef Noodle Soup and vice versa). I scraped two files of Chinese <-> English menu item translations and converted them into csv format (chinese_dishes.csv, chinese_dishes2.csv).
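As a starting point, here is a minimal sketch (in Python, for illustration) of reading the training data. It assumes each csv row holds an English name in the first column and the Chinese name in the second; check the actual files, since the column order may differ.

```python
import csv

def load_pairs(path):
    """Load (english, chinese) menu-item pairs from a CSV file.

    Assumption: column 0 is English, column 1 is Chinese; verify this
    against the actual chinese_dishes.csv files before relying on it.
    """
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```
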
You will use a Naive Bayesian approach to train your machine learning system. To translate from Chinese into English, we want to find the English utterance with the maximum a posteriori probability given the Chinese sentence and the training data.
The probability of an utterance is the product of the probabilities of the individual words. In other words, we assume that the words are statistically independent. That is clearly a gross oversimplification, but a necessary one to keep the computation tractable.
In the first part of the assignment, we will calculate the likelihood and prior probabilities for individual words. So we calculate the posterior probability of an utterance T=t1 t2 t3 given an input in the original language O as:
P(T|O) ∝ P(O|T)*P(T) ≈ P(O|t1)*P(t1) * P(O|t2)*P(t2) * …
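The word-level counting behind this formula can be sketched as follows (in Python, for illustration). The sketch makes two simplifying assumptions that are not prescribed by the assignment: every target word is counted as co-occurring with every source token of the same menu item (no word alignment), and Chinese strings are not segmented into words.

```python
from collections import Counter

def train(pairs):
    """Count priors P(t) and co-occurrences used for likelihoods P(o|t).

    Assumption: each target word co-occurs with every source token of the
    same menu item; proper word alignment is left as an exercise.
    """
    prior = Counter()   # counts of target-language words
    cooc = Counter()    # counts of (source_token, target_word) pairs
    for target, source in pairs:
        t_words = target.lower().split()
        s_words = source.split()  # no Chinese segmentation attempted here
        for t in t_words:
            prior[t] += 1
            for s in s_words:
                cooc[(s, t)] += 1
    return prior, cooc

def score(candidate, source, prior, cooc, eps=1e-6):
    """Posterior up to a constant: product over words of P(O|t_i) * P(t_i).

    eps keeps an unseen word from zeroing out the whole product.
    """
    total = sum(prior.values())
    p = 1.0
    for t in candidate.lower().split():
        p *= max(prior[t] / total, eps)   # prior P(t_i)
        for s in source.split():
            lik = cooc[(s, t)] / prior[t] if prior[t] else 0.0
            p *= max(lik, eps)            # likelihood contribution P(s|t_i)
    return p
```

Given training pairs such as ("beef noodle soup", "牛肉面") and ("beef fried rice", "牛肉炒饭"), the correct candidate should score higher than a wrong one for the same source.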
One problem is that the utterance in the other language will itself contain multiple words, O=o1 o2 o3 … So you will have to come up with an idea for how to approximate the probabilities for a multi-word input in the original language.
Extend the algorithm that you developed to include two-word combinations rather than just single words. For example, so far your system will have independently calculated the probabilities for "black" and "bean"; however, "black bean" is a very common expression, as opposed to "blue bean". Add all two-word combinations that occur in the training text to the table of probabilities.
Make sure that you do not set the likelihood and prior probabilities to exactly 0, but rather to a small constant, so that you do not rule out an utterance just because it has not occurred verbatim in your training data.
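One standard way to do this is add-alpha (Laplace) smoothing, sketched below in Python; the assignment only requires some small nonzero constant, so treat this particular formula as one option, not the prescribed method.

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    """Add-alpha (Laplace) smoothed probability estimate.

    Unseen events (count == 0) get a small nonzero probability instead of 0,
    so a candidate utterance is never ruled out entirely.
    """
    return (count + alpha) / (total + alpha * vocab_size)
```
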
One issue with Naive Bayesian Learners is that the probabilities are often quite small and multiplying them together will lead to underflow in the computation. To avoid this problem, you should add the log of the probabilities instead.
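The underflow and its fix can be demonstrated concretely (Python sketch): multiplying a few hundred small probabilities collapses to 0.0 in floating point, while the sum of logs stays finite and preserves the ranking between candidates.

```python
import math

def log_score(probs):
    """Sum of log-probabilities.

    Equivalent to the log of the product, but immune to the underflow
    that multiplying many small probabilities causes.
    """
    return sum(math.log(p) for p in probs)
```
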
To hand in your assignment:
You should receive a response saying how many bytes were handed in - all of the files you put in your handin directory are compressed for handin, so the number reported may seem unusually small.
If you change your mind about what is handed in or think you've made a mistake, you can enter the handin command as often as you want; each new handin will erase the old. Remember there is a hard deadline on the due date, at which point the handin system will be disabled.
When marking your assignments, the marker will get a copy of everything you placed in the directory. They will start a telnet unix session and cd into your source directory. Depending on the programming language that you have used, they will either: (a) execute make, followed by ./a.out [args] in the case of C or C++, or (b) javac *.java followed by java ass2 [args] in the case of a Java submission.
You will get an email back to the cc account you used to do the handin after the marker is finished. Please ensure the mail on this account is not forwarded elsewhere.