COMP-4360 Machine Learning Assignment 2

Due: Wednesday, 29th March 2012 at 15:30

The following list contains additional information and comments about the project.

  1. Thu Mar 15 13:18:34 CDT 2012: Assignment opened.

Introduction

This page describes the Naive Bayesian Learner assignment of the course COMP 4360 Machine Learning. The assignment is based on an idea presented by Peter Norvig in Thrun's and Norvig's online AI class at Stanford.

The goal of the assignment is to implement an automatic machine translation system similar to Google Translate. Implementing a machine learning system to learn how to translate between two languages is extremely difficult and still an open problem. Just try some translations with Google Translate - a human can often understand the meaning of the text, even though the translation includes many syntactic and semantic errors.

Therefore, we limit ourselves to a specific domain - restaurant menus - and implement a machine learning system that learns to translate various dishes between English and Chinese. For example, after training you should be able to query the system on some unknown new dish (e.g., 牛肉面 => Beef Noodle Soup and vice versa). I scraped two files of Chinese <-> English menu item translations and converted them into CSV format (chinese_dishes.csv, chinese_dishes2.csv).
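
As a starting point, here is a minimal sketch (in Java) of reading one of the CSV files into Chinese/English pairs. It assumes each line holds exactly two comma-separated columns with the Chinese name first; check the actual files and adjust accordingly.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    /** Minimal sketch: read a dishes CSV into (Chinese, English) pairs.
     *  Assumes two comma-separated columns per line; verify the real file layout. */
    public class LoadDishes {
        public static List<String[]> load(String path) throws IOException {
            List<String[]> pairs = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty()) continue;
                    // Split on the first comma only, in case the English side contains commas.
                    String[] cols = line.split(",", 2);
                    if (cols.length == 2) {
                        pairs.add(new String[] { cols[0].trim(), cols[1].trim() });
                    }
                }
            }
            return pairs;
        }

        public static void main(String[] args) throws IOException {
            List<String[]> pairs = load(args.length > 0 ? args[0] : "chinese_dishes.csv");
            System.out.println("Loaded " + pairs.size() + " dish pairs.");
        }
    }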

Naive Bayesian Learner

You will use a Naive Bayesian approach to train your machine learning system. To translate from Chinese into English, we want to find the English utterance with the maximum a posteriori probability given the Chinese sentence and the training data.

The probability of an utterance is the product of the probabilities of the individual words. In other words, we assume that the words are statistically independent. That is clearly a gross oversimplification, but a necessary one to keep the computation tractable.

Word-Based Approach (70%)

In the first part of the assignment, we will calculate the likelihood and prior probabilities for individual words. So we approximate the posterior probability of an utterance T = t1 t2 t3 given an input O in the original language as:

P(T|O) = P(O|T)*P(T)/P(O) ∝ P(O|T)*P(T) ≈ P(O|t1)*P(t1) * P(O|t2)*P(t2) * P(O|t3)*P(t3)
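
To make the idea concrete, here is a small sketch of scoring one candidate translation from per-word likelihood and prior tables. The table layout and names are placeholders, not a required design; how you estimate the entries from the training data is up to you.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: score a candidate translation T = t1 t2 ... for a given source string O
     *  using per-word likelihoods P(O|t_i) and priors P(t_i). */
    public class WordScore {
        // likelihood.get(t).get(O) holds an estimate of P(O|t); prior.get(t) holds P(t).
        static double score(String[] targetWords, String source,
                            Map<String, Map<String, Double>> likelihood,
                            Map<String, Double> prior, double floor) {
            double p = 1.0;
            for (String t : targetWords) {
                Map<String, Double> row = likelihood.get(t);
                double like = (row == null) ? floor : row.getOrDefault(source, floor);
                double pri = prior.getOrDefault(t, floor);
                p *= like * pri;   // naive independence: multiply the per-word terms
            }
            return p;
        }

        public static void main(String[] args) {
            // Tiny hand-filled example, purely for illustration.
            Map<String, Map<String, Double>> likelihood = new HashMap<>();
            likelihood.put("beef", new HashMap<>());
            likelihood.get("beef").put("牛肉面", 0.5);
            Map<String, Double> prior = new HashMap<>();
            prior.put("beef", 0.1);
            System.out.println(score(new String[] { "beef" }, "牛肉面", likelihood, prior, 1e-6));
        }
    }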

One problem is that the utterance in the original language will itself contain multiple words, O = o1 o2 o3 … So you will have to come up with an idea for how to approximate these per-word probabilities when the original-language input consists of more than one word.

Two Word Combinations (30%)

Extend the algorithm that you developed to include two-word combinations rather than just single words. For example, so far your system will have calculated the probabilities for "black" and "bean" independently; however, "black bean" is a very common expression, whereas "blue bean" is not. Add all two-word combinations that occur in the training text to the table of probabilities.
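
For example, a small helper along these lines (the name is illustrative only) collects all adjacent two-word combinations from an English entry:

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch: collect all adjacent two-word combinations from an utterance,
     *  e.g. "black bean sauce" -> ["black bean", "bean sauce"]. */
    public class Bigrams {
        static List<String> twoWordCombinations(String utterance) {
            String[] words = utterance.toLowerCase().trim().split("\\s+");
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i + 1 < words.length; i++) {
                pairs.add(words[i] + " " + words[i + 1]);
            }
            return pairs;
        }

        public static void main(String[] args) {
            System.out.println(twoWordCombinations("Black Bean Sauce"));
            // prints [black bean, bean sauce]
        }
    }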

Hints

Make sure that you do not set the likelihood and prior probabilities of unseen words to exactly 0, but rather to a small constant, so that an utterance is not ruled out just because it has not occurred exactly in your training data.
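
For example, a lookup might fall back to a small floor value for unseen words; the constant below is an arbitrary placeholder, so choose and justify your own.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch: never return exactly 0 for an unseen word; fall back to a small floor constant. */
    public class SmoothedLookup {
        static final double FLOOR = 1e-6;  // arbitrary placeholder value

        static double prob(Map<String, Double> table, String word) {
            return table.getOrDefault(word, FLOOR);
        }

        public static void main(String[] args) {
            Map<String, Double> prior = new HashMap<>();
            prior.put("beef", 0.1);
            System.out.println(prob(prior, "beef"));     // 0.1
            System.out.println(prob(prior, "durian"));   // 1.0E-6, not 0
        }
    }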

One issue with Naive Bayesian learners is that the individual probabilities are often quite small, and multiplying them together can lead to floating-point underflow. To avoid this problem, you should add the logs of the probabilities instead of multiplying the probabilities themselves.
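
In code, that means comparing candidates by a sum of logs rather than a product of probabilities, for example:

    /** Sketch: compare utterances by summed log probabilities to avoid underflow. */
    public class LogScore {
        static double logScore(double[] wordProbs) {
            double sum = 0.0;
            for (double p : wordProbs) {
                sum += Math.log(p);   // requires every p > 0, which the smoothing above guarantees
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] a = { 1e-4, 2e-5, 3e-6 };
            double[] b = { 1e-3, 1e-6, 1e-7 };
            // The candidate with the larger log score also has the larger product.
            System.out.println(logScore(a) + " vs " + logScore(b));
        }
    }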

Submission

To hand in your assignment:

  1. create a directory to hold all of your data. The directory name should be 4360_a<num>_<login_name>.
  2. Inside of the directory 4360_a<num>_<login_name>, create a directory source.
  3. put your source code for your assignment solution (it's OK if you have more than one source file) into the directory source.
  4. Your program must compile by just running the command make in the source directory (a minimal Makefile sketch is shown after this list).
  5. place a copy of the electronic handin honesty declaration (modified to include your own name and student number) at the top level, above your source directory. Understand that doing so and handing it in from your account both securely identifies you and holds you to the responsibilities of a paper honesty declaration.
  6. Write a readme.txt file to explain anything you feel is necessary. This may include special features or known bugs of your program and which parts of the assignment you implemented. Making things clear here for the marker is in your own best interest.
  7. Once you're happy that you have everything and it all works the way it should, move outside the directory itself (i.e. make sure you can see the name of the directory in the list of files if you do an ls).
  8. at the Unix prompt, type: handin <course_number>_a<num>_<login_name>. For example: handin 4360_a2_umsmith
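
For step 4, a minimal Makefile along the following lines would suffice for a Java submission whose sources sit directly in the source directory; adjust it to match your own file layout or language.

    # Minimal Makefile sketch for a Java submission (sources in the source directory).
    # Note: the command line below must begin with a TAB character.
    all:
    	javac *.java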

You should receive a response saying how many bytes were handed in - all of the files you put in your handin directory are compressed for handin, so the number reported may seem unusually small.

If you change your mind about what is handed in or think you've made a mistake, you can enter the handin command as often as you want; each new handin will erase the old. Remember there is a hard deadline on the due date, at which point the handin system will be disabled.

When marking your assignment, the marker will get a copy of everything you placed in the directory. They will start a telnet Unix session and cd into your source directory. Depending on the programming language you used, they will either (a) execute make, followed by ./a.out [args], in the case of C or C++, or (b) run javac *.java followed by java ass2 [args] in the case of a Java submission.

You will get an email back to the cc account you used to do the handin after the marker is finished. Please ensure the mail on this account is not forwarded elsewhere.