
COMP-4360 Machine Learning Assignment 1

Due: Monday, 30th January 2012, 15:30


Introduction

This page describes the first assignment for the course COMP-4360 Machine Learning.

The assignment covers chapters 1 to 3 in Tom Mitchell's book Machine Learning. It focuses on issues in the learning of simple concepts using decision tree induction, ID3, validation, and reduced error pruning.

City of Winnipeg Census Data Set

The City of Winnipeg makes census data for several years freely available on its website http://winnipeg.ca/census/2006/Selected%20Topics/default.asp#data. The site includes census data for 1996, 2001, and 2006, in the form of XLS spreadsheets that can be opened with Excel or similar programs such as LibreOffice.

The data includes information about the make-up of the population, their mobility, the cost of housing, income, etc.

The data is ordered by neighbourhood: Agassiz, Airport, ...

The goal of this assignment is to compare various methods for learning a classifier to predict the income of a population.

Pre-processing the Census Data Set

As in most machine learning applications, this data needs to be pre-processed so that it can be converted into an attribute-value representation.

The first part of this assignment is to download and merge the data from the website. Create a spreadsheet or database with at least the following information as columns: "Percent of Total Population With Aboriginal Identity", "Average household income", "Did not move", "Moved within Winnipeg", "Moved within Manitoba", "Moved within Canada", "Moved internationally", "Participation Rate (15 years and over)", "Employment Rate (15 years and over)", "Unemployment Rate (15 years and over)", "Participation Rate (15 - 24 years)", "Employment Rate (15 - 24 years)", "Unemployment Rate (15 - 24 years)", "Participation Rate (25 years and over)", "Employment Rate (25 years and over)", "Unemployment Rate (25 years and over)", "Average Rent", "Average value of dwelling", "Cars, truck, van", "Public transit", "Passenger", "Walk", "Bicycle", "Taxicab", "Motorcycle", "Others", "Average Number of Persons per Census Family", "Average Number of Children at Home per Census Family", "Population Age 0 to 4 Years Percent", "Population Age 5 to 9 Years Percent", "Population Age 10 to 14 Years Percent", "Population Age 65 to 74 Years Percent", "Population Age 75 Years and Over Percent", "Recent Immigrants Percent of Total Population", "Percent of Total Population that are Visible Minorities".

Use percentages to normalize the data with respect to the number of households in a district.
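If you choose to automate the merging and normalization, a short script is enough. The sketch below is only one possibility, not required code: it assumes each census table has been exported to CSV with a shared "Neighbourhood" key column, and the file names and helper names are illustrative.

```python
import csv

def merge_tables(paths, key="Neighbourhood"):
    """Combine rows from several CSV exports into one record per
    neighbourhood, matching rows on the key column."""
    merged = {}
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                merged.setdefault(row[key], {}).update(row)
    return merged

def to_percent(count, total):
    """Normalize a raw count by the number of households in a district;
    returns None when the total is missing or zero (an unknown entry)."""
    return 100.0 * count / total if total else None
```

The merged dictionary can then be written back out with `csv.DictWriter` once the continuous columns have been discretized.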

Use the following simple strategy to convert the continuous variables (i.e., percentages, income) into discrete attributes. Find the maximum and minimum value of the data, then break the range from minimum to maximum into seven equal-sized intervals labelled XXL, XL, L, M, S, XS, XXS. If an entry is unknown, replace the attribute value with UNK.

If the minimum and maximum value are the same, then ignore this attribute and classify every value as M.

For example, average household income for the City of Winnipeg ranges from $21,559 (Lord Selkirk Park) to $253,396 (Old Tuxedo), so each interval is about $33,120 wide. This would imply the following attribute values for household income: $21,559 to $54,679 = XXS, $54,680 to $87,798 = XS, $87,799 to $120,918 = S, $120,919 to $154,037 = M, $154,038 to $187,157 = L, $187,158 to $220,276 = XL, and $220,277 to $253,396 = XXL.
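The binning strategy can be sketched in a few lines of Python (a sketch, not required code; note that values landing exactly on an interval boundary may round to either side, and the maximum value is kept in the top interval):

```python
def discretize(values, labels=("XXS", "XS", "S", "M", "L", "XL", "XXL")):
    """Map numeric values to seven equal-width interval labels.

    `values` may contain None for unknown entries, which become "UNK".
    If all known values are equal, every known value maps to "M".
    """
    known = [v for v in values if v is not None]
    lo, hi = min(known), max(known)
    if lo == hi:
        return ["M" if v is not None else "UNK" for v in values]
    width = (hi - lo) / len(labels)
    out = []
    for v in values:
        if v is None:
            out.append("UNK")
        else:
            # min(..., len-1) keeps the maximum value in the top interval
            idx = min(int((v - lo) / width), len(labels) - 1)
            out.append(labels[idx])
    return out
```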

You can either pre-process the data manually, write an Excel or LibreOffice formula or macro, or write a small program to convert the data.

The output of your pre-processing should look something like this:

Neighbourhood,Percent of Total Population With Aboriginal Identity,Average household income,Did not move,Moved within Winnipeg,Moved within Manitoba,Moved within Canada,Moved internationally,Participation Rate (15 years and over),Employment Rate (15 years and over),Unemployment Rate (15 years and over),Participation Rate (15 - 24 years),Employment Rate (15 - 24 years),Unemployment Rate (15 - 24 years),Participation Rate (25 years and over),Employment Rate (25 years and over),Unemployment Rate (25 years and over),Average Rent,Average value of dwelling,"Cars, truck, van",Public transit,Passenger,Walk,Bicycle,Taxicab,Motorcycle,Others,Average Number of Persons per Census Family,Average Number of Children at Home per Census Family,Population Age 0 to 4 Years,Population Age 5 to 9 Years,Population Age 10 to 14 Years,Population Age 14 to 65 Years,Population Age 65 to 74 Years,Population Age 75 Years and Over,Recent Immigrants Percent of Total Population,Percent of Total Population that are Visible Minorities
Agassiz,XXS,S,XXL,XS,XXS,XXS,XS,L,L,S,XL,S,XXL,L,M,XS,XXS,S,L,S,XS,XXS,L,XXS,XXS,XXS,S,S,XL,XS,XS,M,XL,XS,XXS,XXS
Airport,S,XXL,L,XS,XXS,XXL,XXS,XXL,XXL,XXS,XXL,XXL,XXS,XXL,XL,XXS,S,XXS,S,S,XXS,XL,XXS,XXS,XXS,XXS,XS,S,XXS,L,XL,S,XXS,XXS,XXS,XXS
Alpine Place,XXS,XXS,L,M,XS,XXS,S,M,S,XS,L,L,XXS,S,S,XS,S,XS,M,XL,XS,XXS,XS,XXS,XXS,XS,XS,XS,S,XS,XXS,S,L,S,XS,XXS
Amber Trails,XXS,S,XL,S,XXS,XXS,XS,XL,XL,XS,L,L,S,XL,XL,XS,XXS,S,XL,XXS,S,XXS,XXS,XXS,XXS,XXS,L,M,S,M,M,XS,XXS,XXS,XS,L
...

Decision Tree

Generate a CSV file with the pre-processed data as described above.

Use household income = XXL and household income = XXS as the target concepts that you are trying to learn.

Question 1 ID3 (50%)

Implement the ID3 decision tree algorithm as described in class. Your program should take your CSV data as input and print out the decision tree as generated by ID3.
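As a starting point, the core of ID3 (entropy, information gain, and recursive tree building) can be sketched as below. This is only an outline under simplifying assumptions; it represents a tree as either a class label (leaf) or an (attribute, children) pair, and omits CSV parsing, UNK handling, and tree printing, which your submission still needs.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on `attr`."""
    n = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Build a decision tree: a class label at a leaf, otherwise an
    (attribute, {value: subtree}) pair chosen by maximum information gain."""
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    children = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        srows, slabels = zip(*sub)
        children[value] = id3(list(srows), list(slabels),
                              [a for a in attrs if a != best])
    return (best, children)
```

The rows here are dictionaries mapping attribute names to the discrete values (XXS ... XXL, UNK) produced by your pre-processing.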

Analysis

Question 2 Most Significant Attributes (10%)

What are the top five attributes that determine income level? Discuss and try to explain your results. Did you find the outcome of your analysis surprising?

Question 3 Holdout Validation (20%)

Estimate the generalization ability of your algorithm using a 10% holdout method. That is, select 10% of the data, train on the remaining 90% and then test the accuracy on the 10% test set.
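The holdout procedure itself is small. A sketch (names are illustrative; a fixed seed makes the split reproducible, which helps when comparing pruned and unpruned trees):

```python
import random

def holdout_split(rows, labels, test_frac=0.10, seed=0):
    """Shuffle the data and hold out `test_frac` of it as a test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = max(1, int(len(rows) * test_frac))
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def accuracy(predict, rows, labels):
    """Fraction of rows that the classifier `predict` labels correctly."""
    return sum(predict(r) == l for r, l in zip(rows, labels)) / len(rows)
```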

Overfitting

Overfitting is a problem in many machine learning domains. The algorithm's ability to predict values in the training data improves with more complex models, but its ability to classify previously unseen instances deteriorates.

Quinlan (1987) suggests "Reduced Error Pruning" as a method to avoid overfitting; the CART system of Breiman et al. (1984) implements a related technique, cost-complexity pruning.

Question 4 Reduced Error Pruning (20%)

Implement Reduced Error Pruning (i.e., greedily replace subtrees with leaves whenever doing so does not increase the error on a validation set) for your ID3 algorithm.
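One way to structure the pruning pass is a bottom-up traversal that replaces a subtree with a leaf whenever the replacement does not hurt validation accuracy. The sketch below is an outline, not the required implementation: it assumes an (attribute, {value: subtree}) tree representation, and for brevity uses the majority label among a subtree's leaves as the candidate leaf, whereas your implementation should use the majority class of the training examples reaching the node.

```python
from collections import Counter

def classify(tree, row, default):
    """Walk the tree; fall back to `default` for unseen attribute values."""
    while isinstance(tree, tuple):
        attr, children = tree
        if row.get(attr) not in children:
            return default
        tree = children[row[attr]]
    return tree

def tree_accuracy(tree, rows, labels, default):
    return sum(classify(tree, r, default) == l
               for r, l in zip(rows, labels)) / len(rows)

def prune(tree, val_rows, val_labels, default):
    """Reduced error pruning: bottom-up, replace a subtree with a leaf
    when validation accuracy does not drop."""
    if not isinstance(tree, tuple):
        return tree                      # already a leaf
    attr, children = tree
    children = {v: prune(sub, val_rows, val_labels, default)
                for v, sub in children.items()}
    tree = (attr, children)
    leaves = []                          # collect this subtree's leaf labels
    def collect(t):
        if isinstance(t, tuple):
            for s in t[1].values():
                collect(s)
        else:
            leaves.append(t)
    collect(tree)
    leaf = Counter(leaves).most_common(1)[0][0]
    if tree_accuracy(leaf, val_rows, val_labels, default) >= \
       tree_accuracy(tree, val_rows, val_labels, default):
        return leaf                      # pruning did not increase error
    return tree
```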

Briefly show and discuss the performance of ID3 with reduced error pruning on the Winnipeg census data set.

Sample Domain

A small sample CSV file describing the play sports domain is given to help you debug your algorithms.
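The sample file is presumably in the same spirit as the play-tennis data from chapter 3 of Mitchell's book, reproduced here for reference (the actual file provided with the assignment may differ):

```csv
Outlook,Temperature,Humidity,Wind,PlayTennis
Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No
Overcast,Hot,High,Weak,Yes
Rain,Mild,High,Weak,Yes
Rain,Cool,Normal,Weak,Yes
Rain,Cool,Normal,Strong,No
Overcast,Cool,Normal,Strong,Yes
Sunny,Mild,High,Weak,No
Sunny,Cool,Normal,Weak,Yes
Rain,Mild,Normal,Weak,Yes
Sunny,Mild,Normal,Strong,Yes
Overcast,Mild,High,Strong,Yes
Overcast,Hot,Normal,Strong,Yes
Rain,Mild,High,Weak,Yes
```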

Submission

First, remember that your assignment must run on the acn unix systems with no requirements other than a simple terminal session (i.e., text only).

To hand in your assignment:

  1. Create a unix directory to hold all of your data. The directory name should be <login_name>_cs4360_a1.
  2. Inside the directory <login_name>_cs4360_a1, create a directory source.
  3. Put your source code as well as your result files (*_result.txt) for your assignment solution (it's ok if you have more than one source file) into the directory source. Please do not hand in compiled executables as well - they're relatively large and just congest things. So this directory should include only .java, .c, .cpp, .h, Makefile, etc.
  4. Your result files should show the output of your program when run on the standard test set.

  5. Place a copy of the electronic handin honesty declaration (modified to include your own name and student number) at the top level, above your source directory. Understand that doing so and handing it in from your account both securely identifies you and holds you to the responsibilities of a paper honesty declaration.

  6. Write a readme.txt file to explain anything you feel is necessary. (e.g. which java class is your executable, any oddities you feel might affect marking, and so on). Making things clear here for the marker is in your own best interest.

  7. Once you're happy that you have everything and it all works the way it should, move outside the directory itself (i.e. make sure you can see the name of the directory in the list of files if you do an ls).
  8. At the unix prompt, type: handin 4360 a1 <login_name>_cs4360_a1

You should receive a response saying how many bytes were handed in - all of the files you put in your handin directory are compressed for handin, so the number reported may seem unusually small.

If you change your mind about what is handed in or think you've made a mistake, you can enter the handin command as often as you want, each new handin will erase the old. Remember there is a hard deadline of 15:30 on the due date, at which point the handin system will be disabled.

When marking your assignments, the marker will get a copy of everything you placed in the directory. They will start a telnet unix session and cd into your source directory. Depending on the programming language that you have used, they will either: (a) execute make, followed by ./a.out [args] in the case of C or C++, or (b) javac *.java followed by java <main_class> [args] in the case of a Java submission (your readme.txt should name the main class).

You will get an email back to the cc account you used to do the handin after the marker is finished. Please ensure the mail on this account is not forwarded elsewhere.


Hansjoerg Baltes (jacky)