The following list contains additional information and comments about the project.
Thu Jan 19 09:01:35 CST 2012: Added reduced error pruning.
Sun Jan 15 19:32:15 CST 2012: Assignment opened.
This page describes the first assignment for the course COMP-4360 Machine Learning.
The assignment covers chapters 1 to 3 in Tom Mitchell's book Machine Learning. It focuses on issues in the learning of simple concepts using decision tree induction, ID3, validation, and reduced error pruning.
The City of Winnipeg makes census data for several years freely available on their website http://winnipeg.ca/census/2006/Selected%20Topics/default.asp#data. The data includes the census data for 1996, 2001, and 2006. The data is available in the form of xls spreadsheets that can be opened via Excel or similar programs such as LibreOffice.
The data includes information about the make-up of the population, their mobility, the cost of housing, income, etc.
The data is ordered by neighbourhood: Agassiz, Airport, ...
The goal of this assignment is to compare various methods for learning a classifier to predict the income of a population.
As in most machine learning applications, this data needs to be pre-processed so that it can be converted into attribute-value representation.
The first part of this assignment is to download and merge the data from the website. Create a spreadsheet or database with at least the following information as columns: "Percent of Total Population With Aboriginal Identity", "Average household income", "Did not move", "Moved within Winnipeg", "Moved within Manitoba", "Moved within Canada", "Moved internationally", "Participation Rate (15 years and over)", "Employment Rate (15 years and over)", "Unemployment Rate (15 years and over)", "Participation Rate (15 - 24 years)", "Employment Rate (15 - 24 years)", "Unemployment Rate (15 - 24 years)", "Participation Rate (25 years and over)", "Employment Rate (25 years and over)", "Unemployment Rate (25 years and over)", "Average Rent", "Average value of dwelling", "Cars, truck, van", "Public transit", "Passenger", "Walk", "Bicycle", "Taxicab", "Motorcycle", "Others", "Average Number of Persons per Census Family", "Average Number of Children at Home per Census Family", "Population Age 0 to 4 Years Percent", "Population Age 5 to 9 Years Percent", "Population Age 10 to 14 Years Percent", "Population Age 65 to 74 Years Percent", "Population Age 75 Years and Over Percent", "Recent Immigrants Percent of Total Population", "Percent of Total Population that are Visible Minorities".
Use percentages to normalize the data with respect to the number of households in a district.
Use the following simple strategy to convert the continuous variables (i.e., percentages, income) into discrete attributes. Find the maximum and minimum value for the data and then break the range from minimum to maximum into seven equal-sized intervals, labelled XXL, XL, L, M, S, XS, and XXS. If an entry is
unknown, then replace the attribute with the most common value for that attribute.
If the minimum and maximum value are the same, then ignore this attribute and classify every value as M.
For example, the household income distribution for the City of Winnipeg ranges from $21,559 (Lord Selkirk Park) to $253,396 (Old Tuxedo), giving an interval width of about $33,120. This implies the following attribute values for household income: $21,559 to $54,678 = XXS, $54,679 to $87,798 = XS, $87,799 to $120,917 = S, $120,918 to $154,037 = M, $154,038 to $187,156 = L, $187,157 to $220,276 = XL, and $220,277 to $253,396 = XXL.
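The binning rule above can be sketched in a few lines of Java. The class and method names here are our own, not part of the assignment:

```java
// Sketch of the seven-bin discretization described above; the label order
// and the min == max special case follow the assignment text.
public class Discretize {
    static final String[] LABELS = {"XXS", "XS", "S", "M", "L", "XL", "XXL"};

    // Map a continuous value onto one of seven equal-width bins.
    static String toBin(double value, double min, double max) {
        if (min == max) return "M";           // degenerate attribute: every value is M
        double width = (max - min) / 7.0;
        int bin = (int) ((value - min) / width);
        if (bin > 6) bin = 6;                 // value == max falls into the top bin
        return LABELS[bin];
    }

    public static void main(String[] args) {
        // Household income example from the text: min $21,559, max $253,396.
        System.out.println(toBin(21559, 21559, 253396));   // XXS
        System.out.println(toBin(253396, 21559, 253396));  // XXL
        System.out.println(toBin(100000, 21559, 253396));  // S
    }
}
```

The same method can be applied column by column once the minimum and maximum of each attribute are known.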
You can either pre-process the data manually, write an Excel or LibreOffice formula or macro, or write a small program to convert the data.
The output of your pre-processing should look something like this:
Neighbourhood,Percent of Total Population With Aboriginal Identity,Average household income,Did not move,Moved within Winnipeg,Moved within Manitoba,Moved within Canada,Moved internationally,"Participation Rate (15 years and over)","Employment Rate (15 years and over)","Unemployment Rate (15 years and over)","Participation Rate (15 - 24 years)","Employment Rate (15 - 24 years)","Unemployment Rate (15 - 24 years)","Participation Rate (25 years and over)","Employment Rate (25 years and over)","Unemployment Rate (25 years and over)",Average Rent,Average value of dwelling,"Cars, truck, van",Public transit,Passenger,Walk,Bicycle,Taxicab,Motorcycle,Others,Average Number of Persons per Census Family,Average Number of Children at Home per Census Family,Population Age 0 to 4 Years,Population Age 5 to 9 Years,Population Age 10 to 14 Years,Population Age 14 to 65 Years,Population Age 65 to 74 Years,Population Age 75 Years and Over,Recent Immigrants Percent of Total Population,Percent of Total Population that are Visible Minorities
Agassiz,XXS,S,XXL,XS,XXS,XXS,XS,L,L,S,XL,S,XXL,L,M,XS,XXS,S,L,S,XS,XXS,L,XXS,XXS,XXS,S,S,XL,XS,XS,M,XL,XS,XXS,XXS
Airport,S,XXL,L,XS,XXS,XXL,XXS,XXL,XXL,XXS,XXL,XXL,XXS,XXL,XL,XXS,S,XXS,S,S,XXS,XL,XXS,XXS,XXS,XXS,XS,S,XXS,L,XL,S,XXS,XXS,XXS,XXS
Alpine Place,XXS,XXS,L,M,XS,XXS,S,M,S,XS,L,L,XXS,S,S,XS,S,XS,M,XL,XS,XXS,XS,XXS,XXS,XS,XS,XS,S,XS,XXS,S,L,S,XS,XXS
Amber Trails,XXS,S,XL,S,XXS,XXS,XS,XL,XL,XS,L,L,S,XL,XL,XS,XXS,S,XL,XXS,S,XXS,XXS,XXS,XXS,XXS,L,M,S,M,M,XS,XXS,XXS,XS,L
...
Generate a CSV file with the pre-processed data as described above.
Use household income = XXL and household income = XXS as the target concepts that you are trying to learn.
Implement the ID3 decision tree algorithm as described in class. Your program should take your CSV data as input and print out the decision tree as generated by ID3.
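As a point of reference, the entropy and information-gain computations at the heart of ID3 (Mitchell, ch. 3) can be sketched as follows. The class and method names are illustrative, and the code assumes examples are stored as rows of strings with the class label at a known column index:

```java
import java.util.*;

// Sketch of the core ID3 quantities: entropy of a label collection and the
// information gain of splitting on an attribute. Names are our own.
public class Id3Core {
    // Entropy of a collection of class labels: -sum p_i * log2(p_i).
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of splitting `rows` on attribute column `attr`,
    // where the class label sits in column `target`.
    static double gain(List<String[]> rows, int attr, int target) {
        List<String> all = new ArrayList<>();
        Map<String, List<String>> byValue = new HashMap<>();
        for (String[] row : rows) {
            all.add(row[target]);
            byValue.computeIfAbsent(row[attr], k -> new ArrayList<>()).add(row[target]);
        }
        double remainder = 0.0;
        for (List<String> subset : byValue.values())
            remainder += (double) subset.size() / rows.size() * entropy(subset);
        return entropy(all) - remainder;
    }
}
```

ID3 itself then recursively picks the attribute with the highest gain, splits the rows on its values, and stops when a node's labels are pure or no attributes remain.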
What are the top five attributes that determine income level? Discuss and try to explain your results. Did you find the outcome of your analysis surprising?
Estimate the generalization ability of your algorithm using a 10% holdout method. That is, select 10% of the data, train on the remaining 90% and then test the accuracy on the 10% test set.
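A 10% holdout split can be sketched as follows (names are our own). Shuffling with a fixed seed makes the split reproducible and avoids bias from the alphabetical-by-neighbourhood ordering of the CSV:

```java
import java.util.*;

// Sketch of a 10% holdout split: shuffle, then hold out the first 10% of
// rows as a test set and train on the remaining 90%.
public class Holdout {
    static List<List<String[]>> split(List<String[]> rows, long seed) {
        List<String[]> copy = new ArrayList<>(rows);
        Collections.shuffle(copy, new Random(seed));
        int testSize = rows.size() / 10;                  // 10% held out for testing
        List<String[]> test = new ArrayList<>(copy.subList(0, testSize));
        List<String[]> train = new ArrayList<>(copy.subList(testSize, copy.size()));
        return Arrays.asList(train, test);
    }
}
```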
Overfitting is a problem in many machine learning domains. The algorithm's ability to predict values in the training data improves with more complex models, but its ability to classify previously unseen instances deteriorates.
Quinlan (1987) suggests "Reduced Error Pruning" as a method to avoid overfitting; a related pruning technique is implemented in the CART system of Breiman et al. (1984).
Implement Reduced Error Pruning (i.e., greedily remove nodes that do not lead to increased error in the validation set) for your ID3 algorithm.
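One possible shape for the pruning pass, assuming a simple node structure for the learned tree (all class, field, and method names here are illustrative): prune the subtrees bottom-up, tentatively replace each internal node with a leaf carrying its majority label, and keep the replacement only if accuracy on the validation set does not drop.

```java
import java.util.*;

// Minimal sketch of reduced error pruning on an already-built tree.
public class Prune {
    static class Node {
        int attr = -1;                        // column tested; -1 marks a leaf
        String label;                         // majority class label at this node
        Map<String, Node> children = new HashMap<>();
        boolean isLeaf() { return attr < 0; }
    }

    static String classify(Node n, String[] row) {
        while (!n.isLeaf()) {
            Node child = n.children.get(row[n.attr]);
            if (child == null) return n.label; // unseen value: fall back to majority
            n = child;
        }
        return n.label;
    }

    static double accuracy(Node root, List<String[]> val, int target) {
        int ok = 0;
        for (String[] r : val) if (classify(root, r).equals(r[target])) ok++;
        return val.isEmpty() ? 1.0 : (double) ok / val.size();
    }

    // Bottom-up pass: prune the children first, then tentatively turn this
    // node into a leaf and keep the change if validation accuracy does not drop.
    static void prune(Node root, Node n, List<String[]> val, int target) {
        if (n.isLeaf()) return;
        for (Node c : n.children.values()) prune(root, c, val, target);
        double before = accuracy(root, val, target);
        int savedAttr = n.attr;
        Map<String, Node> savedKids = n.children;
        n.attr = -1;                           // tentatively make it a leaf
        n.children = new HashMap<>();
        if (accuracy(root, val, target) < before) {
            n.attr = savedAttr;                // pruning hurt: undo
            n.children = savedKids;
        }
    }
}
```

This greedy variant re-evaluates validation accuracy after every tentative replacement; other orderings of the candidate nodes are possible.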
Briefly show and discuss the performance of ID3 with reduced error pruning on the Winnipeg Census data set.
A small sample CSV file describing the play sports domain is given to help you debug your algorithms.
First, remember that your assignment must run on the acn Unix systems with no requirements other than a simple terminal session (i.e., text only).
To hand in your assignment, create a directory <login_name>_cs4360_a1.
Your results files should show the output of your program when run on the standard test set.
Place a copy of the handin honesty declaration (modified to include your own name and student number) at the top level, above your source directories. Understand that doing so and handing it in from your account both securely identifies you and holds you to the responsibilities of a paper honesty declaration.
Write a readme.txt file to explain anything you feel is necessary (e.g., which Java class is your executable, any oddities you feel might affect marking, and so on). Making things clear here for the marker is in your own best interest.
handin 4360 a1 <login_name>_cs4360_a1
You should receive a response saying how many bytes were handed in - all of the files you put in your handin directory are compressed for handin, so the number reported may seem unusually small.
If you change your mind about what is handed in or think you've made a mistake, you can enter the handin command as often as you want; each new handin will erase the old. Remember there is a hard deadline of 15:30 on the due date, at which point the handin system will be disabled.
When marking your assignments, the marker will get a copy of everything you placed in the directory. They will start a telnet session and cd into your directory. Depending on the programming language that you have used, they will either (a) execute make followed by ./a.out [args] in the case of C or C++, or (b) javac *.java followed by java ass1 [args] in the case of a Java submission.
You will get an email back to the cc account you used to do the handin after the marker is finished. Please ensure the mail on this account is not forwarded elsewhere.