Date: 2021-03-24 08:17

DRAFT
Statistical Learning Assignment - Semester 1, 2021
INSTRUCTIONS:
1. The assignment must be typed (not handwritten). You may either use Microsoft Word (or similar)
or R markdown in RStudio for the assignment. Note that the final project will require the use of
R markdown. Your answer to this question should be no longer than 10 A4 pages
[single sided], with a font size no smaller than 11 point.
2. The assignment due date is listed on the Wattle (Turn-it-in) site. Upload the assignment through
Wattle using Turn-it-in. You should submit your assignment in two different parts. If you are
using R markdown:
(a) A pdf file [or HTML file] of your assignment (this should include important R code to highlight
what you have done).
(b) A ‘.Rmd’ file [an R markdown file].
If you are using Microsoft Word (or similar):
(a) A Word file of your assignment (this should include important R code to highlight what you
have done).
(b) A ‘.R’ file of your R code.
3. In answering the questions, write your answers clearly and succinctly. Use appropriate graphs and
tables when you think they help to describe your point or thinking process. Do not just “print” a
set of results. Every result should be discussed and have a reason for being presented. No points
will be awarded unless you clearly discuss what you are doing.
4. No late assignments will be accepted.
5. You should not discuss the assignment (questions, solutions, code, etc.) with your
classmates or other individuals. You can discuss these with me or your tutor (Dr.
Ha Nguyen) during our consultation times. You must independently write your own
solutions. This includes all computer code, English, and mathematics. University
policies on academic integrity will be strictly enforced. See
http://www.anu.edu.au/students/program-administration/assessments-exams/academic-honesty-plagiarism
for more details.
6. Have fun with the exploration!
1. (100 points) We will explore some of the techniques we are considering by examining data on housing
prices. We will use the data from the prediction competition available on Kaggle:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques. For this question you will need to create
an account on Kaggle. Please let me know if you don’t want to use Kaggle based on privacy concerns.
(a) Create an account on Kaggle. What is your Kaggle username? Download the training and test
data.
(b) Consider a multiple regression model to examine the relationship between housing sale prices (Y)
in Ames, Iowa, USA from 2006 to 2010 and their covariate information (x). While 79 covariates
are available, for this assignment we will only use a few covariates. Only consider the following
covariates: LotArea, OverallCond, GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. As the real
test data does not contain the response Y (SalePrice), split the training data in half. The first
half will be the new training data and the second half will be your personal test data. For this
assignment set α = 0.05.
i. (20 points) Using all of the training data together (personal training and test data), conduct
an exploratory data analysis. In doing your analysis make sure to identify any unusual points
and discuss why they are unusual. For this assignment do not remove any unusual points, only
comment on them (if they exist). In addition to visualisations of the raw data, consider the
natural log transformation of the response. You may also consider any transformations of the
covariates. For the rest of the assignment, if you believe the transformations are appropriate
(provide justification - this can simply be a discussion), use those transformations.
ii. (6 points) Using just your personal training data and the covariate GrLivArea, based on
traditional regression approaches (possibly: t-tests, F-tests, etc.), determine if there exists
a non-linear (quadratic, cubic, etc.) relationship between the covariate and the response. How flexible
should the model be? Make sure to fully outline any tests and conclusions.
iii. (6 points) Using your personal training and personal testing data, along with the notion of
squared error loss, determine if there exists a non-linear (quadratic, cubic, etc.) relationship
between the covariate and the response. How flexible should the model be?
iv. (6 points) Consider all the covariates which we are using in this assignment: LotArea, OverallCond,
GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. Using just your personal training
data and traditional regression approaches, determine if any of the variables are statistically
significant. Are you able to reduce the model (i.e. not use all the covariates)? Here you do
not need to consider any non-linearities or interactions. Make sure to fully outline any tests
and conclusions.
v. (6 points) Based on the ordering of the covariates in your final model in the previous question,
using your personal training and personal testing data, along with the notion of squared error
loss, determine which covariates should be included in the model.
vi. (6 points) Consider all the covariates which we are using in this assignment: LotArea, OverallCond,
GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. Using just your personal training
data and traditional regression approaches, determine if PoolArea has a statistically significant
interaction with any of the other covariates. You may have up to five interactions in
your model. Make sure to fully outline any tests and conclusions.
vii. (6 points) Based on the ordering of the covariates in your final model in the previous question,
using your personal training and personal testing data, along with the notion of squared error
loss, determine which interactions should be included in the model.
viii. (6 points) Consider all the covariates which we are using in this assignment: LotArea,
OverallCond, GrLivArea, FullBath, TotRmsAbvGrd, PoolArea. You may now consider any
modelling that you wish using your personal training data. You may also consider any type of
model selection approach (i.e. traditional or based on squared-error loss for the testing data).
Make sure to fully outline any tests and conclusions. Calculate the mean-squared error on
your personal testing data.
ix. (6 points) Using your final model from Question 1(b)viii and the Kaggle test data, submit a
prediction file to Kaggle. See Kaggle for details on what the file should look like. What was
your score and rank?
Note: as discussed on the site (https://www.kaggle.com/c/titanic/details/evaluation),
“[t]he Kaggle leader-board has a public and private component. 50% of your predictions
for the test set have been randomly assigned to the public leader-board (the same 50%
for all users). Your score on this public portion is what will appear on the leader-board.
At the end of the contest, we will reveal your score on the private 50% of the data, which
will determine the final winner. This method prevents users from ‘over-fitting’ to the
leader-board.”
x. (6 points) Examining the leader-board, you can see that one individual has a perfect score
(when I last looked). Is this surprising? What explanation might there be for this?
xi. (6 points) This Kaggle competition uses Root Mean Squared Logarithmic Error instead
of Mean Squared Error. Discuss the difference between the two criteria.
xii. (20 points) Provide a full discussion of your final model from Question 1(b)viii. This may
include, but is not limited to, discussions of the coefficients, visualisations of the fitted model,
and model checking.
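
For part (b), the instruction to split the training data in half (first half as personal training data, second half as personal test data) can be sketched as follows. This is only an illustration in Python: the helper name split_in_half and the toy rows are assumptions made for the example, the values are invented, and the handling of an odd number of rows is a choice the handout does not specify.

```python
def split_in_half(rows):
    """Return (first_half, second_half), keeping the original row order.

    With an odd number of rows the extra row goes to the second half.
    That tie-break is an assumption; the handout does not specify it.
    """
    midpoint = len(rows) // 2
    return rows[:midpoint], rows[midpoint:]


# Invented (GrLivArea, SalePrice) pairs, for illustration only.
toy_rows = [
    (1500, 200000), (1200, 150000), (1800, 230000),
    (1650, 175000), (2100, 260000), (1350, 145000),
]

personal_train, personal_test = split_in_half(toy_rows)
print(len(personal_train), len(personal_test))
```

No shuffling is done here because the handout asks for the first and second halves of the data as given.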
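
Parts iii, v and vii compare candidate models through squared error loss on the personal testing data. A minimal sketch of that idea for part iii, assuming polynomial regression on a single covariate: fit each degree on the training half, then compare mean squared error on the testing half. The data below are synthetic (a known quadratic plus small deterministic noise), and fit_polynomial and mean_squared_error are hypothetical helpers written in Python for this illustration; they are not part of the assignment.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X'X) b = X'y."""
    size = degree + 1
    # Augmented matrix [X'X | X'y]; entry (r, c) of X'X is the sum of x^(r+c).
    aug = [[sum(x ** (r + c) for x in xs) for c in range(size)]
           + [sum(y * x ** r for x, y in zip(xs, ys))]
           for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coefs = [0.0] * size
    for r in reversed(range(size)):
        known = sum(aug[r][c] * coefs[c] for c in range(r + 1, size))
        coefs[r] = (aug[r][size] - known) / aug[r][r]
    return coefs  # coefs[k] multiplies x**k


def mean_squared_error(coefs, xs, ys):
    """Average squared difference between observed and predicted values."""
    preds = [sum(b * x ** k for k, b in enumerate(coefs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic data: y = 1 + 2x + 1.5x^2 plus small deterministic noise.
# The x values are deliberately not sorted, so both halves cover the range.
n = 40
xs = [3.0 * ((37 * i) % n) / n for i in range(n)]
noise = [0.2 * (((17 * i) % 11) / 5.0 - 1.0) for i in range(n)]
ys = [1.0 + 2.0 * x + 1.5 * x ** 2 + e for x, e in zip(xs, noise)]

half = n // 2
x_train, y_train = xs[:half], ys[:half]
x_test, y_test = xs[half:], ys[half:]

test_mse = {}
for degree in range(1, 5):
    coefs = fit_polynomial(x_train, y_train, degree)
    test_mse[degree] = mean_squared_error(coefs, x_test, y_test)
    print(degree, round(test_mse[degree], 4))

best_degree = min(test_mse, key=test_mse.get)
print("degree with smallest test MSE:", best_degree)
```

Because the synthetic response is truly quadratic, the straight-line fit has a much larger test error than the quadratic fit; higher degrees may differ only slightly from the quadratic, which is the kind of trade-off the "How flexible should the model be?" question asks about.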

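
For part xi, the two criteria can be written out and compared on a small example. The sketch below, in Python, assumes the common definition of Root Mean Squared Logarithmic Error based on log(1 + value); whether the competition's own scoring adds the 1 is an assumption to check on Kaggle. The prices are invented for illustration.

```python
import math


def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error, using log(1 + value).

    The log(1 + value) form is the common convention and an assumption here.
    """
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Two invented houses, each over-predicted by 10%.
cheap_actual, cheap_pred = [100000], [110000]
dear_actual, dear_pred = [500000], [550000]

print("MSE  :", mse(cheap_actual, cheap_pred), mse(dear_actual, dear_pred))
print("RMSLE:", rmsle(cheap_actual, cheap_pred), rmsle(dear_actual, dear_pred))
```

Both houses are off by the same percentage, so their RMSLE values are nearly identical, while the squared error for the expensive house is 25 times larger. This illustrates that RMSLE measures relative error and MSE measures absolute error.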