Web Health Informatics Homework
Grading criteria
1-2/5 – Submitted but not functional, 2-3/5 – Partially functioning, 4-5/5 – Functioning
For this assignment you will work with two health-related datasets and build regression models. The work has two steps. First, find good predictors for your regression models in R. Second, modify the provided Python code to get one of your models working with scikit-learn.
Part A
Download and install R and RStudio. Import the health datasets and open the regression.R code. A regression model is written in the form itemToPredict ~ predictor1 + predictor2 + … + predictorN. Examine the data in both datasets. For braincancer.csv, predict the status variable (0 – alive, 1 – deceased). For Heart.csv, predict AHD (Acquired Heart Disease, 0 – no, 1 – yes).
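Since Part B moves a model from R to Python, it can help to see how a formula breaks down into one column to predict and a list of predictor columns. The sketch below is only an illustration: parse_formula is a hypothetical helper, not part of R, scikit-learn, or the provided code, and it handles only the plain additive form shown above.

```python
def parse_formula(formula):
    """Split 'target ~ a + b' into ('target', ['a', 'b']).

    Hypothetical helper for illustration; it only understands the
    plain additive form and no other R formula syntax.
    """
    target, sep, rhs = formula.partition("~")
    if not sep:
        raise ValueError("formula must contain '~'")
    # Each '+'-separated term on the right-hand side is one predictor.
    predictors = [term.strip() for term in rhs.split("+") if term.strip()]
    return target.strip(), predictors


target, predictors = parse_formula("itemToPredict ~ predictor1 + predictor2")
print(target, predictors)
```

In scikit-learn terms, the target becomes the y values and the predictors become the columns of X.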
Use R’s summary function to see how well your chosen models fit. For instance, the default code below does not predict status well: only one of the two predictors is statistically significant, and only marginally, and the Adjusted R-Squared value is very low. When you have found good predictors, paste the summary output for both datasets as comments in your Python file for Part B.
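To make the Adjusted R-Squared value easier to interpret, the sketch below computes it from the ordinary R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. The function name and the numbers in the example are made up for illustration and do not come from either dataset.

```python
def adjusted_r_squared(r_squared, n_rows, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    denom = n_rows - n_predictors - 1
    if denom <= 0:
        raise ValueError("need more rows than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_rows - 1) / denom


# Illustrative values only: the same R-squared reached with more
# predictors gives a lower adjusted value.
few = adjusted_r_squared(0.10, n_rows=88, n_predictors=2)
many = adjusted_r_squared(0.10, n_rows=88, n_predictors=6)
print(few, many)
```

The adjustment penalizes extra predictors, so adding a variable only raises the adjusted value if it improves the fit enough to offset the penalty.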
Part B
Open sample_code.py (you will rename this to username_A3.py later). Modify the Python code so that you can train one of your selected regression models in scikit-learn. If your model uses text-valued predictors, you may have to re-code them into numeric values, either in your code or in Excel.
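As a rough sketch of the re-coding step, the example below maps a text column to integer codes and fits scikit-learn's LinearRegression. Everything about the data is an assumption: the toy table, the column names (age, chest_pain, ahd), and the recode helper are hypothetical stand-ins, not the contents of Heart.csv or sample_code.py, and LinearRegression is used here only as one plausible counterpart to the R model.

```python
import csv
import io

from sklearn.linear_model import LinearRegression

# Toy stand-in data; column names and values are hypothetical.
TOY_CSV = """age,chest_pain,ahd
63,typical,0
67,asymptomatic,1
41,nonanginal,0
56,asymptomatic,1
48,nonanginal,0
59,typical,1
"""


def recode(values):
    """Map each distinct text value to an integer code (sorted order)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
pain_codes, mapping = recode([r["chest_pain"] for r in rows])

# X holds the predictors (one row per patient); y holds the value to predict.
X = [[float(r["age"]), code] for r, code in zip(rows, pain_codes)]
y = [int(r["ahd"]) for r in rows]

model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # R-squared on the training data
print(mapping, model.coef_, r_squared)
```

Integer codes impose an arbitrary ordering on the categories. One-hot encoding (one 0/1 column per category) is a common alternative when that ordering is not meaningful.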
Submit a single Python file, username_A3.py, to the A3 dropbox. Note that there are multiple valid solutions to this assignment.