Data analytics
Data analytics is best appreciated when applied to a dataset you are familiar with, and the aim of this project is to achieve that. Do not view this project as a hurdle in the course, but as a bridge connecting the topics you learnt to your work or subject domain. There are five main modules in this course:
- Module 1 : Normal Distribution (Percentile, distribution of means, and chance of occurrence if we assume normal distribution)
- Module 2 : Confidence Interval Estimation (Including Sample Size determination)
- Module 3 : Inferences from data (Hypothesis testing, i.e., confirming or checking if a claim made about
- Module 4 : More Inferences from data (Multiple samples)
- Module 5 : Regression analysis (Both simple and multiple, apart from basic ANOVA)
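To make the Module 1 topics concrete, here is a minimal sketch using only Python's standard library. The delivery-time population (mean 40 minutes, standard deviation 8 minutes) and the sample size of 36 are made-up assumptions for illustration; they do not come from any course dataset.

```python
from statistics import NormalDist

# Hypothetical population (assumption): delivery times, mean 40 min, SD 8 min.
population = NormalDist(mu=40, sigma=8)

# Percentile: the value below which 90% of observations fall.
p90 = population.inv_cdf(0.90)

# Chance of occurrence: probability that a single delivery exceeds 50 min.
p_over_50 = 1 - population.cdf(50)

# Distribution of means: for samples of size n, the sample mean keeps the
# population mean and has standard error sigma / sqrt(n).
n = 36
sample_means = NormalDist(mu=40, sigma=8 / n ** 0.5)
p_mean_over_42 = 1 - sample_means.cdf(42)

print(f"90th percentile: {p90:.1f} min")
print(f"P(single delivery > 50 min): {p_over_50:.3f}")
print(f"P(mean of {n} deliveries > 42 min): {p_mean_over_42:.3f}")
```

In your own project, the mean and standard deviation would be estimated from your dataset rather than assumed.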
Objective
Apply at least 4 modules to your dataset and make some inferences or estimations. Remember, each Hawkes Learning quiz had 10-15 questions; here I am asking you to do only 4 tests or analyses. The key is that you bring the data and you come up with the questions, and each question or set of analyses represents something you learnt from Modules 1-5. There should be four different ones. The aim is to understand the concepts you learnt in this course. If you wish, you can use two data sources (datasets) to achieve this; the analyses do not all have to be done on one dataset.
Data source
You may use any of the following (there are no restrictions on that):
- Bring your own data from work (you can remove any private or confidential information, for example: if you are bringing any sales or cost data of an item/product or service – the name can be masked)
- Use data from your previous work or company you have access to (again you can remove any private/confidential information)
- Use data from the public domain. In today’s world, there is no dearth of structured data. Note, however, that not all data are suitable for the project: you need a minimum number of data points (see the requirements below), and the dataset cannot be random numbers. Proper citation of the data source is required. Here are some places where you can get data from:
- Any data source you have access to, such as the Hawkes Learning resources
- Datasets (1) from Hawkes
- Datasets (2) from Hawkes - Look at the additional datasets, not the chapter datasets
- U.S. Bureau of Labor Statistics
- U.S. Government’s open data
- Centers for Medicare & Medicaid Services
- Kaggle datasets
- WHO Data repository
- World Bank Data
- Karami research lab's short link to public databases with data
- Google Public Data Explorer
- Amazing visualizations and graphics
- But remember, we need the data to do analysis. If you look at the bottom of any figure, Google provides the source name, and you can retrieve the data from there.
- Any sports data (from the appropriate website; getting data in a structured format for several years might be a challenge, but you can do it in a few minutes to an hour)
- For example, cricket data can be obtained from espncricinfo.
Requirements:
- No more than 1.5 to 2 pages.
- You should describe your source of data (including the data fields you have) and what you want to accomplish based on the topics you learnt.
- You can state the research hypothesis you plan to check, confidence intervals you plan to estimate, or test any relationship between variables you think is important.
- Remember - I need at least your plan based on the first three modules (see examples). No need for analysis, just what you plan to do.
- The key for the project: select datasets with at least 30 data points (more is better, for example 100 data points). Come up with sensible questions that need statistical validation!
- If the data is not good, you will receive 0 for the midterm report, but feedback will be given on how to fix and move towards the final project.
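As an illustration of what a plan item might eventually turn into (the report itself needs only the plan, not the analysis), here is a minimal sketch of a confidence interval and a hypothesis test for a mean. It assumes a normal approximation, which is why it insists on at least 30 data points; the function names and the `weekly_sales` values are hypothetical and invented for this example.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the mean (needs n >= 30)."""
    n = len(data)
    if n < 30:
        raise ValueError("need at least 30 data points for this approximation")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(data) / n ** 0.5
    centre = mean(data)
    return centre - margin, centre + margin

def z_test_mean(data, claimed_mean):
    """Two-sided z-test of H0: population mean equals claimed_mean."""
    n = len(data)
    z = (mean(data) - claimed_mean) / (stdev(data) / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data (assumption): 40 made-up weekly sales figures.
weekly_sales = [100 + (i * 7) % 23 for i in range(40)]

low, high = mean_confidence_interval(weekly_sales)
z, p_value = z_test_mean(weekly_sales, claimed_mean=115)
print(f"95% CI for mean weekly sales: ({low:.1f}, {high:.1f})")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A plan statement matching this sketch could read: "Estimate a 95% confidence interval for mean weekly sales, and test the claim that the mean is 115 units."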
I will provide feedback to each of you within 4 days (if you submit early, you get your feedback early). If I feel any change is needed, I will indicate that.
How are the 15 points given:
- Your Data: 5 points (Note: Remember, the sample size should be at least 30 data points to do any parametric tests; aim for at least 50 to 100+ data points for a Master's level project)
- Your plan of action: 10 points
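Before writing the plan, it may help to check that your dataset meets the size requirement. The sketch below counts the usable numeric values in one column of a CSV and compares the count with the thresholds above. The column name `cost`, the sample rows, and the choice of 50 as the cut-off for the recommended size are assumptions made for illustration.

```python
import csv
import io

MIN_POINTS = 30          # minimum stated in the requirements
RECOMMENDED_POINTS = 50  # assumed lower end of the "50 to 100+" recommendation

def count_usable_values(csv_text, column):
    """Count rows whose value in `column` parses as a number."""
    usable = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip missing or non-numeric entries
        usable += 1
    return usable

def size_verdict(n_points):
    """Describe how a data-point count compares with the thresholds."""
    if n_points < MIN_POINTS:
        return "too small"
    if n_points < RECOMMENDED_POINTS:
        return "meets the minimum"
    return "meets the recommended size"

# Hypothetical data (assumption): 35 rows with a cost, plus one blank entry.
sample_csv = (
    "item,cost\n"
    + "\n".join(f"item{i},{10 + i}" for i in range(35))
    + "\nitem35,\n"
)

n_usable = count_usable_values(sample_csv, "cost")
print(n_usable, size_verdict(n_usable))
```

For a real file, you would read its contents into a string first and pass the name of the column you plan to analyse.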