· For this assignment, you need to submit the following TWO files.

1. A written document (a single PDF only) covering all of the items described in the questions. All answers to the questions must be written in this document, i.e., not in the code file that you will also be submitting. All relevant results (outputs, figures) obtained by executing your R code must be included in this document.

For questions that involve mathematical formulas, you may write the answers by hand, scan them to PDF and combine them with your answer document. Submit a single combined PDF of your answer document.

2. A separate “.R” or “.txt” file containing the R code (script) that you implemented to produce the results. Name the file “name-StudentID-Ass2-Code.R” (where “name” is replaced with your name, either your surname or first name, and “StudentID” with your student ID).

· All documents and files should be submitted (uploaded) via the SIT 743 CloudDeakin Assignment Dropbox by the due date and time.

· Zip files are NOT accepted. Both files should be uploaded separately to CloudDeakin.

· E-mail or manual submissions are NOT allowed. Photos of the document are NOT allowed.

=================================================================

Assignment tasks

Q1) [40 Marks]

Melbourne City council conducted a survey to study the relationship between the type of dwelling and the income profile of the people living in Melbourne city. A list of factors that influence the type of dwelling, along with their possible values, and a Bayesian network that represents the relationship between these factors (variables) are given below.

E (Education) ∈ {Graduate, Non-graduate}

G (Gender) ∈ { Male, Female}

A (Age) ∈ {35 or less, 35+ }

S (Salary) ∈ {$50k or less per annum, $50k+ per annum}

J (Job type) ∈ {Professional, Laborer, Student, Unemployed}

M (Marriage) ∈ {Married, Widowed, Single}

D (Dwelling) ∈ {Rent, Own house}

Figure 1

1.1) Write down the joint distribution P(E, G, S, J, A, M, D) for the above network.

1.2) Find the minimum number of parameters required to fully specify the distribution according to the above network.

1.3)

a) Write down an expression for the joint probability distribution if no independence among the variables is assumed.

b) How many parameters are required, at a minimum, if no independence among the variables is assumed?

c) Compare with the result of the above question (Q1.2) and comment.
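The parameter counts asked for in Q1.2 and Q1.3 follow a simple rule: a node with r states and q joint parent configurations contributes (r - 1) * q free parameters, while an unfactorised joint over all variables needs (product of all state counts) - 1. The following is a minimal sketch of that rule, assuming a hypothetical three-node toy network (X -> Z <- Y) rather than the network in Figure 1; the function name `count_params` and the toy cardinalities are illustrative only.

```r
# Count the free parameters of a discrete Bayesian network.
# card:    named vector, number of states per node
# parents: named list, parent labels per node (character(0) for root nodes)
count_params <- function(card, parents) {
  total <- 0
  for (node in names(card)) {
    q <- prod(card[parents[[node]]])          # parent configurations (1 for a root)
    total <- total + (card[[node]] - 1) * q   # free entries in this node's CPT
  }
  total
}

# Hypothetical toy network X -> Z <- Y (NOT the network in Figure 1).
card    <- c(X = 2, Y = 3, Z = 2)
parents <- list(X = character(0), Y = character(0), Z = c("X", "Y"))

factored <- count_params(card, parents)  # 1 + 2 + 1 * 6 = 9
full     <- prod(card) - 1               # 2 * 3 * 2 - 1 = 11
```

For this toy case the factorised network needs 9 parameters against 11 for the full joint; the gap grows quickly as more variables and independencies are involved.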

1.4) From a previous study, the Melbourne city council found that marriage (M) is conditionally independent of job type (J), given salary (S) and age (A). The council wants to modify the Bayesian network given in Figure 1 to incorporate this new information. Assuming that marriage is conditionally independent of job type given salary and age, perform the following.

a) What change will happen to the Bayesian network (shown in Figure 1) when the above assumption is considered? Draw the new Bayesian network under this assumption (you may draw this by hand).

b) Compute the change in the minimum number of parameters required for this new Bayesian network, compared to the minimum number of parameters required for the Bayesian network shown in Figure 1. Comment on the results.

1.5) The d-separation method can be used to determine whether two sets of variables in a Bayesian network are independent or conditionally independent. Use the Bayesian network given in Figure 1 to answer the following:

For each of the statements/questions (a) and (b) given below, perform the following:

· List all the possible paths from the first (set of) node/s to the second (set of) node/s considered for the independence check.

· State if each of those paths is blocking or non-blocking with reasons.

· Hence, answer the question about independence.

a) Is dwelling (D) conditionally independent of gender (G) given salary (S) and job type (J)?

b) Is {E, G} ⊥ A | {D, J} ?

1.6) For the Bayesian network shown in Figure 1, find all the nodes that are conditionally independent of education (E) given marriage (M).

1.7) Write an R program to produce the Bayesian network shown in Figure 1, and perform the d-separation tests for the cases given below. Show the plot of the network you obtained and the d-separation test output from your program.

a) E ⊥ {A, G} | {S, M}

b) {S, A} ⊥ G | {E, J, D}
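As a rough illustration of how such a test can be run, the sketch below uses the `bnlearn` package on a hypothetical four-node DAG (it is not the network in Figure 1). `dsep()` in `bnlearn` tests a single pair of nodes, so the helper `dsep_sets` (an assumed name, not part of the package) checks every cross pair, which matches the definition of d-separation between sets.

```r
library(bnlearn)

# Hypothetical toy DAG: A -> C <- B, C -> D (NOT the network in Figure 1).
dag <- model2network("[A][B][C|A:B][D|C]")

# Two sets are d-separated when every cross pair is d-separated
# given the same conditioning set.
dsep_sets <- function(bn, xs, ys, zs = character(0)) {
  pair_ok <- function(x, y) {
    if (length(zs) == 0) dsep(bn, x, y) else dsep(bn, x, y, zs)
  }
  all(vapply(xs, function(x)
    all(vapply(ys, function(y) pair_ok(x, y), logical(1))),
    logical(1)))
}

r1 <- dsep(dag, "A", "B")                     # collider C unobserved: path blocked
r2 <- dsep(dag, "A", "B", "D")                # descendant of collider observed: path open
r3 <- dsep_sets(dag, c("A", "B"), "D", "C")   # {A, B} vs D given C
```

The network itself can be drawn with `plot(dag)`, or with `graphviz.plot(dag)` if Rgraphviz is installed.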

1.8) For the Bayesian network shown in Figure 1,

a) find the Markov blanket of job type (J).

b) find all the nodes that are conditionally independent of job type (J) given its Markov blanket.

c) use an R program to find the Markov blanket of salary (S). Plot the Bayesian network and show the Markov blanket nodes in a different colour.
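A Markov blanket consists of a node's parents, its children, and the other parents of its children. The sketch below shows one way this might be computed and highlighted with `bnlearn`, again on a hypothetical toy DAG rather than Figure 1.

```r
library(bnlearn)

# Hypothetical toy DAG (NOT Figure 1): A -> C <- B, C -> D, B -> E.
dag <- model2network("[A][B][C|A:B][D|C][E|B]")

# Markov blanket of A: its child C plus C's other parent B.
blanket <- mb(dag, "A")

# Base-graphics plot with the blanket nodes drawn in a different colour.
plot(dag, highlight = blanket, color = "red")
```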

1.9) For the Bayesian network shown in Figure 1,

a) show the step-by-step process to perform variable elimination to compute P(J | M = Married, A = 35+, E = Graduate). Use the following variable ordering for the elimination process: G, S, D.

b) what is the treewidth of the network, given the above elimination ordering?

[Marks 2+4+5+5+8+2+4+5+5 = 40]

Q2) [20 Marks] Implementing a Bayesian network in R and performing inference

A belief network models the relation between the variables T, W, H, R and S, which represent temperature, wind speed, humidity, precipitation, and solar radiation respectively. Each variable takes the states given below.

T (temperature) ∈ {cold, hot}

W (wind speed) ∈ {low, medium, high}

H (humidity) ∈ {low, high}

R (precipitation) ∈ {low, high}

S (solar radiation) ∈ {low, medium, high}

The belief network that models these variables has (probability) tables as shown below.

2.1) Use the libraries below to create this belief network in R, along with the probability values shown in the above tables.

You may use the following libraries for this:

#https://www.bioconductor.org/install/
#BiocManager::install(c("gRain", "RBGL", "gRbase"))
#BiocManager::install(c("Rgraphviz"))
library("Rgraphviz")
library(RBGL)
library(gRbase)
library(gRain)

#define the appropriate network and use the "compileCPT()" function to compile the list of conditional probability tables, and create the network.

a) Show the obtained belief network for this distribution

b) Show the probability tables obtained from the R output, (and verify with the above table).
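To show the general shape of the `gRain` workflow mentioned above, here is a minimal sketch on a hypothetical two-node network X -> Y. The structure and all probability values are made up for illustration and are not the tables of this question.

```r
library(gRain)

lv <- c("low", "high")

# Hypothetical two-node network X -> Y with made-up probabilities.
x <- cptable(~X, values = c(0.3, 0.7), levels = lv)
# Values are listed with the child varying fastest:
# P(Y=low|X=low), P(Y=high|X=low), P(Y=low|X=high), P(Y=high|X=high)
y <- cptable(~Y | X, values = c(0.9, 0.1, 0.4, 0.6), levels = lv)

plist <- compileCPT(list(x, y))   # compile the list of CPTs
net   <- grain(plist)             # build the network

plist$Y                                  # stored CPT, to check against the input
p_y <- querygrain(net, nodes = "Y")$Y    # marginal distribution of Y
```

Printing each element of `plist` is one way to verify that the values were entered in the intended order, which is the most common source of mistakes with `cptable()`.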

2.2) Use R program to compute the following probabilities:

a) Given that the temperature is cold, what is the probability that humidity is high?

b) Find the joint distribution of temperature, humidity and precipitation.

c) Given that the wind speed is medium and the precipitation is high, what is the probability that the solar radiation is high?

d) Find the marginal distribution of precipitation.

e) Find P(R=high | T=cold, H=high)

f) Find P(R=high | T=cold, H=high, S=low)

g) Find P(R=high | T=cold, H=high, W=medium)

h) Compare the results obtained in Q2.2 e), Q2.2 f), and Q2.2 g) above. Explain the reason for the observed behavior.
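Queries of this kind can be expressed with `setEvidence()` and `querygrain()`. The sketch below assumes a hypothetical chain X -> Y -> Z with made-up probabilities (not the network of this question) and shows joint and conditional queries, including a case where adding evidence does not change the answer.

```r
library(gRain)

lv <- c("low", "high")

# Hypothetical chain X -> Y -> Z with made-up probabilities.
x <- cptable(~X, values = c(0.3, 0.7), levels = lv)
y <- cptable(~Y | X, values = c(0.9, 0.1, 0.4, 0.6), levels = lv)
z <- cptable(~Z | Y, values = c(0.8, 0.2, 0.25, 0.75), levels = lv)
net <- grain(compileCPT(list(x, y, z)))

# Joint distribution of two nodes.
joint_xy <- querygrain(net, nodes = c("X", "Y"), type = "joint")

# P(Z | Y = high)
ev1 <- setEvidence(net, nodes = "Y", states = "high")
p1  <- querygrain(ev1, nodes = "Z")$Z

# P(Z | Y = high, X = low): in this chain, X adds nothing once Y is known.
ev2 <- setEvidence(net, nodes = c("Y", "X"), states = c("high", "low"))
p2  <- querygrain(ev2, nodes = "Z")$Z
```

In this toy chain the two conditional distributions coincide because Y blocks the only path from X to Z; whether the same happens in another network depends on its structure.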

[Marks: (3+3) + (2+2+2+2+1+1+1+3) = 20]

Q3) [16 Marks]

Consider five binary variables A, B, C, D and E. The Directed Acyclic Graph (DAG) shown below describes the relationship between these variables along with their conditional probability tables (CPT).

3.1) Obtain an expression (in a simplified form) for

P(C = 1|A = 0, B = 0, D = 1, E = 0). (Show the steps clearly).

3.2) The table shown below provides 20 simulated data records obtained for the above Bayesian network. Use this data to find the maximum likelihood estimates of α, θ, þ, h and a.

3.3) Find the value of P(C = 1|A = 0, B = 0, D = 1, E = 0) using the appropriate values obtained from the above question Q3.2.

[Marks 9 + 5 + 2 = 16]

Q4) Bayesian Structure Learning [26 Marks]

For this question, you will be using a dataset called “alarm”, available from the “bnlearn” R package, which contains 37 variables. The underlying network provides an alarm message system for patient monitoring.

Use the following R code to load the alarm dataset:

library(bnlearn)
# load the data.
data(alarm)
summary(alarm)

The true network structure of this dataset can be viewed (plot) using the following R code.

Use R programming, as appropriate, to answer the following questions.

4.1) Use the alarm dataset to learn Bayesian network structures using the hill-climbing (hc) algorithm with two different scoring methods, namely the Bayesian Information Criterion score (BIC score) and the Bayesian Dirichlet equivalent score (BDe score), for each of the following sample sizes of the data:

a) 100 (first 100 data)

b) 1000 (first 1000 data)

c) 15000 (first 15000 data)

For each of the above cases,

· provide the scores obtained for BIC and BDe,

· plot the network structure obtained for the BIC and BDe scores.
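One possible way to organise this experiment is sketched below for the smallest sample size only. The helper name `learn_first_n` is an assumption introduced here; `hc()`, `score()` and `narcs()` are `bnlearn` functions. With only 100 rows some factor levels may be unobserved, so warnings are possible.

```r
library(bnlearn)
data(alarm)

# Learn a structure from the first n rows with a given score,
# then report the score of the learned network on the same rows.
learn_first_n <- function(n, score_type) {
  d   <- alarm[seq_len(n), ]
  net <- hc(d, score = score_type)
  list(net = net, score = score(net, d, type = score_type))
}

bic_100 <- learn_first_n(100, "bic")
bde_100 <- learn_first_n(100, "bde")

bic_100$score          # BIC of the BIC-learned network
bde_100$score          # BDe of the BDe-learned network
narcs(bic_100$net)     # number of arcs, a quick structural summary
# graphviz.plot(bic_100$net)   # plotting needs Rgraphviz
```

The same helper can be called with the other sample sizes. Note that BIC and BDe values are on different scales, so they are more naturally compared through the structures they select than through their raw numbers.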

4.2) Based on the results obtained for the above question (Q4.1), discuss how the BIC score compares with the BDe score for different sample sizes in terms of the structure and score of the learned network.

4.3)

a) Find the Bayesian network structures utilising the full dataset, using both the BIC and BDe scores. Show the scores and the obtained networks.

b) Compare the networks obtained above (in Q4.3.a) for each of the BIC and BDe scoring methods with the true network structure and comment. Use the “compare()” and “graphviz.compare()” functions available in the “bnlearn” R package to perform these comparisons.

c) Fit the data to the network obtained using the BIC score in the above question (Q4.3.a) in order to compute the conditional probability distribution table entries (CPD table values). Show the obtained CPD table entries for the variable “ECO2”.

d) Use the learned network obtained above (in Q4.3.c) to find the following probability:

P(BP="HIGH" | STKV ="LOW", HR ="NORMAL", SAO2="NORMAL").
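The steps in Q4.3 map onto a handful of `bnlearn` calls. The sketch below is one hedged way to chain them; because the code for the true structure is not reproduced here, it compares the two learned networks with each other, and the same `compare()` call would apply to the true network once it is available. `cpquery()` relies on sampling, so its result is approximate and depends on the seed.

```r
library(bnlearn)
data(alarm)

bic_net <- hc(alarm, score = "bic")
bde_net <- hc(alarm, score = "bde")

# Arc-level agreement between two structures (tp / fp / fn counts).
cmp <- compare(target = bic_net, current = bde_net)

# Maximum likelihood estimates of the CPTs for the BIC network.
fit <- bn.fit(bic_net, alarm)
fit$ECO2   # CPD table of one variable

# Approximate conditional probability query by sampling.
set.seed(1)
p <- cpquery(fit,
             event    = (BP == "HIGH"),
             evidence = (STKV == "LOW") & (HR == "NORMAL") & (SAO2 == "NORMAL"),
             n = 1e5)
```

Increasing `n` reduces the sampling noise in `p`; an exact alternative would be to convert the fitted network for use with `gRain`, which is not shown here.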

[Marks (3*4) + 3 + (4+3+2+2) = 26]

Q5) Research based questions (Practical applications in real world – Bayesian network) [38 Marks]

This is an HD (High Distinction) level question. Students who are targeting an HD grade should answer this question (in addition to all of the above questions). For others, this question is optional. This question aims to demonstrate your expertise in the subject area and your ability to do your own research in the related area.

a) Download the following article from the link provided below. Read that article and answer the following questions. This article provides a real life case study on creating and using a Bayesian network for road accident data analysis.

Ali Karimnezhad & Fahimeh Moradi (2017), Road accident data analysis using Bayesian networks, Transportation Letters, 9:1, 12-19,

DOI: 10.1080/19427867.2015.1131960

Web: https://www.tandfonline.com/doi/full/10.1080/19427867.2015.1131960

Note that you will be able to download this paper via Deakin library using your Deakin credentials (username and password). (https://www.deakin.edu.au/library/help/add-browser-bookmarklet)

i) Describe the dataset used for their analysis. What are the variables used? Are the variables numerical or categorical or mixed? How many records of data have been used?

ii) What is the name of the algorithm used for learning the Bayesian network structure? What software tool has been used to build and visualize the Bayesian network? Provide a web link to that software.

iii) Read the section titled “Parameter learning in the road accident network” in that paper and extract the following probability values that they have computed, and mention them:

I. The probability of being injured while wearing seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license.

II. The probability of death while not wearing the seatbelt, knowing that the driver has a diploma degree and a type 2 driving license

III. The probability of being injured while not wearing the seatbelt, knowing that the driver has a diploma degree and a type 2 driving license

IV. The probability of death while wearing seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license

V. Based on the probability values obtained above, what conclusions are made?

b) Read the following article that explains modeling air pollution, climate and health data using Bayesian network.

Vitolo C., Scutari M., Ghalaieny M., Tucker A., & Russell A. (2018). Modeling air pollution, climate, and health data using Bayesian Networks: A case study of the English regions. Earth and Space Science, 5, 76–88. https://doi.org/10.1002/2017EA000326

Note that you will be able to download this paper via Deakin library using your Deakin credentials (username and password). (https://www.deakin.edu.au/library/help/add-browser-bookmarklet)

This paper used pollution data, weather data and health data to produce a Bayesian network for studying the link between pollution and health. It used a large dataset (big data) for the analysis. Table 1 of the paper explains the variables considered for their analysis.

In this task you will implement a similar model using a small dataset from Australia. You will use pollution data, weather data and health data from Australia to produce the Bayesian network. You can choose either a single state or the whole of Australia for your analysis. Instead of using big data, you should use a small dataset (of a size of your choice) that can be loaded and run on your computer, so choose the data size wisely.

You need to perform the following tasks and prepare a report explaining the details of the full implementation and results. You should also upload the programming code that you used to implement and produce your results, along with details that explain how to run the code, including all of the relevant packages/libraries to use (do not upload packages; give only the links/library names needed so that they can be loaded and executed).

· Find the appropriate data and clean them (see below for some suggestions). In the report, you should clearly explain what data you used, how you cleaned them and how you handled the missing data, if any, and other processing performed in preparing the data. Include details about the period of data considered.

· Consider no more than 20 variables for your task. Explain the variables you have chosen in the report. You can either convert the continuous variables into discrete variables and perform the analysis (produce the Bayesian network), or use a mix of continuous and categorical variables. This is your choice, and it should be explained clearly in the report.

· Perform an exploratory analysis on the variables you have selected. Provide some relevant visualizations, such as histogram plots, and summary statistics for the variables considered (relevant results may be presented in table form), and briefly describe them.

· Perform appropriate Bayesian structure learning and parameter learning to produce the Bayesian networks using your data. You may choose to include blacklists (i.e., using prior knowledge to guide the learning by excluding certain edges) in your structure learning process, if needed. Experiment with at least two methods and compare the Bayesian networks produced. You may choose to use recent Bayesian structure learning algorithms published in academic journal/conference papers in your analysis. Note that including more recent algorithms will increase the chances of scoring more marks for this task. Implement the relevant code in R and produce the Bayesian network. You may use any existing code and modify it to suit your needs, but proper referencing and comments should be included, along with a clear explanation of how it is used and where the changes are made. The code file should be uploaded along with the submission. The report should have clear explanations (technical details) of the algorithms used.

· The report should explain the methods used to produce the Bayesian network, the settings/parameters/metrics used, if any, and the results obtained, including the Bayesian network structure, with a clear discussion of the obtained network. The report should have enough detail that the results can be reproduced exactly from what is reported.

· The report for this question, Q5 (b), should not exceed 4 pages, including figures, tables and references. Note that the report should be clearly and neatly written and presented with proper subheadings, including appropriate tables, diagrams, plots, results, etc. The paper above is a good example of how to present the details clearly and professionally.
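The blacklist and method-comparison steps described in the tasks above could look roughly like the following sketch. It uses `learning.test`, a small discrete dataset shipped with `bnlearn`, purely as stand-in data, and the forbidden arc A -> B is a hypothetical piece of prior knowledge chosen for illustration.

```r
library(bnlearn)
data(learning.test)   # small discrete dataset shipped with bnlearn (stand-in data)

# Hypothetical prior knowledge: forbid the arc A -> B.
bl <- data.frame(from = "A", to = "B")

# Two score-based learners with the same blacklist.
net_hc   <- hc(learning.test, blacklist = bl)
net_tabu <- tabu(learning.test, blacklist = bl)

# Arc-level comparison of the two learned structures.
cmp    <- compare(target = net_hc, current = net_tabu)
a_hc   <- arcs(net_hc)
```

With real pollution, weather and health data, a blacklist would more plausibly encode directions that cannot be causal (for example, a health outcome pointing to a weather variable); that choice is up to the analyst and should be justified in the report.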

You are free to choose a suitable time period for your analysis, depending on the data availability, for example three years.

The following links might be helpful for finding the relevant data (do your own search; you are free to choose other relevant sites as appropriate for your data and analysis, and remember to explain them in the report with references):

Pollution data:

· https://aqicn.org/map/australia/

· https://www.epa.vic.gov.au/for-community/airwatch

· Choose the number of stations wisely, depending on the region you consider for analysis.

· https://www.dpie.nsw.gov.au/air-quality/air-quality-concentration-data-updated-hourly

Weather Data:

· http://www.bom.gov.au/climate/data/

Health data:

· Underlying causes of death (Australia) https://www.abs.gov.au/statistics/health/causes-death/causes-death-australia/2019/3303_1%20Underlying%20causes%20of%20death%20%28Australia%29.xlsx

You can consider the ICD-10 code range (International Classification of Diseases codes) for cardiovascular-pulmonary diseases (CVD), which is “J00-J99”. This range accounts for diseases of the respiratory system. For example, in Table 1.2 (tab) of the Excel sheet from the above link, you can see yearly data (see row number 786) for diseases of the respiratory system.

NOTE: Your report for all of the above questions must be written in your own words. If text is copied directly from the paper/reference text, or if any sign of collusion is detected, zero marks will be given and the case will be reported for plagiarism and collusion processing.

Saturday, 12 June 2021

Bayesian Learning and Graphical Models Assignment
