· For this assignment, you need to submit the following TWO files.
1. A written document (a single PDF only) covering all of the items described in the questions. All answers to the questions must be written in this document, i.e., not in the other files (code files) that you will be submitting. All the relevant results (outputs, figures) obtained by executing your R code must be included in this document.
For questions that involve mathematical formulas, you may write the answers by hand, scan them to PDF, and combine the scan with your answer document. Submit a single combined PDF of your answer document.
2. A separate “.R” file or “.txt” file containing the code (R script) that you implemented to produce the results. Name the file “name-StudentID-Ass2-Code.R” (where “name” is replaced with your name, either your surname or first name, and “StudentID” with your student ID).
· All the documents and files should be submitted (uploaded) via the SIT743 CloudDeakin Assignment Dropbox by the due date and time.
· Zip files are NOT accepted. Both files should be uploaded separately to CloudDeakin.
· E-mail or manual submissions are NOT allowed. Photos of the document are NOT allowed.
=================================================================
Assignment tasks
Q1) [40 Marks]
Melbourne City Council conducted a survey to study the relationship between the type of dwelling and the income profile of the people living in Melbourne city. A list of factors that influence the type of dwelling, along with their possible values, and a Bayesian network that represents the relationship between these factors (variables) are given below.
E (Education) ∈ {Graduate, Non-graduate}
G (Gender) ∈ {Male, Female}
A (Age) ∈ {35 or less, 35+}
S (Salary) ∈ {$50k or less per annum, $50k+ per annum}
J (Job type) ∈ {Professional, Laborer, Student, Unemployed}
M (Marriage) ∈ {Married, Widowed, Single}
D (Dwelling) ∈ {Rent, Own house}
Figure 1
1.1) Write down the joint distribution P(E, G, S, J, A, M, D) for the above network.
1.2) Find the minimum number of parameters required to fully specify the distribution according to the above network.
1.3)
a) Write down the joint probability distribution if no independence among the variables is assumed.
b) How many parameters are required, at a minimum, if no independence among the variables is assumed?
c) Compare this with the result of Q1.2 and comment.
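As a rough illustration of the counting rule behind Q1.2 and Q1.3, the sketch below uses a hypothetical three-node network (X → Z ← Y, not the network in Figure 1). It assumes the usual rule that a node with r states and q joint parent configurations needs (r − 1) × q free parameters, while an unrestricted joint distribution needs (product of all state counts) − 1. The names n_states, parents and count_params are illustrative only.

```r
# Number of states of each (hypothetical) variable.
n_states <- c(X = 2, Y = 3, Z = 2)

# Parent sets of a toy network X -> Z <- Y (not the network in Figure 1).
parents <- list(X = character(0), Y = character(0), Z = c("X", "Y"))

# A node with r states and q parent configurations needs (r - 1) * q free parameters.
count_params <- function(n_states, parents) {
  per_node <- sapply(names(n_states), function(v) {
    q <- prod(n_states[parents[[v]]])  # prod() of an empty selection is 1
    (n_states[[v]] - 1) * q
  })
  sum(per_node)
}

bn_params <- count_params(n_states, parents)  # parameters under the toy network
full_params <- prod(n_states) - 1             # parameters with no independence assumed
cat("Network:", bn_params, " Full joint:", full_params, "\n")
```

The gap between the two counts grows quickly with the number of variables, which is the kind of comparison Q1.3 c) asks you to comment on.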
1.4) From a previous study, the Melbourne City Council found that Marriage is conditionally independent of Job type, given Salary and Age. The council wants to modify the Bayesian network given in Figure 1 to incorporate this new information. Assuming that Marriage is conditionally independent of Job type, given Salary and Age, do the following.
a) What change does this assumption make to the Bayesian network shown in Figure 1? Draw the new Bayesian network under this assumption (you may draw it by hand).
b) Compute the change in the minimum number of parameters required for the new Bayesian network, compared with the minimum number of parameters required for the Bayesian network shown in Figure 1. Comment on the results.
1.5) The d-separation method can be used to determine whether two sets of variables in a Bayesian network are independent or conditionally independent. Use the Bayesian network given in Figure 1 to answer the following.
For each of the statements/questions (a) and (b) given below, do the following:
· List all the possible paths from the first (set of) node(s) to the second (set of) node(s) considered for the independence check.
· State whether each of those paths is blocked or unblocked, with reasons.
· Hence, answer the question about independence.
a) Is dwelling (D) conditionally independent of gender (G) given salary (S) and job type (J)?
b) Is {E, G} ⊥ A | {D, J}?
1.6) For the Bayesian network shown in Figure 1, find all the nodes that are conditionally independent of education (E) given marriage (M).
1.7) Write an R program to produce the Bayesian network shown in Figure 1 and perform the d-separation tests for the cases given below. Show the plot of the network you obtained and the output (of the d-separation tests) from your program.
a) E ⊥ {A, G} | {S, M}
b) {S, A} ⊥ G | {E, J, D}
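One possible way to run such tests is the dsep() function of the bnlearn package, which checks a single pair of nodes given a conditioning set. The sketch below uses a hypothetical four-node DAG, not the network in Figure 1. The helper dsep_sets is an assumption of this example (it is not part of bnlearn) and relies on the fact that two sets are d-separated exactly when every cross pair of nodes is.

```r
library(bnlearn)

# Hypothetical toy DAG: A -> C <- B (a collider) and C -> D.
dag <- model2network("[A][B][C|A:B][D|C]")

# Optional plot; graphviz.plot() needs the Rgraphviz package.
if (requireNamespace("Rgraphviz", quietly = TRUE)) graphviz.plot(dag)

# dsep() tests one pair of nodes at a time, given a conditioning set z.
ab_marginal <- dsep(dag, x = "A", y = "B")           # collider C is unobserved
ab_given_c  <- dsep(dag, x = "A", y = "B", z = "C")  # observing C opens the path
ad_given_c  <- dsep(dag, x = "A", y = "D", z = "C")  # C blocks the chain A -> C -> D

# Hypothetical helper: two sets are d-separated when every cross pair is.
dsep_sets <- function(dag, xs, ys, z) {
  pairs <- expand.grid(x = xs, y = ys, stringsAsFactors = FALSE)
  all(mapply(function(x, y) dsep(dag, x, y, z), pairs$x, pairs$y))
}
set_result <- dsep_sets(dag, xs = c("A", "B"), ys = "D", z = "C")

print(c(ab_marginal, ab_given_c, ad_given_c, set_result))
```

For the assignment, the model string would have to encode the arcs of Figure 1 instead of this toy structure.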
1.8) For the Bayesian network shown in Figure 1,
a) find the Markov blanket of job type (J).
b) find all the nodes that are conditionally independent of job type (J) given its Markov blanket.
c) use an R program to find the Markov blanket of salary (S). Plot the Bayesian network and show the Markov blanket nodes in the network using a different colour.
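A minimal sketch of how a Markov blanket could be obtained and highlighted with bnlearn, assuming a hypothetical five-node DAG rather than the network in Figure 1. It uses mb() for the blanket and the highlight argument of graphviz.plot() for colouring; the colours are arbitrary choices.

```r
library(bnlearn)

# Hypothetical toy DAG: A -> C -> D <- B and D -> E.
dag <- model2network("[A][B][C|A][D|B:C][E|D]")

# Markov blanket of C: its parent (A), its child (D) and the child's other parent (B).
blanket <- mb(dag, node = "C")

# Every remaining node is d-separated from C given the blanket.
others <- setdiff(nodes(dag), c("C", blanket))
separated <- sapply(others, function(v) dsep(dag, "C", v, blanket))

# Highlight the blanket when Rgraphviz is installed.
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  graphviz.plot(dag, highlight = list(nodes = blanket, col = "red", fill = "orange"))
}

print(blanket)
print(separated)
```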
1.9) For the Bayesian network shown in Figure 1,
a) show the step-by-step process of performing variable elimination to compute P(J | M = Married, A = 35+, E = Graduate). Use the following variable ordering for the elimination process: G, S, D.
b) what is the treewidth of the network, given the above elimination ordering?
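The mechanics of variable elimination (restrict factors to the evidence, multiply the factors that mention the variable being eliminated, sum that variable out, then normalise) can be sketched in base R on a hypothetical binary chain A → B → C. All probabilities below are made up, and the query P(A | C = c1) is only an illustration, not the query of Q1.9.

```r
# Hypothetical binary chain A -> B -> C (not the network in Figure 1).
p_a   <- c(a1 = 0.6, a2 = 0.4)                                          # P(A)
p_b_a <- matrix(c(0.7, 0.3,
                  0.2, 0.8), nrow = 2, byrow = TRUE,
                dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))  # P(B | A)
p_c_b <- matrix(c(0.9, 0.1,
                  0.5, 0.5), nrow = 2, byrow = TRUE,
                dimnames = list(B = c("b1", "b2"), C = c("c1", "c2")))  # P(C | B)

# Step 1: restrict the factor containing C to the evidence C = c1.
f_b <- p_c_b[, "c1"]                # factor over B only

# Step 2: eliminate B by multiplying the factors that mention B and summing B out.
tau_a <- as.vector(p_b_a %*% f_b)   # tau(A) = sum_B P(B | A) * P(c1 | B)

# Step 3: multiply the remaining factors and normalise over the query variable A.
unnorm <- p_a * tau_a               # P(A, C = c1)
posterior <- unnorm / sum(unnorm)   # P(A | C = c1)
print(round(posterior, 4))
```

Keeping track of how many variables appear in each product formed before summing out (here at most two, A and B) is the bookkeeping that part b) builds on.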
[Marks 2+4+5+5+8+2+4+5+5 = 40]
Q2) [20 Marks] Implementing a Bayesian network in R and performing inference
A belief network models the relationship between the variables T, W, H, R and S, which represent the temperature, wind speed, humidity, precipitation, and solar radiation respectively. Each variable takes different states, as given below.
T (temperature) ∈ {cold, hot}
W (wind speed) ∈ {low, medium, high}
H (humidity) ∈ {low, high}
R (precipitation) ∈ {low, high}
S (solar radiation) ∈ {low, medium, high}
The belief network that models these variables has (probability)
tables as shown below.
2.1) Use the libraries below to create this belief network in R, along with the probability values shown in the above table.
You may use the following libraries for this:
# https://www.bioconductor.org/install/
# BiocManager::install(c("gRain", "RBGL", "gRbase"))
# BiocManager::install(c("Rgraphviz"))
library("Rgraphviz")
library(RBGL)
library(gRbase)
library(gRain)
# Define the appropriate network and use the "compileCPT()" function
# to compile the list of conditional probability tables, and create the network.
a) Show the belief network obtained for this distribution.
b) Show the probability tables obtained from the R output (and verify them against the above table).
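As a hedged sketch of the workflow named above (cptable(), compileCPT() and grain() from the gRain package), the example below builds a hypothetical two-node network rain → wet. The node names and all probabilities are invented for illustration and are unrelated to the tables of this question.

```r
library(gRain)

# Hypothetical two-node network rain -> wet (not the tables of this question).
lv <- c("yes", "no")
rain <- cptable(~rain, values = c(0.2, 0.8), levels = lv)
# Values are listed with the child (wet) varying fastest within each parent state.
wet <- cptable(~wet | rain, values = c(0.9, 0.1, 0.3, 0.7), levels = lv)

plist <- compileCPT(list(rain, wet))  # check and compile the list of CPTs
net <- grain(plist)                   # build the network

print(plist$wet)                             # conditional table P(wet | rain)
p_wet <- querygrain(net, nodes = "wet")$wet  # marginal P(wet)
print(p_wet)
```

Calling plot(net) should draw the graph, provided a graph-plotting backend such as Rgraphviz is installed.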
2.2) Use an R program to compute the following probabilities:
a) Given that the temperature is cold, what is the probability that the humidity is high?
b) Find the joint distribution of temperature, humidity and precipitation.
c) Given that the wind speed is medium and the precipitation is high, what is the probability that the solar radiation is high?
d) Find the marginal distribution of precipitation.
e) Find P(R=high | T=cold, H=high)
f) Find P(R=high | T=cold, H=high, S=low)
g) Find P(R=high | T=cold, H=high, W=medium)
h) Compare the results obtained in Q2.2 e), Q2.2 f), and Q2.2 g) above. Explain the reason for the observed behavior.
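Queries of this kind can be sketched with querygrain() and setEvidence() from gRain. The example assumes a hypothetical chain cloud → rain → wet with made-up probabilities, so the numbers say nothing about the network of this question.

```r
library(gRain)

# Hypothetical chain cloud -> rain -> wet; all probabilities are made up.
lv <- c("yes", "no")
net <- grain(compileCPT(list(
  cptable(~cloud, values = c(0.5, 0.5), levels = lv),
  cptable(~rain | cloud, values = c(0.8, 0.2, 0.1, 0.9), levels = lv),
  cptable(~wet | rain, values = c(0.9, 0.1, 0.3, 0.7), levels = lv)
)))

# Marginal and joint distributions.
p_rain <- querygrain(net, nodes = "rain")$rain
p_joint <- querygrain(net, nodes = c("cloud", "rain"), type = "joint")

# Conditional queries: enter evidence, then ask for the marginal of the target.
ev1 <- setEvidence(net, nodes = "rain", states = "yes")
ev2 <- setEvidence(net, nodes = c("rain", "cloud"), states = c("yes", "no"))
q1 <- querygrain(ev1, nodes = "wet")$wet[["yes"]]  # P(wet = yes | rain = yes)
q2 <- querygrain(ev2, nodes = "wet")$wet[["yes"]]  # P(wet = yes | rain = yes, cloud = no)
print(c(q1 = q1, q2 = q2))
```

In this toy chain q1 and q2 coincide because wet is d-separated from cloud once rain is observed. Whether extra evidence changes an answer in your own network depends on its structure.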
[Marks: (3+3) + (2+2+2+2+1+1+1+3) = 20]
Q3) [16 Marks]
Consider five binary variables A, B, C, D and E. The Directed Acyclic Graph (DAG) shown below describes the relationship between these variables, along with their conditional probability tables (CPT).
3.1) Obtain an expression (in a simplified form) for P(C = 1 | A = 0, B = 0, D = 1, E = 0). (Show the steps clearly.)
3.2) The table shown below provides 20 simulated data points obtained from the above Bayesian network. Use this data to find the maximum likelihood
estimates of α, θ,
þ, h
and a.
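For binary variables, the maximum likelihood estimate of a CPT entry is a relative frequency: the number of rows matching both the child value and the parent configuration, divided by the number of rows matching the parent configuration. A minimal base R sketch on made-up data, assuming a hypothetical edge A → C with eight rows (not the 20 simulated records of this question):

```r
# Made-up binary data for a hypothetical edge A -> C.
d <- data.frame(A = c(0, 0, 1, 1, 1, 0, 1, 1),
                C = c(0, 1, 1, 1, 0, 0, 1, 1))

# MLE of P(A = 1): the fraction of rows with A = 1.
p_A1 <- mean(d$A == 1)

# MLE of P(C = 1 | A = a): the fraction of rows with C = 1 within each value of A.
p_C1_given_A <- tapply(d$C == 1, d$A, mean)

print(p_A1)
print(p_C1_given_A)
```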
3.3) Find the value of P(C = 1 | A = 0, B = 0, D = 1, E = 0) using the appropriate values obtained in Q3.2.
[Marks 9 + 5 + 2 = 16]
Q4) Bayesian Structure Learning [26 Marks]
For this question, you will be using a dataset called “alarm”, available from the “bnlearn” R package, which contains 37 variables. The underlying network provides an alarm message system for patient monitoring.
Use the following R code to load the alarm dataset:
library(bnlearn)
# load the data.
data(alarm)
summary(alarm)
The true network structure of this dataset can be viewed (plotted) using the following R code.
Use R programming, as appropriate, to answer the following questions.
4.1) Use the alarm dataset to learn Bayesian network structures using the hill-climbing (hc) algorithm with two different scoring methods, namely the Bayesian Information Criterion score (BIC score) and the Bayesian Dirichlet equivalent score (BDe score), for each of the following sample sizes of the data:
a) 100 (first 100 data points)
b) 1000 (first 1000 data points)
c) 15000 (first 15000 data points)
For each of the above cases,
· provide the scores obtained for BIC and BDe,
· plot the network structure obtained for the BIC and BDe scores.
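A sketch of the kind of loop this involves, using hc() and score() from bnlearn. To keep the example small it runs on learning.test, another discrete dataset shipped with bnlearn, as a stand-in; the same calls should apply to row subsets of alarm. The helper learn_both and the chosen sizes are assumptions of this example.

```r
library(bnlearn)

# learning.test is a small discrete data set shipped with bnlearn (stand-in only).
data(learning.test)

# Hypothetical helper: learn one structure per score and report both scores.
learn_both <- function(d) {
  net_bic <- hc(d, score = "bic")
  net_bde <- hc(d, score = "bde")
  list(bic_net = net_bic, bde_net = net_bde,
       bic = score(net_bic, d, type = "bic"),
       bde = score(net_bde, d, type = "bde"))
}

sizes <- c(100, 1000)
results <- lapply(sizes, function(n) learn_both(learning.test[seq_len(n), ]))
names(results) <- paste0("n", sizes)

for (nm in names(results)) {
  cat(nm, "BIC:", results[[nm]]$bic, "BDe:", results[[nm]]$bde, "\n")
}
# plot(results$n100$bic_net) draws a learned structure with base graphics.
```

Note that bnlearn reports BIC rescaled so that, like BDe, larger values indicate a better fit.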
4.2) Based on the results obtained for the above question (Q4.1), discuss how the BIC score compares with the BDe score for different sample sizes, in terms of the structure and score of the learned network.
4.3)
a) Find the Bayesian network structures using the full dataset with both the BIC and BDe scores. Show the scores and the networks obtained.
b) Compare the networks obtained above (in Q4.3.a) for the BIC and BDe scoring methods with the true network structure. Use the “compare()” function and the “graphviz.compare()” function available in the “bnlearn” R package to perform these comparisons, and comment.
c) Fit the data to the network obtained using the BIC score in Q4.3.a in order to compute the conditional probability distribution table entries (CPD table values). Show the CPD table entries obtained for the variable “ECO2”.
d) Use the fitted network obtained in Q4.3.c to find the probability
P(BP="HIGH" | STKV="LOW", HR="NORMAL", SAO2="NORMAL").
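The comparison, fitting and query steps could look roughly like the sketch below, which again uses learning.test as a stand-in for alarm. The reference DAG is assumed here to be the generating structure of learning.test, and the event and evidence in the query are arbitrary illustrations. cpquery() estimates the probability by sampling, so the result is approximate and depends on the seed.

```r
library(bnlearn)

data(learning.test)  # stand-in data set shipped with bnlearn

# Reference DAG, assumed here to be the generating structure of learning.test.
reference <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
learned <- hc(learning.test, score = "bic")

# Arc-level comparison: true positives, false positives, false negatives.
cmp <- compare(target = reference, current = learned)
print(unlist(cmp))
if (requireNamespace("Rgraphviz", quietly = TRUE)) graphviz.compare(reference, learned)

# Parameter learning (maximum likelihood by default) and one node's CPT.
fitted <- bn.fit(learned, learning.test)
print(fitted$B)

# Approximate conditional probability by logic sampling.
set.seed(1)
p <- cpquery(fitted, event = (B == "b"), evidence = (A == "a"), n = 1e5)
print(p)
```

If exact rather than sampled answers are preferred, one option is to convert the fitted network with as.grain() and query it through gRain.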
[Marks (3*4) + 3 + (4+3+2+2) = 26]
Q5) Research based questions (Practical applications in real world – Bayesian network) [38 Marks]
This is an HD (High Distinction) level question. Students who are targeting an HD grade should answer this question (in addition to all the above questions). For others, this question is optional. This question aims to demonstrate your expertise in the subject area and your ability to do your own research in a related area.
a) Download the article from the link provided below, read it, and answer the following questions. The article provides a real-life case study on creating and using a Bayesian network for road accident data analysis.
Ali Karimnezhad & Fahimeh Moradi (2017), Road accident data analysis using Bayesian networks, Transportation Letters, 9:1, 12-19, DOI: 10.1080/19427867.2015.1131960
Web: https://www.tandfonline.com/doi/full/10.1080/19427867.2015.1131960
Note that you will be able to download this paper via the Deakin library using your Deakin credentials (username and password). (https://www.deakin.edu.au/library/help/add-browser-bookmarklet)
i) Describe the dataset used for their analysis. What variables are used? Are the variables numerical, categorical, or mixed? How many records of data have been used?
ii) What is the name of the algorithm used for learning the Bayesian network structure? What software tool has been used to build and visualize the Bayesian network? Provide a web link to that software.
iii) Read the section titled “Parameter learning in the road accident network” in that paper, then extract and report the following probability values that the authors computed:
I. The probability of being injured while wearing a seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license.
II. The probability of death while not wearing a seat belt, knowing that the driver has a diploma degree and a type 2 driving license.
III. The probability of being injured while not wearing a seat belt, knowing that the driver has a diploma degree and a type 2 driving license.
IV. The probability of death while wearing a seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license.
V. Based on the probability values obtained above, what conclusions are made?
b) Read the following article, which explains modeling air pollution, climate and health data using a Bayesian network.
Vitolo C., Scutari M., Ghalaieny M., Tucker A., & Russell A. (2018). Modeling air pollution, climate, and health data using Bayesian Networks: A case study of the English regions. Earth and Space Science, 5, 76–88. https://doi.org/10.1002/2017EA000326
Note that you will be able to download this paper via the Deakin library using your Deakin credentials (username and password). (https://www.deakin.edu.au/library/help/add-browser-bookmarklet)
This paper uses pollution data, weather data and health data to produce a Bayesian network for studying the link between pollution and health. It uses big data for the analysis. Table 1 of the paper explains the variables considered in the analysis.
In this task you will implement and perform similar modelling using a small data set from Australia. You will use pollution data, weather data and health data from Australia to produce the Bayesian network. You can choose either a state or the whole of Australia for your analysis. Instead of using big data, you should use a small data set (of a size of your choice) that can be loaded and run on your computer. So, choose the data size wisely.
You need to perform the following tasks and prepare a report explaining the details of the full implementation and results. You should also upload the programming code that you used to implement and produce your results, along with instructions that explain how to run the code, including all of the relevant packages/libraries to use (do not upload packages; provide only the links/library names needed so that they can be loaded and the code executed).
· Find the appropriate data and clean them (see below for some suggestions). In the report, you should clearly explain what data you used, how you cleaned them, how you handled any missing data, and any other processing performed in preparing the data. Include details about the period of data considered.
· Consider no more than 20 variables for your task. Explain the variables you have chosen in the report. You can either convert the continuous variables into discrete variables and perform the analysis (produce the Bayesian network), or use a mix of continuous and categorical variables for your analysis. This is your choice, and it should be explained clearly in the report.
· Perform an exploratory analysis on the variables you have selected. Provide some relevant visualizations, such as histogram plots, and summary statistics for the variables considered (relevant results may be presented in table form), and briefly describe them.
· Perform appropriate Bayesian structure learning and parameter learning to produce the Bayesian networks using your data. You may choose to include blacklists (i.e., using prior knowledge to guide the learning by excluding certain edges) in your structure learning process, if needed. Experiment with at least two methods and compare the Bayesian networks produced. You may choose to use recent Bayesian structure learning algorithms published in academic journal/conference papers in your analysis. Note that including more of the recent/latest algorithms will increase the chances of scoring more marks for this task. Implement the relevant code in R and produce the Bayesian network. You may use any existing code and modify it to suit your needs, but proper referencing and comments should be included, along with a clear explanation of how it is used and where the changes are made. The code file should be uploaded along with the submission. The report should have clear explanations (technical details) of the algorithms used.
· The report should explain the methods used to produce the Bayesian network, the settings/parameters/metrics used, if any, the results obtained, including the Bayesian network structure, and a clear discussion of the obtained network. The report should have enough detail that the results can be reproduced exactly from the reported details.
· The report for this question, Q5 (b), should not exceed 4 pages, including figures, tables and references. Note that the report should be clearly and neatly written and presented with proper subheadings, including appropriate tables, diagrams, plots, results, etc. The paper above is a good example of how to present the details clearly and professionally.
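As one possible starting point for the structure learning task, the sketch below discretises continuous data, applies a blacklist, and compares two score-based learners from bnlearn (hc() and tabu()). It runs on gaussian.test, a continuous dataset shipped with bnlearn, purely as a stand-in for your own pollution, weather and health data. The blacklisted arc G → A is a hypothetical piece of prior knowledge.

```r
library(bnlearn)

data(gaussian.test)  # continuous stand-in data shipped with bnlearn

# Discretise every column into three quantile-based bins.
d <- discretize(gaussian.test, method = "quantile", breaks = 3)

# Hypothetical prior knowledge: forbid the arc G -> A.
bl <- data.frame(from = "G", to = "A")

# Two score-based learners run with the same blacklist.
net_hc <- hc(d, score = "bic", blacklist = bl)
net_tabu <- tabu(d, score = "bic", blacklist = bl)

# Structural Hamming distance between the two learned structures.
distance <- shd(net_hc, net_tabu)
print(distance)

# Parameter learning on the chosen structure.
fitted <- bn.fit(net_hc, d)
```

Constraint-based or hybrid learners could be swapped in for one of the two methods. Whichever you choose, the report should state the settings used (score, number of bins, blacklist) so the results can be reproduced.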
You are free to choose a suitable time period for your analysis, depending on data availability, for example three years.
The following links might be helpful for finding the relevant data (do your own search, and you are free to choose other relevant sites as appropriate for your data and analysis; remember to explain them in the report with references):
Pollution data:
· https://aqicn.org/map/australia/
· Choose the number of stations wisely, depending on the region you consider for analysis.
Weather data:
· http://www.bom.gov.au/climate/data/
Health data:
· Underlying causes of death (Australia) https://www.abs.gov.au/statistics/health/causes-death/causes-death-australia/2019/3303_1%20Underlying%20causes%20of%20death%20%28Australia%29.xlsx
You can consider the ICD-10 codes (International Classification of Diseases codes) for cardiovascular-pulmonary diseases (CVD), which are “J00-J99”. This code range covers diseases of the respiratory system. For example, in Table 1.2 (tab) of the Excel sheet (from the above link), you can see yearly data (see row number 786) for diseases of the respiratory system.
NOTE: Your report for all of the above questions must be written in your own words. If text is copied directly from the paper/reference text, or if any sign of collusion is detected, zero marks will be given and the case will be reported for plagiarism and collusion processing.