Form of assessment: Individual work / Group work
For group work assessment which requires members to submit both individual and group work aspects for the assignment, the work should be submitted as: Consolidated single document / Separately by each member
Number of assignment copies required: 1 / 2 / Other
Assignment to be presented in the following format:
- On-line submission
- Stapled once in the top left-hand corner
- Glue bound
- Spiral bound
- Placed in an A4 ring-bound folder (not lever arch)
Note: Students submitting work on A3/A2 boards must contain the work in a suitable protective case to ensure any damage to the work is avoided.
Soft copy:
- CD (to be attached to the work in an envelope or purpose-made wallet adhered to the rear)
- USB (to be attached to the work in an envelope or purpose-made wallet adhered to the rear)
- Soft copy not required
CN7031 - Big Data Analytics
Group assignment
2020-21 Academic Year
This coursework (CRWK) must be attempted in groups of 4 or 5 students. It is divided into two sections: (1) Big Data analytics on a real case study and (2) a group presentation. All group members must attend the presentation, which will be held online through Microsoft Teams. If you do not attend on the presentation date with a video call, you will fail the module.
The overall mark for the CRWK comes from two main activities as follows:
1- Big Data Analytics report (around 3,000 words, with a tolerance of ± 10%) in HTML format (60%)
2- Presentation (40%)
Marking Scheme
IMPORTANT: you must use the CRWK template in HTML format; otherwise your submission will be counted as plagiarism and your group mark will be zero. Please refer to the “THE FORMAT OF FINAL SUBMISSION” section.
Good Luck!
Big Data Analytics using Spark
(1) Understanding the Dataset: CSE-CIC-IDS2018
This dataset was originally created by the University of New Brunswick for analyzing DDoS data. You can find the full dataset and its description here. The dataset is based on logs of the university's servers, which recorded various DoS attacks throughout the publicly available period, yielding 80 attributes with a total size of 6.40 GB. We will use about 2.6 GB of the data so that it can be processed on PCs restricted to 4 GB of RAM. Download it from here.
When writing machine learning or statistical analyses for this data, note that the Label column is arguably the most important portion of the data, as it determines whether the packets sent are malicious or not.
a) The features are described in the “IDS2018_Features.xlsx” file on the Moodle page.
b) The labels are as follows:
- “Label”: normal traffic
- “Benign”: susceptible to DoS attack
c) In this coursework, we use more than 8.2 million records with a size of 2.6 GB. As big data specialists, we should first read and understand the features, then apply modeling techniques. If you want to see a few records of this dataset, you can use [1] Hadoop HDFS and Hive, [2] Spark SQL, or [3] RDDs to print a few records for your understanding.
(1) Big Data Query & Analysis using Spark SQL [30 marks]
This task uses Spark SQL to convert big raw data into useful information. Each member of a group should implement 2 complex SQL queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically, and interpret your findings briefly. You can use https://spark.apache.org/docs/3.0.0/sql-ref.html for more information.
What do you need to put in the HTML report per student?
1. At least two Spark SQL queries.
2. A short explanation of the queries.
3. The working solution, i.e., a plot or table.
Tip: The mark for this section depends on the complexity of your queries; a simple SELECT query, for instance, will not earn a full mark.
(2) Advanced Analytics using PySpark [60 marks]
In this section, you will conduct advanced analytics using PySpark.
3.1. Analyze and Interpret Big Data using PySpark (45 marks)
Every member of a group should analyze the data through 3 analytical methods (e.g., advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically; apply tooltip text, a legend, a title, X-Y labels, etc. accordingly.
Note: a working solution without system or logical errors is required for a good/full mark.
3.2. Design and Build a Machine Learning (ML) technique (15 marks)
Every member of a group should go over https://spark.apache.org/docs/3.0.0/ml-guide.html and apply one ML technique. You can apply one of the following approaches: Classification, Regression, Clustering, Dimensionality Reduction, Feature Extraction, Frequent Pattern Mining, or Optimization. Explain and evaluate your model and present its results numerically and/or graphically.
Note: If there are 4 students in a group, you should develop 4 different models. If two members have a similar model, the mark will be zero.