ELEC-E7130 - Internet Traffic Measurements and Analysis, Lecture, 7.9.2022-5.12.2022
This course space end date is set to 05.12.2022.
Assignment 6: Distributions and Sampling
Prerequisites
To complete this assignment, students need prior knowledge of R or Python and of statistics, and must know how to use the available libraries to process, analyze, and plot data.
If you lack the relevant skills, you may want to:
- Take a look through the related slides.
- Refer to earlier assignments, where you can learn some R and Python.
- Read the supporting materials.
Learning outcomes
At the end of this assignment, students should be able to
- Understand the purpose of using distributions
- Get to know about the different probability distributions available
- Find the best-fitting distribution for an unknown data set through fitting and related techniques
- Validate the distribution through appropriate plots
- Have a good understanding of how to sample their data set to simplify computations
Introduction
This assignment covers the main topics related to probability distributions: fitting different distributions, validating the fitted model, and taking the first steps in sampling to appreciate how it simplifies computations. The assignment contains three tasks:
- Task 1: Introduction to distribution
- Task 2: Distributions
- Task 3: Sampling
Through these tasks, students should understand how to find a distribution for an unknown data set by fitting, and learn the importance, advantages, and disadvantages of using samples taken from a data set.
All data can be found in the sampling-data.zip archive located in the directory /work/courses/unix/T/ELEC/E7130/general/r-data or, using the environment variable, $RDATA/sampling-data.zip (extracted into the directory $RDATA/sampling/) on Aalto IT computers.
Note: You may type the command kinit before accessing the directory to avoid permission issues.
$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
$ ...
Task 1: Introduction to distribution
In the first task, answer the following questions:
- Define the concept of a distribution in terms of statistical data and explain why distributions are important.
- Name at least three continuous probability distributions, explaining their respective parameters and typical applications.
- Briefly explain what each of the following goodness-of-fit plots shows and how it is used to validate a model:
  - Probability-Probability (P-P) plot
  - Quantile-Quantile (Q-Q) plot
  - Comparison of probability density (empirical and theoretical)
  - Comparison of cumulative distribution function (empirical and theoretical)
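As a rough illustration of how the points of a P-P plot and a Q-Q plot can be computed, here is a minimal Python sketch that uses only the standard library. The data are synthetic, and the helper name `pp_qq_points` and the plotting positions (i - 0.5) / n are assumptions made for this example, not part of the course material. Plotting the returned pairs (for example with matplotlib) gives the two plots. Points close to the diagonal suggest a good fit.

```python
import random
from statistics import NormalDist


def pp_qq_points(data, dist):
    """Return (pp, qq) point lists comparing data with a fitted distribution.

    pp: pairs (theoretical CDF at x_i, empirical probability of x_i)
    qq: pairs (theoretical quantile at p_i, observed value x_i)
    """
    xs = sorted(data)
    n = len(xs)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # so the inverse CDF is always defined.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    pp = [(dist.cdf(x), p) for x, p in zip(xs, probs)]
    qq = [(dist.inv_cdf(p), x) for x, p in zip(xs, probs)]
    return pp, qq


random.seed(1)
# Synthetic data for illustration only (not one of the course data sets).
sample = [random.gauss(10.0, 2.0) for _ in range(500)]

# Estimate the normal parameters from the sample mean and standard deviation.
fitted = NormalDist.from_samples(sample)
pp, qq = pp_qq_points(sample, fitted)

# Largest vertical distance from the diagonal in the P-P plot.
max_pp_gap = max(abs(theoretical - empirical) for theoretical, empirical in pp)
print(f"mean={fitted.mean:.2f} stdev={fitted.stdev:.2f} max P-P gap={max_pp_gap:.3f}")
```

The same helper works with any object that offers `cdf` and `inv_cdf` methods, so other candidate distributions could be checked in the same way.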
Report, task 1:
- Discussion on concepts related to distribution.
Task 2: Distributions
The present task addresses the modeling of measurement data with distributions.
Note: Finding a suitable distribution to fit the data has several benefits. For example, a distribution concisely describes the underlying data values and can also be used to generate new data when a larger data set is needed. Furthermore, some learning algorithms assume that the data follow a particular distribution, so knowing the distribution helps us understand the low-level details of how those algorithms work.
Download the following three data sets, which are drawn from distributions presented in the lectures:
distr_a.txt
distr_b.txt
distr_c.txt
Study each dataset to choose a good distribution for it.
- Estimate the parameters of distribution with software.
- Validate your model by using appropriate plots
- Explain your modeling choices and why.
Tips:
- Useful Python functions could be distfit().
- Useful R functions could be fitdist().
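The estimate-then-validate workflow can be sketched without any fitting library. The example below is a minimal sketch under stated assumptions: the data are synthetic, only two candidate distributions (normal and exponential) are considered, and the helper names `ks_distance` and `fit_candidates` are hypothetical. It estimates the parameters of each candidate and ranks the candidates by the largest gap between the empirical and the model CDF (the Kolmogorov-Smirnov distance). Dedicated tools such as distfit() or fitdist() cover many more distributions.

```python
import math
import random
from statistics import NormalDist, fmean


def ks_distance(data, cdf):
    """Largest gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


def fit_candidates(data):
    """Estimate parameters for two candidate models; return {name: (params, cdf)}."""
    # Normal: sample mean and sample standard deviation.
    normal = NormalDist.from_samples(data)
    # Exponential: the maximum-likelihood rate is 1 / sample mean.
    rate = 1.0 / fmean(data)
    return {
        "normal": ({"mean": normal.mean, "stdev": normal.stdev}, normal.cdf),
        "exponential": (
            {"rate": rate},
            lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0,
        ),
    }


random.seed(42)
# Synthetic data for illustration only (not the contents of distr_a.txt).
data = [random.expovariate(0.5) for _ in range(1000)]

models = fit_candidates(data)
scores = {name: ks_distance(data, cdf) for name, (params, cdf) in models.items()}
best = min(scores, key=scores.get)
print("best candidate:", best, "parameters:", models[best][0])
print("KS distances:", scores)
```

A small distance alone does not replace the visual checks: the P-P, Q-Q, density, and CDF comparison plots remain the main validation tools requested in this task.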
Report, task 2:
For each dataset:
- What distribution was chosen and why.
- Parameters of distribution.
- Validation with plots.
- Explain your choices.
Reminder: Document the process and operations.
Task 3: Sampling
This task lets you practice random sampling and analyze its results. In general, sampling serves several purposes, such as handling an enormous amount of data or balancing the number of class instances for machine learning.
Download the file flowdata.txt, which contains the following information for a set of flows, as seen before:
- Source IP (Anonymized)
- Destination IP (Anonymized)
- Protocol
- Is the port number valid
- Source port
- Destination port
- Number of packets
- Number of bytes
- Number of flows
- First packet arrival time
- Last packet arrival time
Complete the following tasks:
- Overview of the data set
  - Select a random sample of 1000 flows.
  - Produce a parallel plot to get an overview of the data.
- Overview of the data set with source port 80 (WWW)
  - Select the flows with source port 80.
  - Produce a parallel plot to get an overview of the data.
- Number of bytes against packets
  - Create a scatterplot (bytes vs packets) of the original data set and use a logarithmic scale if needed.
  - Create a scatterplot (bytes vs packets) of the 1000-flow random sample and use a logarithmic scale if needed.
  - How are they related?
  - What is the maximum average packet size for both (original data set and 1000-flow random sample)?
  Note: The average packet size of a flow is the number of bytes in the flow divided by its number of packets, that is:
  Average_packet_size_of_a_flow = total_bytes_of_a_flow / total_packets_of_a_flow
- Average throughput
  - Calculate the average throughput of the connections. Clock resolution introduces some challenges: what can be said about the throughput of the flows that are transferred in zero time?
  Note: The average throughput of a flow is the number of bytes transferred divided by the transfer time, that is, the difference between the arrival time of the last packet and the first packet.
  - Study the average throughput of both the original data set and the 1000-flow random sample. State your own analysis of the data.
- Draw conclusions from your own observations on the data analyzed (original and random sample) and on the usefulness of the graphs used.
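The two per-flow quantities defined in the notes above can be computed as in the following minimal Python sketch. The three flows are made-up values, the tuple layout (packets, bytes, first arrival, last arrival) is a hypothetical simplification of flowdata.txt, and the timestamps are assumed to be in seconds. Returning `None` for a zero-duration flow is one possible way to mark the throughput as undefined instead of dividing by zero.

```python
def avg_packet_size(n_bytes, n_packets):
    """Average packet size of a flow: total bytes / total packets."""
    return n_bytes / n_packets


def avg_throughput(n_bytes, first_ts, last_ts):
    """Average throughput in bytes per second, or None if it is undefined.

    A flow whose first and last packets carry the same timestamp has zero
    measured duration (for example a single-packet flow, or a flow shorter
    than the clock resolution), so no finite throughput can be computed.
    """
    duration = last_ts - first_ts
    if duration <= 0:
        return None
    return n_bytes / duration


# Made-up flows: (packets, bytes, first packet arrival, last packet arrival).
flows = [
    (10, 15000, 0.0, 2.0),
    (1, 40, 5.0, 5.0),  # zero transfer time
    (4, 2400, 1.0, 1.5),
]

sizes = [avg_packet_size(b, p) for p, b, first, last in flows]
throughputs = [avg_throughput(b, first, last) for p, b, first, last in flows]
defined = [t for t in throughputs if t is not None]

print("max average packet size:", max(sizes))
print("throughputs:", throughputs)
print("flows with undefined throughput:", len(throughputs) - len(defined))
```

Whether zero-duration flows are dropped, reported separately, or assigned a bound based on the clock resolution is an analysis decision to justify in the report.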
Tips:
- Useful Python functions could be pandas.DataFrame.sample(), pandas.plotting.parallel_coordinates(), matplotlib.pyplot.plot().
- Useful R functions could be sample(), ggparcoord(), plot().
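To show what drawing a random sample and comparing it with the full data set can look like, here is a minimal sketch that uses only the Python standard library. The flows are synthetic (packets, bytes) pairs generated for this example, and the helper names are hypothetical. With pandas, `DataFrame.sample(n=1000)` plays the role that `random.sample` plays here.

```python
import math
import random

random.seed(7)

# Synthetic (packets, bytes) pairs for illustration only.
flows = []
for _ in range(20000):
    packets = random.randint(1, 500)
    mean_size = random.uniform(40, 1500)  # bytes per packet
    flows.append((packets, int(packets * mean_size)))

# Simple random sample of 1000 flows, drawn without replacement.
sample = random.sample(flows, 1000)


def max_avg_packet_size(rows):
    """Maximum of bytes / packets over the given flows."""
    return max(n_bytes / n_packets for n_packets, n_bytes in rows)


def log_points(rows):
    """log10-transformed (packets, bytes) pairs for a scatterplot."""
    return [(math.log10(p), math.log10(b)) for p, b in rows]


full_max = max_avg_packet_size(flows)
sample_max = max_avg_packet_size(sample)
points = log_points(sample)

print(f"max average packet size: full={full_max:.1f} sample={sample_max:.1f}")
```

Because the sample is a subset of the original data, its maximum can never exceed the maximum of the full data set. Extreme values are the statistics most likely to be missed by sampling, which is worth keeping in mind when comparing the two scatterplots.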
Report, task 3:
- Plots requested above with commands used to generate them
- Analysis of how bytes and packets are related.
- Throughput analysis.
- Conclusions
Grading standard
To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get a maximum of 15 points.
You can get up to 30 points for this assignment:
Task 1
- Answer the questions appropriately. (9p)
Task 2
- Choose suitable distributions for the three data sets and explain why. (5p)
- Determine their parameters. (3p)
- Use graphs to verify your conjecture and explain them. (3p)
Task 3
- Plot two parallel plots as required. (3p)
- Draw a scatter plot as required and analyze. (3p)
- Calculate average throughput and analyze. (2p)
- Summary of observations on the data. (2p)
The quality of the report (bonus 2p)
Assignment instructions
Your submission must contain the following (please do not include the original data in your submission):
- A zip file that includes your codes and scripts.
- A PDF file as your report.
Your report must have:
- A cover page indicating your name, student ID and your e-mail address.
- A description of the measurements, a summary of the results, and conclusions based on the results.
- An explanation of each problem, how you solved it, and why you solved it that way.
- 26 October 2022, 3:27 PM