Home Schools Course feedback Service Links Intelliboard

This course space end date is set to 05.12.2022 Search Courses: ELEC-E7130

Assignment 6: Distributions and Sampling

Due: Wednesday, 9 November 2022, 10:00 PM

Prerequisites

To complete this assignment, students require to have prior knowledge about how to use R or Python, statistics, and how to leverage the libraries available to process, analyze and plot data.

If you lack the relevant skills, you may want to
1. Take a look through related slides.
2. Refer earlier assignments where you can learn some R and Python knowledge.
3. Read the supporting materials.

Learning outcomes

At the end of this assignment, students should be able to

Understand the purpose of using distributions
Get to know about the different probability distributions available
Find the best distribution of unknown datasets by fittings and more
Validate the distribution through appropriate plots
Have a good understanding of how to sample their data set for simplicity in computations

Introduction

The present assignment covers the main topics related to probability distributions, fitting different distributions, validating the model of the distribution and the first steps to sampling to be aware of the utility and simplicity of computations. This assignment contains three tasks:

Task 1: Introduction to distribution
Task 2: Distributions
Task 3: Sampling

Thereby, the students must understand how to find a distribution of unknown datasets by fittings and more as well as learn the importance, advantage, and disadvantage of using samples taken from the data set.

All data can be found from sampling-data.zip archive located in the directory /work/courses/unix/T/ELEC/E7130/general/r-data or, using the environment variable, $RDATA/sampling-data.zip (extracted into the directory $RDATA/sampling/) at Aalto IT computers.

Note: You may type the command kinit before accessing to the directory to avoid issues related to the permissions.

$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
...

Task 1: Introduction to distribution

In the first task, answer the following questions:

Define the concept distribution in terms of statistical data and why are distributions important.
Mention at least three continuos probability distributions explaining their respective parameters and typical applications.
Briefly explain what the next goodness-of-fit plots consist of in order to validate a model.
- Probability-Probability (P-P) plot
- Quantile-Quantile (Q-Q) plot
- Comparison of probability density (empirical and theoretical)
- Comparison of cumulative distribution function (empirical and theoretical)

Report, task 1:

Discussion on concepts related to distribution.

Task 2: Distributions

The present task addresses the modeling of measurement data with distributions.

Note: There are several benefits to find suitable distribution to fit the data. For example, distributions will briefly describe the underlying data values and could also be utilized to generate new data to have a larger dataset in certain cases. Furthermore, some learning algorithms assume some distribution to fit the data, which can help us understand the low-level details of how the learning algorithms work.

Download the three data sets are drawn from certain distributions presented at the lectures, which are as follows:
- distr_a.txt
- distr_b.txt
- distr_c.txt
Study each dataset to choose a good distribution for it.
- Estimate the parameters of distribution with software.
- Validate your model by using appropriate plots
- Explain your modeling choices and why.

Tips: - Useful Python functions could be distfit(). - Useful R functions could be fitdist().

Report, task 2

For each dataset:

What distribution was chosen and why.
Parameters of distribution.
Validation with plots.
Explain your choices.

Reminder: Document the process and operations.

Task 3: Sampling

This task aims to allow you to practice random sampling and analyze its results. In general, sampling serves some purposes, such as handling an enormous amount of data or balancing the number of class instances for machine learning.

Download the file flowdata.txt which contains the following information for a set of flows as seen before:

Source IP (Anonymized)
Destination IP (Anonymized)
Protocol
Is the port number valid
Source port
Destination port
Number of packets
Number of bytes
Number of flows
First packet arrival time
Last packet arrival time

Complete the following tasks:

Overview of the data set
- Select 1000 random sample data.
- Produce a parallel plot to get an overview of the data.
Overview of the data set with source port 80 (WWW)
- Select the flows with source port 80.
- Produce a parallel plot to get an overview of the data.
Number of bytes against packets
- Create a scatterplot (bytes vs packets) of the original data set and use logarithmic data if needed.
- Create a scatterplot (bytes vs packets) of 1000 random sample data and use logarithmic data if needed
- How are they related?
- What is the maximum average packet size for both (original data set and 1000 random sample data)?
Note: The average packet size of a flow is calculated with the number of bytes in a flow divided by its number of packets, that is, as the formula below:
Average_packet_size_of_a_flow = total_bytes_of_a_flow / total_packets_of_a_flow
Average throughput
- Calculate the average throughput of the connections. Clock resolution introduces some challenges, what can be said on the throughput of the flows that are transferred in zero time?
Note: The average throughput of a flow is the number of bytes transferred divided by the transfer time, that is, the difference between the arrival time of the last packet and the first packet.
- Study the average throughput both the original data set and 1000 random samples data. State your own analysis on the data.

Draw conclusions about your own observations on the data analyzed (original and random samples) and the usefulness of the graphs used.

Tips: - Useful Python functions could be pandas.sample(), pandas.plotting.parallel_coordinates(), matplotlib.pyplot.plot(). - Useful R functions could be sample(), ggparcoord(), plot().

Report, task 3

Plots requested above with commands used to generate them
Analysis of how bytes and packets are related.
Throughput analysis.
Conclusions

Grading standard

To pass this course, you need to achieve at least 15 points in this assignment. And if you submit the assignment late, you can get a maximum of 15 points.

You can get up to 30 points for this assignment:

Task 1

Answer the questions appropriately. (9p)

Task 2

Choose suitable distributions for the three data sets and explain why. (5p)
Determine their parameters. (3p)
Use graphs to verify your conjecture and explain them. (3p)

Task 3

Plot two parallel plots as required. (3p)
Draw a scatter plot as required and analyze. (3p)
Calculate average throughput and analyze. (2p)
Summary of observations on the data. (2p)

The quality of the report (bonus 2p)

The instruction of assignment

For the assignment, your submission must contain (Please don’t contain original data in your submission):

A zip file that includes your codes and scripts.
A PDF file as your report.

Regarding the report, your report must have:

A cover page indicating your name, student ID and your e-mail address.
The report should include a description of measurements, a summary of the results and conclusions based on the results.
An explanation of each problem, explain how you solved it and why you did it.

distribution_and_sampling.pdf
26 October 2022, 3:27 PM

ELEC-E7130 - Internet Traffic Measurements and Analysis, Lecture, 7.9.2022-5.12.2022