ELEC-E7130 - Internet Traffic Measurements and Analysis, Lecture, 7.9.2022-5.12.2022
Assignment 7: Sampling
Prerequisites
To complete this assignment, students are required to have prior knowledge of how to use R or Python, statistics, and how to leverage the available libraries to process, analyze, and plot data.
If you lack the relevant skills, you may want to
- Take a look through related slides.
- Refer to earlier assignments, where you can pick up some R and Python knowledge.
- Read the supporting materials.
Learning outcomes
At the end of this assignment, students should be able to
- Get to know in more detail the utility of sampling and sampling distributions.
- Have a good understanding of how to sample their data set for other applications.
- Estimate the true mean by the sample mean in different ways from both a stored data set and real-time data.
- Prepare a data set for real machine learning cases: take care of the data before it is handled, so that the process of selecting the best model through training, evaluation, and prediction can follow.
Introduction
This assignment contains three tasks covering different sampling applications: off-line estimation (collecting samples from a stored data set), on-line estimation (collecting samples in real time, i.e., from streaming data), and stratified random sampling, one of the sampling types used for machine learning purposes. Please read all instructions before starting, because it is helpful to identify common work.
- Task 1: Sampling and distributions (off-line sampling)
- Task 2: High variability (on-line sampling)
- Task 3: Data pre-processing for ML purposes
All data for tasks 1 and 2 can be found in the sampling-data.zip archive on the assignment page and in $RDATA/sampling-data.zip (extracted to the $RDATA/sampling/ directory) at Aalto IT computers, while the file for task 3 is located in the /work/courses/unix/T/ELEC/E7130/general/ml-data directory, also reachable via the MLDATA environment variable as a path.
Task 1: Sampling and distributions (off-line sampling)
The first task aims to familiarize oneself with sampling and sampling distributions, and with the effect of sample size on statistics computed from a stored data set (off-line sampling).
Download the file sampling.txt, which contains session inter-arrival times, to study estimation of the mean inter-arrival time based on different sample sizes.
Complete the next action points:
- Original data
- Plot the histogram and compute the mean
- 5000 random samples
- Select 5000 random samples from the original data, i.e., you should have a vector of 5000 values
- Plot the histogram and compute the mean
- Sample mean statistic with different n samples
- Select n random elements from the data 10000 times and compute the mean of these n values each time. As a result, you should have a vector of 10000 values, where each of them is the mean of n random elements. Repeat for:
- n = 5
- n = 10
- n = 100
- For each scenario of n:
- Plot the histogram of these 10000 values as well as Q-Q plot (or any of the goodness-of-fit plots) to study the values against normal distribution.
- Compute the mean and standard deviation of these 10000 values.
Note: Each mean contained in the vectors represents a different result you could get for your statistic in a random sample, and can be seen as a sample from the sampling distribution of the sample mean statistic for n samples.
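As an illustration of how such sample-mean vectors could be produced, here is a minimal Python sketch (numpy, matplotlib, and scipy assumed). Synthetic exponential data stands in for sampling.txt so the snippet runs stand-alone; in the assignment you would load the provided file instead.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # write plot files, no display needed
import matplotlib.pyplot as plt
from scipy import stats

# In the assignment, load the provided file instead:
#   data = np.loadtxt("sampling.txt")
rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=50_000)   # stand-in inter-arrival times

def sample_means(data, n, repeats=10_000, rng=rng):
    """Draw `repeats` random samples of size n; return the vector of means."""
    return np.array([rng.choice(data, size=n).mean() for _ in range(repeats)])

for n in (5, 10, 100):
    means = sample_means(data, n)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(means, bins=50)
    ax1.set_title(f"sampling distribution of the mean, n={n}")
    stats.probplot(means, dist="norm", plot=ax2)     # Q-Q plot vs normal
    fig.savefig(f"sampling_dist_n{n}.png")
    print(f"n={n:3d}: mean={means.mean():.4f}  sd={means.std(ddof=1):.4f}")
```

As n grows, the histogram narrows and the Q-Q plot straightens out (the central limit theorem at work), while the standard deviation of the means shrinks roughly as 1/√n.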
Discuss the following points:
- Explain the effects of sample size on the sampling distribution and on the accuracy of the estimate, based on both the results (mean and standard deviation) and the plots obtained for the different values of n and for the 5000 random samples, compared with the original data.
- Compute and analyze the sampling error (single mean) and the variance for each scenario, and explain the differences between the scenarios.
- What can you say about the sampling bias for each scenario?
Note: The sampling error for a single mean is the difference between a sample statistic (x̄) and the corresponding population parameter (μ), in this case the mean: sampling error = x̄ − μ
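The behaviour of the sampling error across scenarios can be checked numerically. A small sketch (numpy assumed, with a synthetic exponential population standing in for the data set):

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # stand-in data set
mu = population.mean()                                 # "true" mean

for n in (5, 10, 100):
    means = np.array([rng.choice(population, n).mean() for _ in range(10_000)])
    errors = means - mu            # sampling error = x̄ − μ for each sample
    print(f"n={n:3d}  mean error={errors.mean():+.4f}  "
          f"var of sample mean={means.var(ddof=1):.4f}")
```

The mean error hovers near zero for every n (the sample mean is an unbiased estimator), while the variance of the sample mean falls roughly as σ²/n.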
Tips:
- Useful Python libraries could be pandas, matplotlib, seaborn, fitdist, scipy, and statistics (for variance()). Besides, sm.qqplot() can be used to plot a Q-Q plot.
- Useful R functions could be hist(), fitdistr(), rnorm(), qqplot(), mean(), sd(), and var().
Report, task 1
- Histogram plots of each case.
- Mean values of each case
- Q-Q plots, mean and standard deviations for the different cases of n.
- Discuss and draw observations of each case in terms of sample size, bias, variability, etc.
Reminder: Add commands that generated the plots and how statistics are computed.
Task 2: High variability (on-line sampling)
This task attempts to demonstrate the effects of high variability in network measurements by estimating means with on-line sampling, i.e., from "real-time" data; the previous task focused on off-line estimation from a stored data set. High variability can, for example, make long-term estimates unpredictable.
Download the file flows.txt, which again contains values of flow lengths in packets and in bytes captured from a network.
Complete the following action points:
Original data
- Compute the mean and median for both packets and bytes
- Plot the data set according to what you want to describe (there is no single correct plot)
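One possible starting point for the original-data statistics and plot (numpy and matplotlib assumed; the two-column packets/bytes layout of flows.txt is an assumption, and synthetic heavy-tailed data stands in for the file so the sketch runs stand-alone):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# In the assignment: pkts, byts = np.loadtxt("flows.txt", unpack=True)
rng = np.random.default_rng(7)
pkts = np.ceil(rng.pareto(1.5, 10_000) + 1)       # heavy-tailed stand-in
byts = pkts * rng.integers(40, 1500, 10_000)      # packet sizes 40..1500 B

for name, x in (("packets", pkts), ("bytes", byts)):
    print(f"{name}: mean={x.mean():.1f}  median={np.median(x):.1f}")

# Heavy tails show up clearly on a log-log CCDF
xs = np.sort(byts)
ccdf = 1.0 - np.arange(1, xs.size + 1) / xs.size
plt.loglog(xs, ccdf)
plt.xlabel("flow size (bytes)")
plt.ylabel("P(X > x)")
plt.savefig("flow_size_ccdf.png")
```

A large gap between the mean and the median is already a hint of high variability in the data.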
On-line measurement
- First of all, develop a function called running_mean to calculate mean_n, that is, the sample mean of the first n flow lengths in bytes. The function should produce the y-axis (the sample mean values) and the x-axis (the number of flows seen so far).
Hint: For example, suppose there are 6 flows (flow1, flow2, flow3, and so on) and n is 3, i.e., we calculate the sample means of the first 3 flow lengths while recording the axes:
- The first sample mean is obtained considering the first flow (flow1)
- The second sample mean is obtained by the first 2 flows (flow1, flow2)
- The third sample mean is obtained by the first 3 flows (flow1, flow2, flow3)
Note: This mimics a kind of on-line measurement; we assume that the flows depart one by one and our estimate of the mean flow size in bytes is updated each time.
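One compact way to sketch running_mean in Python (numpy assumed): the cumulative sum divided element-wise by the number of observations so far gives the sample mean after each flow.

```python
import numpy as np

def running_mean(x):
    """Cumulative sample mean: out[i] = mean(x[0], ..., x[i]).
    out is the y-axis; 1..len(x) is the x-axis (flows seen so far)."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x) / np.arange(1, x.size + 1)

print(running_mean([4, 2, 6]))   # → [4. 3. 4.]
```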
- Using running_mean, plot the mean estimate after each flow, i.e., plot the mean statistic for the first observations as a function of n. Explain your observations concerning the original data and this scenario.
- Suppose that the interesting statistic is the median instead of the mean, as running_median, in an on-line scenario where a measurement system provides you with a large number of samples every second. How would you proceed in the function to calculate median_n?
Draw your conclusions about the mean and median obtained and the plots generated for both the on-line scenario and the original data set.
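Unlike the mean, the median has no constant-memory exact update rule. A standard exact approach keeps two heaps, sketched below; note that its memory grows as O(n), so at very high sample rates an approximate streaming quantile estimator (e.g., the P² algorithm) can be preferable.

```python
import heapq

def running_median(stream):
    """Yield the median after each sample using two heaps:
    lo holds the smaller half (as a max-heap via negation),
    hi holds the larger half (a min-heap). O(log n) per update."""
    lo, hi = [], []
    for x in stream:
        if lo and x > -lo[0]:
            heapq.heappush(hi, x)
        else:
            heapq.heappush(lo, -x)
        # rebalance so that len(lo) equals len(hi) or len(hi) + 1
        if len(lo) > len(hi) + 1:
            heapq.heappush(hi, -heapq.heappop(lo))
        elif len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        yield -lo[0] if len(lo) > len(hi) else (-lo[0] + hi[0]) / 2

print(list(running_median([5, 1, 3, 2])))   # → [5, 3.0, 3, 2.5]
```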
Report, task 2
- Plots and values from point 1.
- Expression for running_mean.
- Plot of the mean estimate. Explain your observations.
- Computing median instead of mean. Derive expression.
- Observations with the results obtained by both scenarios.
Reminder: Document operations and reason your answers.
Task 3: Data pre-processing for ML purposes
The purpose of the last task is to introduce data set preparation before model selection or training. During this stage it is important to select the samples appropriately. One such technique is called stratified random sampling: the population data is divided into subgroups, known as strata, and a specific number of samples is selected from each subgroup. This ensures a balance of information across subgroups with respect to the chosen feature(s), reducing selection bias and the chance of sampling error, and giving higher accuracy than simple random sampling.
Note: Data pre-processing is the most important step in most machine learning procedures. Not having the data in a suitable form would increase the learning time, or would simply make learning impossible for the ML model.
Download the file simple_flow_data.csv, which contains simplified NetMate output with only 6 columns: source IP address, source port, destination IP address, destination port, protocol number, and duration of the flow (in microseconds).
Notes:
- The file can be found in the directory /work/courses/unix/T/ELEC/E7130/general/ml-data, or via the MLDATA environment variable as a path if you have sourced the use.sh file.
- It is important to treat the source and destination IP addresses as non-numerical values; the rest are numerical values.
Write a function to prepare the whole data set through the steps below. Furthermore, you can use the skeleton code skeleton_ml_0.py to solve the task.
1. Delete the instances that have empty values.
2. Perform stratified random sampling where:
- First, take 100 instances whose flow duration is less than 2000 microseconds.
- Then, take another 100 instances whose flow duration is more than 2000 microseconds.
- Finally, concatenate both to obtain 200 data samples in total.
3. Encode the non-numerical values, i.e., srcip and dstip.
4. Standardize the values.
5. Normalize the values between 0 and 1.
6. Return the new pre-processed data set.
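The steps above could be sketched as follows with pandas and scikit-learn. The column names follow the example table in this task; the exact names in simple_flow_data.csv may differ, and the skeleton code may structure this differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

def preprocess(path="simple_flow_data.csv", seed=42):
    df = pd.read_csv(path).dropna()                     # 1. drop empty values
    # 2. stratified random sampling: 100 short + 100 long flows
    short = df[df["duration"] < 2000].sample(100, random_state=seed)
    long_ = df[df["duration"] > 2000].sample(100, random_state=seed)
    df = pd.concat([short, long_])
    for col in ("srcip", "dstip"):                      # 3. encode IP addresses
        df[col] = LabelEncoder().fit_transform(df[col])
    scaled = StandardScaler().fit_transform(df)         # 4. standardize
    scaled = MinMaxScaler().fit_transform(scaled)       # 5. normalize to [0, 1]
    # 6. return the pre-processed data set
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)
```

Building a fresh DataFrame after scaling avoids silently truncating the scaled floats back into the original integer columns.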
Note: At the end, the data set must contain 200 instances, and it would look something like the following (rows were shuffled here):
        srcip   srcport     dstip   dstport   proto  duration
109  0.500000  0.835867  0.242857  0.006227  0.3125  0.196619
115  0.500000  0.547144  0.628571  0.000431  0.3125  0.193576
181  0.142857  0.287867  0.157143  0.278581  1.0000  0.964189
87   0.500000  0.751349  0.171429  0.159641  0.3125  0.000003
163  0.500000  0.616890  0.542857  0.006227  0.3125  0.098573
Answer the following points:
- Mention three types of probability sampling applied in ML apart from the one already mentioned.
- What is the purpose of encoding the values in ML?
- What are the differences between standardization and normalization in terms of feature scaling in ML?
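The contrast between the two feature-scaling methods is easy to see on a small vector (scikit-learn assumed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])
z = StandardScaler().fit_transform(x)   # zero mean, unit variance; unbounded
m = MinMaxScaler().fit_transform(x)     # linearly rescaled into [0, 1]
print(z.ravel())                        # centered values, some negative
print(m.ravel())                        # values 0, 1/9, 2/9, 1
```

Standardization recenters and rescales by spread (outputs can be negative or exceed 1), while min-max normalization squeezes everything into a fixed range, so an outlier like 10 compresses the remaining values toward 0.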
Tips:
- A useful Python function could be fit_transform().
- Search the documentation of the functions LabelEncoder(), StandardScaler(), and MinMaxScaler() in the scikit-learn library to perform the steps above.
Report, task 3
- Perform successfully the data pre-processing.
- Answer the questions appropriately
Reminder: Document operations and code used.
Grading standard
To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get at most 15 points.
You can get up to 30 points for this assignment:
Task 1
- Draw a histogram and Q-Q plot for each case, and calculate its mean and standard deviation. (10p)
- Discussion based on the results obtained previously. (6p)
Task 2
- Plot and calculate the average and median as required (original data). (2p)
- Write the correct running mean expression. (1p)
- Plot the estimate of the mean and state your observations. (2p)
- Write the correct running median expression and plot. (2p)
- Conclusions about the results obtained of mean and median (1p)
Task 3
- Prepare the data set for machine learning purposes (2p)
- Answer the questions appropriately (4p)
The quality of the report (bonus 2p)
The instruction of assignment
For the assignment, your submission must contain (please do not include the original data in your submission):
- A zip file that includes your codes and scripts.
- A PDF file as your report.
Regarding the report, your report must have:
- A cover page indicating your name, student ID and your e-mail address.
- A description of the measurements, a summary of the results, and conclusions based on the results.
- An explanation of each problem: how you solved it and why.
- 28 October 2022, 2:14 PM