clear

* Store the paths to the "data" and "output" folders in global macros.
* Read more: help global
global data "C:\Users\aapo.kivinen\Dropbox (Aalto)\PhD\teaching\Principles_of_Economic_Analysis\aapo\session one\data"
global output "C:\Users\aapo.kivinen\Dropbox (Aalto)\PhD\teaching\Principles_of_Economic_Analysis\aapo\session one\output"
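* Optional check, as a sketch (not part of the original workflow): print both
* globals to confirm they point to the intended folders before any file is read.
display "$data"
display "$output"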

/* Import the dataset into Stata.
 CSV files are imported with import delimited. The clear option discards any
 dataset currently in memory (without it, Stata refuses to replace data that
 are already in memory and stops with an error). The delim option sets the
 character that separates the variables; the semicolon was chosen on the
 StatFin web page.
 */
import delimited "$data\004_115c.csv", clear delim(";")

describe 
browse

* Give the variable a shorter name:
rename maintypeofactivity main_activity

* Drop a variable:
drop sex


* Recode the open-ended top age group "100 -" as "100" so that the (string) variable age contains only digits.
replace age = "100" if age == "100 -"

* Now we can convert the variable type from string to numeric.
* More info: help destring
destring age, replace
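* Optional check, as a sketch (not part of the original workflow): destring
* leaves a variable unchanged if it still contains non-numeric characters, so
* stop with an error here if age is not numeric.
confirm numeric variable age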

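* Drop the population outside the labor force: younger than 15 or older than 74.
* Stata treats missing values as larger than any number, so a missing age would also be dropped.
drop if (age < 15 | age > 74)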

* Check which main-activity categories the data contain
codebook main_activity
* Drop irrelevant group:
drop if main_activity == "0-14 years old"

* Rename so that all yearly population columns share the stub v (v4, v5, ...), as reshape requires:
rename population31dec v4


* Reshape into "long" format: one row per main_activity, age and year.
* This is a difficult concept to grasp (at least for me).
* See: help reshape
reshape long v, i(main_activity age) j(year)
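* A minimal toy illustration of reshape long, as a sketch: the variable group
* and all numbers below are made up and are not StatFin data.
* preserve/restore keeps the real dataset in memory untouched.
preserve
clear
input str10 group v4 v5
"employed" 10 11
"students" 20 21
end
* Wide layout: one row per group, one column per year index
list
reshape long v, i(group) j(year)
* Long layout: one row per group and year index, values stacked in v
list
restore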

* Rename a variable
rename v population 


* Convert the column index into the calendar year (v4 corresponds to 1987):
replace year = year + 1983
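* Optional check, as a sketch: the minimum should be 1987, assuming v4 was the
* first yearly column in the imported file.
summarize year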

* Reshape age into "wide" format: one population variable per age (population15, ..., population74)
reshape wide population, i(main_activity year) j(age)
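* Optional check, as a sketch: after reshaping to wide, each combination of
* main_activity and year should identify exactly one row. isid stops with an
* error if that is not the case.
isid main_activity year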

/*
 Aggregate the data into 10-year age bins using a forvalues loop, local macros
 and the egen function rowtotal() (the current name of the older rsum()).
 The loop runs over the values 15, 25, ..., 65. In each iteration, the local
 bottom holds the loop value and the local top is set to bottom + 9. A new
 variable is then created as the row sum of the single-year variables in that
 range, e.g. population15 to population24.
 (Don't worry if you are confused here, this is not easy.)
*/
local var = "population"
forvalues bottom = 15(10)65 {
	local top = `bottom' + 9
	* Row sum over the ten consecutive single-year variables in this bin
	egen `var'`bottom'`top' = rowtotal(`var'`bottom'-`var'`top')
}
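
* To see how the locals expand, here is a sketch of the same loop that only
* prints the variable names involved in each iteration and creates nothing.
forvalues bottom = 15(10)65 {
	local top = `bottom' + 9
	display "population`bottom'`top' = sum of population`bottom' through population`top'"
}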


* Drop the single-year variables, which are now redundant
drop population15-population74

* Save the data as a temporary file (Stata deletes it automatically when the do-file ends)
* See more: help tempfile
tempfile emp
save `emp'
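* Optional check, as a sketch: summarize the saved temporary file without
* loading it, to confirm that it was written.
describe using "`emp'", short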

* Import the second dataset into Stata
import delimited "$data\009_123x.csv", clear
* Rename so that all yearly columns share the stub v
rename currentpriceseuro v2


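* Reshape into "long" format: one row per transaction and year
reshape long v, j(year) i(transaction)

* The transaction identifier is no longer needed
drop transaction
* Convert the column index into the calendar year (v2 corresponds to 1980):
replace year = year + 1978

* Keep only the years from 1987 onwards, to match the employment data
drop if year < 1987

rename v gdp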

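* Merge with the employment statistics. Each year appears once in the GDP data
* and many times in the employment data (once per main_activity), hence 1:m.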
merge 1:m year using `emp'
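* Optional check, as a sketch: inspect how many observations matched before
* _merge is dropped (1 = GDP data only, 2 = employment data only, 3 = matched).
tabulate _merge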
drop _merge 

* Sort the dataset by main activity and year
sort main_activity year 

save "$data\cleaned_data.dta", replace