In this example, I work with a dataset of world population by country, downloaded from the internet. This lesson is based on a YouTube video [1].
using DataFrames
using CSV
wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)
Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank |
---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 |
1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 |
2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 |
3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 |
4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 |
5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 |
describe(wp)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
1 | Country | | Afghanistan | | Zimbabwe | 0 | String |
2 | Population 2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
3 | Population 2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
4 | Area (km2) | | 1.1K | | < 1 | 0 | String7 |
5 | Density (/km2) | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
6 | Growth Rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
7 | World % | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
8 | World Rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
names(wp)
8-element Vector{String}:
 "Country"
 "Population 2024"
 "Population 2023"
 "Area (km2)"
 "Density (/km2)"
 "Growth Rate"
 "World %"
 "World Rank"
wp.id = 1:nrow(wp)
first(wp, 5)
Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank | id |
---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 | Int64 |
1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 | 1 |
2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 | 2 |
3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 | 3 |
4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 | 4 |
5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 | 5 |
colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
1 | country | | Afghanistan | | Zimbabwe | 0 | String |
2 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
3 | pop2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
4 | area | | 1.1K | | < 1 | 0 | String7 |
5 | density | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
6 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
7 | world_perc | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
8 | world_rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
9 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);
Be careful: the bang (`!`) in `select!` means the function also modifies the original table! Remember that we are not making a copy here; `wp_clean` is just another name bound to the same object in memory as `wp`.
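The difference between the mutating and non-mutating versions can be sketched on a small table. This is a minimal illustration, assuming a hypothetical toy DataFrame `df` (not the real dataset); it only uses `select`, `select!`, and `ncol` from DataFrames.jl.

```julia
using DataFrames

# Hypothetical toy table standing in for `wp`
df = DataFrame(id = 1:3, country = ["A", "B", "C"], pop = [10, 20, 30])

# select (no bang) returns a new DataFrame; df keeps all three columns
df_small = select(df, :id, :country)
ncol(df)          # 3
df_small === df   # false

# select! (with bang) drops the other columns from df itself
# and returns that same object
df_alias = select!(df, :id, :country)
ncol(df)          # 2
df_alias === df   # true
```

If the original table should stay intact, using `select` instead of `select!` is probably the simplest option.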
describe(wp_clean)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
2 | country | | Afghanistan | | Zimbabwe | 0 | String |
3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
describe(wp)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
2 | country | | Afghanistan | | Zimbabwe | 0 | String |
3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
To check whether a value (for example, a country name) is present in a column, use the `in` operator.
"Tanzania" in wp.country
true
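A few variations on the membership check, sketched on a plain vector. The `countries` vector here is a hypothetical stand-in for `wp.country`; everything used is from Base Julia.

```julia
# Hypothetical stand-in for wp.country
countries = ["India", "China", "Tanzania"]

"Tanzania" in countries    # true
"tanzania" in countries    # false: the match is exact and case-sensitive
"Tanzania" ∈ countries     # true: ∈ is the same operator as `in`
"Atlantis" ∉ countries     # true: ∉ is the negated form

# For a partial match, one option is any() with occursin()
any(c -> occursin("Tan", c), countries)   # true
```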
We can get the index of a specific country with the `findall()` or `findfirst()` functions.
# with an anonymous function
findall(x -> x == "Tanzania", wp.country)
# or using the partially applied == function
findall(==("Tanzania"), wp.country)
1-element Vector{Int64}:
 21
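The two functions return different things, which matters when the result is used for indexing. A minimal sketch, assuming a hypothetical `countries` vector with a duplicate entry:

```julia
# Hypothetical vector with a repeated value
countries = ["India", "China", "Tanzania", "China"]

# findall returns a vector with every matching index
all_idx = findall(==("China"), countries)       # [2, 4]

# findfirst returns only the first matching index, as a single integer
first_idx = findfirst(==("China"), countries)   # 2

# When nothing matches, findall gives an empty vector
# and findfirst gives `nothing`
none_all   = findall(==("Atlantis"), countries)
none_first = findfirst(==("Atlantis"), countries)
isempty(none_all)       # true
isnothing(none_first)   # true
```

Because `findfirst` can return `nothing`, it is worth checking the result before using it as an index.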
Once we can locate a country, we can subset the DataFrame in several ways:
# using any of the possible ways with findall() or findfirst()
wp[findall(==("Tanzania"), wp.country), :]
Row | id | country | pop2024 | growth_rate |
---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
1 | 21 | Tanzania | 69419073 | 0.0294 |
# or using broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]
Row | id | country | pop2024 | growth_rate |
---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
1 | 21 | Tanzania | 69419073 | 0.0294 |
The expression `wp.country .== "Tanzania"` returns a Boolean vector (a `BitVector`, displayed as 0s and 1s) with one entry per row, which is then used as a mask to select the rows.
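How the mask works can be seen on plain vectors. This is a small sketch, assuming two hypothetical vectors `countries` and `pops` that hold three of the rows shown in the tables above:

```julia
# Hypothetical vectors built from three rows of the dataset
countries = ["India", "China", "Tanzania"]
pops      = [1441719852, 1425178782, 69419073]

# Broadcasting == compares every element and yields a BitVector
mask = countries .== "Tanzania"
mask isa BitVector   # true
sum(mask)            # 1, the number of matching rows

# Indexing with the mask keeps only the positions where it is true
pops[mask]           # [69419073]

# Any broadcast comparison can serve as a mask
big = pops .> 1_000_000_000
countries[big]       # ["India", "China"]
```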
Footnotes:
[1] Based on a YouTube video.