Using DataFrames.jl

Urtzi Enriquez-Urzelai
2025-04-28

In this example, I work with a dataset of world population per country, downloaded from the internet. This lesson is based on a YouTube video [1].

Importing data

using DataFrames
using CSV

wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)
5×8 DataFrame
 Row │ Country        Population 2024  Population 2023  Area (km2)  Density (/km2)  Growth Rate  World %   World Rank
     │ String         Int64            Int64            String7     Float64         Float64      Float64?  Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ India               1441719852       1428627663  3M                   485.0       0.0092    0.1801           1
   2 │ China               1425178782       1425671352  9.4M                 151.0      -0.0003     0.178           2
   3 │ United States        341814420        339996563  9.1M                  37.0       0.0053    0.0427           3
   4 │ Indonesia            279798049        277534122  1.9M                 149.0       0.0082     0.035           4
   5 │ Pakistan             245209815        240485658  770.9K               318.0       0.0196    0.0306           5
describe(wp)
8×7 DataFrame
 Row │ variable         mean        min          median     max         nmissing  eltype
     │ Symbol           Union…      Any          Union…     Any         Int64     Type
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Country                      Afghanistan             Zimbabwe           0  String
   2 │ Population 2024  3.46886e7   526          5.62636e6  1441719852         0  Int64
   3 │ Population 2023  3.43744e7   518          5.6439e6   1428627663         0  Int64
   4 │ Area (km2)                   1.1K                    < 1                0  String7
   5 │ Density (/km2)   453.788     0.14         98.5       21674.0            0  Float64
   6 │ Growth Rate      0.00920043  -0.0309      0.00795    0.0483             0  Float64
   7 │ World %          0.00444649  0.0          0.00075    0.1801             6  Union{Missing, Float64}
   8 │ World Rank       117.5       1            117.5      234                0  Int64
names(wp)
8-element Vector{String}:
 "Country"
 "Population 2024"
 "Population 2023"
 "Area (km2)"
 "Density (/km2)"
 "Growth Rate"
 "World %"
 "World Rank"

Data wrangling

wp.id = 1:nrow(wp)
first(wp, 5)
5×9 DataFrame
 Row │ Country        Population 2024  Population 2023  Area (km2)  Density (/km2)  Growth Rate  World %   World Rank  id
     │ String         Int64            Int64            String7     Float64         Float64      Float64?  Int64       Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ India               1441719852       1428627663  3M                   485.0       0.0092    0.1801           1      1
   2 │ China               1425178782       1425671352  9.4M                 151.0      -0.0003     0.178           2      2
   3 │ United States        341814420        339996563  9.1M                  37.0       0.0053    0.0427           3      3
   4 │ Indonesia            279798049        277534122  1.9M                 149.0       0.0082     0.035           4      4
   5 │ Pakistan             245209815        240485658  770.9K               318.0       0.0196    0.0306           5      5
colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)
9×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     Type
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │ country                  Afghanistan             Zimbabwe           0  String
   2 │ pop2024      3.46886e7   526          5.62636e6  1441719852         0  Int64
   3 │ pop2023      3.43744e7   518          5.6439e6   1428627663         0  Int64
   4 │ area                     1.1K                    < 1                0  String7
   5 │ density      453.788     0.14         98.5       21674.0            0  Float64
   6 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483             0  Float64
   7 │ world_perc   0.00444649  0.0          0.00075    0.1801             6  Union{Missing, Float64}
   8 │ world_rank   117.5       1            117.5      234                0  Int64
   9 │ id           117.5       1            117.5      234                0  Int64
wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);

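Be careful: the bang (!) means that select! modifies the original table in place. It does not make a copy; wp_clean is just another name for the same object in memory as wp, which is why the two describe() calls below give identical output. Use select() (without the bang) to get a new table and leave wp untouched.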
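The difference between the mutating and non-mutating forms can be checked on a small table. This is a minimal sketch rather than part of the original lesson: the `toy` data frame and its values are made up for illustration.

```julia
using DataFrames

# Hypothetical toy table, standing in for `wp`
toy = DataFrame(id = 1:3, country = ["A", "B", "C"], extra = [10, 20, 30])

# select (no bang) builds a new DataFrame and leaves `toy` untouched
toy_copy = select(toy, :id, :country)
copy_is_same_object = toy_copy === toy     # false
ncols_before = ncol(toy)                   # still 3 columns

# select! (with bang) drops the columns from `toy` itself and returns `toy`
toy_alias = select!(toy, :id, :country)
alias_is_same_object = toy_alias === toy   # true
ncols_after = ncol(toy)                    # now 2 columns
```

The `===` comparison tests object identity, so it distinguishes a second name for the same table from a separate copy.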

describe(wp_clean)
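4×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ id           117.5       1            117.5      234                0  Int64
   2 │ country                  Afghanistan             Zimbabwe           0  String
   3 │ pop2024      3.46886e7   526          5.62636e6  1441719852         0  Int64
   4 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483             0  Float64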
describe(wp)
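4×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ id           117.5       1            117.5      234                0  Int64
   2 │ country                  Afghanistan             Zimbabwe           0  String
   3 │ pop2024      3.46886e7   526          5.62636e6  1441719852         0  Int64
   4 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483             0  Float64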

Subsetting

We can check whether a value (e.g. a country name) is present in a column with the in operator.

"Tanzania" in wp.country
true
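As a small illustration of membership tests outside a data frame, the sketch below uses a plain vector. The `countries` and `wanted` vectors are hypothetical stand-ins for wp.country, not data from the lesson.

```julia
# Hypothetical stand-in for wp.country
countries = ["India", "China", "Tanzania"]

has_tanzania = "Tanzania" in countries    # true
has_atlantis = "Atlantis" ∈ countries     # false; ∈ is the same operator as `in`

# Check several names at once: broadcast over `wanted` only,
# wrapping `countries` in Ref so it is treated as a single collection
wanted = ["China", "Atlantis"]
found = in.(wanted, Ref(countries))       # one Bool per element of `wanted`
```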

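We can get the index (row position) of a specific country with the findall() or findfirst() functions: findall() returns a vector of all matching indices, while findfirst() returns only the first match (or nothing if there is none).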

# with an anonymous function
findall(x -> x == "Tanzania", wp.country)

# or with the one-argument form of ==, which returns a function that compares against "Tanzania"
findall(==("Tanzania"), wp.country)
1-element Vector{Int64}:
 21
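The two functions differ in what they return, which matters when there are several matches or none. A minimal sketch on a made-up vector (the `countries` values are hypothetical, with a repeated entry on purpose):

```julia
# Hypothetical vector with a duplicated entry
countries = ["India", "China", "Tanzania", "China"]

all_china = findall(==("China"), countries)       # every matching index, as a vector
first_china = findfirst(==("China"), countries)   # only the first matching index
no_match = findfirst(==("Atlantis"), countries)   # `nothing` when there is no match

# Guard against `nothing` before using the result as an index
label = isnothing(no_match) ? "not found" : countries[no_match]
```

Because findfirst() can return `nothing`, checking the result before indexing avoids an error for names that are not in the column.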

And this allows us to subset our dataframe in several ways:

# using the indices returned by findall() (a single index from findfirst() would return a DataFrameRow instead)
wp[findall(==("Tanzania"), wp.country), :]
1×4 DataFrame
 Row │ id     country   pop2024   growth_rate
     │ Int64  String    Int64     Float64
─────┼────────────────────────────────────────
   1 │    21  Tanzania  69419073       0.0294
# or using broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]
1×4 DataFrame
 Row │ id     country   pop2024   growth_rate
     │ Int64  String    Int64     Float64
─────┼────────────────────────────────────────
   1 │    21  Tanzania  69419073       0.0294

The expression wp.country .== "Tanzania" returns a BitVector of Boolean values (true/false, displayed as 1s and 0s), which is used as a mask to select the rows.
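To see what such a mask looks like and how masks combine, here is a minimal sketch on plain vectors. The `countries` and `pops` vectors are illustrative stand-ins for the wp columns, filled with the 2024 population figures shown in the output earlier.

```julia
# Stand-ins for wp.country and wp.pop2024
countries = ["India", "China", "Tanzania"]
pops = [1441719852, 1425178782, 69419073]

mask = countries .== "Tanzania"    # element-wise comparison, one Bool per row
mask_type = typeof(mask)           # BitVector
selected = pops[mask]              # keeps only the entries where mask is true

# Masks combine with broadcast logical operators (.& for "and", .| for "or")
big_not_india = (pops .> 1_000_000_000) .& (countries .!= "India")
selected_names = countries[big_not_india]
```

The parentheses around each comparison are needed because `.&` binds more tightly than `.>` and `.!=`.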

Footnotes:

[1] Based on a YouTube video.