Using DataFrames.jl

Author

Urtzi Enriquez-Urzelai

Published

December 3, 2025

In this example, I work with a dataset of world population by country, downloaded from the internet. This lesson is based on a YouTube video (see footnote 1).

Importing data

using DataFrames
using CSV

wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)
5×8 DataFrame
 Row │ Country        Population 2024  Population 2023  Area (km2)  Density (/km2)  Growth Rate  World %   World Rank
     │ String         Int64            Int64            String7     Float64         Float64      Float64?  Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ India          1441719852       1428627663       3M          485.0           0.0092       0.1801    1
   2 │ China          1425178782       1425671352       9.4M        151.0           -0.0003      0.178     2
   3 │ United States  341814420        339996563        9.1M        37.0            0.0053       0.0427    3
   4 │ Indonesia      279798049        277534122        1.9M        149.0           0.0082       0.035     4
   5 │ Pakistan       245209815        240485658        770.9K      318.0           0.0196       0.0306    5
describe(wp)
8×7 DataFrame
 Row │ variable         mean        min          median     max         nmissing  eltype
     │ Symbol           Union…      Any          Union…     Any         Int64     Type
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Country                      Afghanistan             Zimbabwe    0         String
   2 │ Population 2024  3.46886e7   526          5.62636e6  1441719852  0         Int64
   3 │ Population 2023  3.43744e7   518          5.6439e6   1428627663  0         Int64
   4 │ Area (km2)                   1.1K                    < 1         0         String7
   5 │ Density (/km2)   453.788     0.14         98.5       21674.0     0         Float64
   6 │ Growth Rate      0.00920043  -0.0309      0.00795    0.0483      0         Float64
   7 │ World %          0.00444649  0.0          0.00075    0.1801      6         Union{Missing, Float64}
   8 │ World Rank       117.5       1            117.5      234         0         Int64
names(wp)
8-element Vector{String}:
 "Country"
 "Population 2024"
 "Population 2023"
 "Area (km2)"
 "Density (/km2)"
 "Growth Rate"
 "World %"
 "World Rank"

Data wrangling

wp.id = 1:nrow(wp)
first(wp, 5)
5×9 DataFrame
 Row │ Country        Population 2024  Population 2023  Area (km2)  Density (/km2)  Growth Rate  World %   World Rank  id
     │ String         Int64            Int64            String7     Float64         Float64      Float64?  Int64       Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ India          1441719852       1428627663       3M          485.0           0.0092       0.1801    1           1
   2 │ China          1425178782       1425671352       9.4M        151.0           -0.0003      0.178     2           2
   3 │ United States  341814420        339996563        9.1M        37.0            0.0053       0.0427    3           3
   4 │ Indonesia      279798049        277534122        1.9M        149.0           0.0082       0.035     4           4
   5 │ Pakistan       245209815        240485658        770.9K      318.0           0.0196       0.0306    5           5
colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)
9×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     Type
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ country                  Afghanistan             Zimbabwe    0         String
   2 │ pop2024      3.46886e7   526          5.62636e6  1441719852  0         Int64
   3 │ pop2023      3.43744e7   518          5.6439e6   1428627663  0         Int64
   4 │ area                     1.1K                    < 1         0         String7
   5 │ density      453.788     0.14         98.5       21674.0     0         Float64
   6 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483      0         Float64
   7 │ world_perc   0.00444649  0.0          0.00075    0.1801      6         Union{Missing, Float64}
   8 │ world_rank   117.5       1            117.5      234         0         Int64
   9 │ id           117.5       1            117.5      234         0         Int64
wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);

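Be careful: the bang (!) in select! means that the function modifies wp in place and returns that same object. wp_clean is therefore not a copy but a second name bound to the same data frame, so wp has lost the dropped columns too, as the two describe() calls show. Use select (without the bang) to get a new data frame and leave wp untouched.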

describe(wp_clean)
4×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ id           117.5       1            117.5      234         0         Int64
   2 │ country                  Afghanistan             Zimbabwe    0         String
   3 │ pop2024      3.46886e7   526          5.62636e6  1441719852  0         Int64
   4 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483      0         Float64
describe(wp)
4×7 DataFrame
 Row │ variable     mean        min          median     max         nmissing  eltype
     │ Symbol       Union…      Any          Union…     Any         Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ id           117.5       1            117.5      234         0         Int64
   2 │ country                  Afghanistan             Zimbabwe    0         String
   3 │ pop2024      3.46886e7   526          5.62636e6  1441719852  0         Int64
   4 │ growth_rate  0.00920043  -0.0309      0.00795    0.0483      0         Float64
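
The difference between the mutating and non-mutating forms can be seen on a small table. The sketch below uses a hypothetical three-row data frame (df, with made-up values) rather than the real wp, and assumes DataFrames.jl is installed.

```julia
using DataFrames

# Hypothetical toy table standing in for `wp` (values are made up)
df = DataFrame(id = 1:3, country = ["A", "B", "C"], pop = [10, 20, 30])

# Non-mutating: `select` returns a new data frame; `df` keeps all three columns
df_small = select(df, :id, :country)
ncols_before = ncol(df)           # still 3
same_object_1 = df_small === df   # false: two distinct objects

# Mutating: `select!` drops the columns from `df` itself and returns `df`
df_same = select!(df, :id, :country)
same_object_2 = df_same === df    # true: both names refer to one object
ncols_after = ncol(df)            # now 2

# If an independent copy is needed, ask for it explicitly
df_copy = copy(df)
```

In the lesson's terms, wp_clean = select(wp, :id, :country, :pop2024, :growth_rate) would have left wp with all nine columns.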

Subsetting

We can check whether a value (for example, a country name) is present in a column with the in operator.

"Tanzania" in wp.country
true
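
As a small illustration of the membership test, the sketch below uses a hypothetical vector (countries) in place of wp.country. It shows the equivalent spellings of in and one possible way to test several names at once.

```julia
# Hypothetical stand-in for wp.country
countries = ["India", "China", "Tanzania"]

has_infix   = "Tanzania" in countries     # infix form
has_call    = in("Tanzania", countries)   # function-call form
has_unicode = "Tanzania" ∈ countries      # Unicode alias of `in`
has_france  = "France" in countries       # not in the vector

# Test several names at once; `Ref` keeps `countries` from being
# broadcast over element by element
found = in.(["China", "France"], Ref(countries))
```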

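We can get the position of a specific country with findall(), which returns a vector with the indices of all matches, or findfirst(), which returns only the first matching index (or nothing if there is no match).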

# with anonymous functions
findall(x -> x == "Tanzania", wp.country)

# or using the partially applied form of ==
findall(==("Tanzania"), wp.country)
1-element Vector{Int64}:
 21
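
To make the difference between the two functions concrete, here is a minimal sketch on a hypothetical vector (countries) that contains a duplicate entry, which the real wp.country does not appear to have.

```julia
# Hypothetical vector with a repeated entry
countries = ["India", "Tanzania", "China", "Tanzania"]

all_idx   = findall(==("Tanzania"), countries)    # indices of every match
first_idx = findfirst(==("Tanzania"), countries)  # index of the first match only

# With no match, findall gives an empty vector and findfirst gives `nothing`
none_all   = findall(==("France"), countries)
none_first = findfirst(==("France"), countries)

# Guard against `nothing` before using the result as an index
label = isnothing(none_first) ? "not found" : countries[none_first]
```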

And this allows us to subset our dataframe in several ways:

# using the indices returned by findall()
# (findfirst() also works, but a single index returns a DataFrameRow)
wp[findall(==("Tanzania"), wp.country), :]
1×4 DataFrame
 Row │ id     country   pop2024   growth_rate
     │ Int64  String    Int64     Float64
─────┼────────────────────────────────────────
   1 │ 21     Tanzania  69419073  0.0294
# or using broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]
1×4 DataFrame
 Row │ id     country   pop2024   growth_rate
     │ Int64  String    Int64     Float64
─────┼────────────────────────────────────────
   1 │ 21     Tanzania  69419073  0.0294

The expression wp.country .== "Tanzania" returns a Boolean vector (a BitVector with one true/false value per row), which is then used as a mask to select the rows where it is true.
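
The mask itself can be inspected, and DataFrames.jl offers filter and subset as alternatives to indexing with it. The sketch below runs on a hypothetical three-row table (df, with made-up values) and assumes DataFrames.jl is installed; it is one way to write the same selection, not the only one.

```julia
using DataFrames

# Hypothetical toy table (values are made up)
df = DataFrame(country = ["India", "Tanzania", "China"], pop = [3, 1, 2])

# Broadcasting == over the column gives one Bool per row
mask = df.country .== "Tanzania"
mask_type = typeof(mask)   # BitVector

# Three equivalent ways to keep the matching rows
by_mask   = df[mask, :]
by_filter = filter(:country => ==("Tanzania"), df)
by_subset = subset(df, :country => ByRow(==("Tanzania")))

# Masks can be combined element-wise with .& and .|
big_not_china = df[(df.pop .> 1) .& (df.country .!= "China"), :]
```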

Footnotes:

  1. Based on a YouTube video.