Using DataFrames.jl

Urtzi Enriquez-Urzelai

2025-04-28

In this example, I work with a dataset of world population by country, downloaded from the internet. This lesson is based on a YouTube video [1].

Importing data

using DataFrames
using CSV

wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)

5×8 DataFrame

Row	Country	Population 2024	Population 2023	Area (km2)	Density (/km2)	Growth Rate	World %	World Rank
	String	Int64	Int64	String7	Float64	Float64	Float64?	Int64
1	India	1441719852	1428627663	3M	485.0	0.0092	0.1801	1
2	China	1425178782	1425671352	9.4M	151.0	-0.0003	0.178	2
3	United States	341814420	339996563	9.1M	37.0	0.0053	0.0427	3
4	Indonesia	279798049	277534122	1.9M	149.0	0.0082	0.035	4
5	Pakistan	245209815	240485658	770.9K	318.0	0.0196	0.0306	5

describe(wp)

8×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	Type
1	Country		Afghanistan		Zimbabwe	0	String
2	Population 2024	3.46886e7	526	5.62636e6	1441719852	0	Int64
3	Population 2023	3.43744e7	518	5.6439e6	1428627663	0	Int64
4	Area (km2)		1.1K		< 1	0	String7
5	Density (/km2)	453.788	0.14	98.5	21674.0	0	Float64
6	Growth Rate	0.00920043	-0.0309	0.00795	0.0483	0	Float64
7	World %	0.00444649	0.0	0.00075	0.1801	6	Union{Missing, Float64}
8	World Rank	117.5	1	117.5	234	0	Int64

names(wp)

8-element Vector{String}:
 "Country"
 "Population 2024"
 "Population 2023"
 "Area (km2)"
 "Density (/km2)"
 "Growth Rate"
 "World %"
 "World Rank"

Data wrangling

wp.id = 1:nrow(wp)
first(wp, 5)

5×9 DataFrame

Row	Country	Population 2024	Population 2023	Area (km2)	Density (/km2)	Growth Rate	World %	World Rank	id
	String	Int64	Int64	String7	Float64	Float64	Float64?	Int64	Int64
1	India	1441719852	1428627663	3M	485.0	0.0092	0.1801	1	1
2	China	1425178782	1425671352	9.4M	151.0	-0.0003	0.178	2	2
3	United States	341814420	339996563	9.1M	37.0	0.0053	0.0427	3	3
4	Indonesia	279798049	277534122	1.9M	149.0	0.0082	0.035	4	4
5	Pakistan	245209815	240485658	770.9K	318.0	0.0196	0.0306	5	5

colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)

9×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	Type
1	country		Afghanistan		Zimbabwe	0	String
2	pop2024	3.46886e7	526	5.62636e6	1441719852	0	Int64
3	pop2023	3.43744e7	518	5.6439e6	1428627663	0	Int64
4	area		1.1K		< 1	0	String7
5	density	453.788	0.14	98.5	21674.0	0	Float64
6	growth_rate	0.00920043	-0.0309	0.00795	0.0483	0	Float64
7	world_perc	0.00444649	0.0	0.00075	0.1801	6	Union{Missing, Float64}
8	world_rank	117.5	1	117.5	234	0	Int64
9	id	117.5	1	117.5	234	0	Int64

wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);

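Be careful: the bang (!) means that select! modifies the original table in place. The call returns wp itself, so wp_clean is not a copy but a second name for the same object in memory, as the two identical summaries below show. Use select (without the bang) to get a new DataFrame and leave wp untouched.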
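The difference between the two variants can be sketched on a small made-up table. The data frame df and its values are illustrative assumptions, not part of the world population dataset.

```julia
using DataFrames

# Hypothetical toy table standing in for `wp`
df = DataFrame(id = 1:3, country = ["A", "B", "C"], pop = [10, 20, 30])

# select (no bang) builds a new DataFrame; df keeps all three columns
df_small = select(df, :id, :country)
println(ncol(df))          # 3
println(df_small === df)   # false: a separate object

# select! drops the other columns from df itself and returns that same object
df_alias = select!(df, :id, :country)
println(ncol(df))          # 2
println(df_alias === df)   # true: two names, one table
```

If the original table is still needed later, the non-mutating select is the safer default.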

describe(wp_clean)

4×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	id	117.5	1	117.5	234	0	Int64
2	country		Afghanistan		Zimbabwe	0	String
3	pop2024	3.46886e7	526	5.62636e6	1441719852	0	Int64
4	growth_rate	0.00920043	-0.0309	0.00795	0.0483	0	Float64

describe(wp)

4×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	id	117.5	1	117.5	234	0	Int64
2	country		Afghanistan		Zimbabwe	0	String
3	pop2024	3.46886e7	526	5.62636e6	1441719852	0	Int64
4	growth_rate	0.00920043	-0.0309	0.00795	0.0483	0	Float64

Subsetting

We can check whether a value (e.g. a country name) is present in a column with the in operator.

"Tanzania" in wp.country

true
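The same membership test extends to several names at once. The sketch below uses a short made-up countries vector in place of wp.country; the vector, the wanted list and the name "Atlantis" are illustrative assumptions.

```julia
# Hypothetical vectors standing in for wp.country and a list of names to look up
countries = ["India", "China", "Tanzania", "Kenya"]
wanted = ["Tanzania", "Kenya", "Atlantis"]

# Single value: `in` and `∈` are the same function
println("Tanzania" in countries)   # true
println("Tanzania" ∈ countries)    # true

# Which of the wanted names are present? Ref() stops broadcasting
# from iterating over `countries` itself.
present = in.(wanted, Ref(countries))
println(present)                   # Bool[1, 1, 0]

# The other direction: which entries of `countries` are wanted?
# in(wanted) builds a one-argument predicate that is then broadcast.
mask = in(wanted).(countries)
println(mask)                      # Bool[0, 0, 1, 1]
```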

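We can get the row index of a specific country with findall(), which returns a vector of all matching indices, or findfirst(), which returns only the first matching index (or nothing if there is no match).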

# with an anonymous function
findall(x -> x == "Tanzania", wp.country)

# or with ==("Tanzania"), which returns a one-argument function that compares its input to "Tanzania"
findall(==("Tanzania"), wp.country)

1-element Vector{Int64}:
 21
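To make the difference between the two functions concrete, here is a minimal sketch on a made-up vector. The countries vector, with its deliberate duplicate, and the name "Atlantis" are assumptions for illustration only.

```julia
# Hypothetical vector standing in for wp.country, with one repeated entry
countries = ["India", "China", "Tanzania", "Kenya", "Tanzania"]

all_hits  = findall(==("Tanzania"), countries)     # every match: [3, 5]
first_hit = findfirst(==("Tanzania"), countries)   # first match only: 3
no_hit    = findfirst(==("Atlantis"), countries)   # no match: nothing

# findfirst can return `nothing`, so check before using the result as an index
idx = findfirst(==("Kenya"), countries)
if idx !== nothing
    println("Kenya is at position ", idx)          # Kenya is at position 4
end
```

Note that findall always returns a vector, even for a single match, while findfirst returns a plain integer.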

This allows us to subset our data frame in several ways:

# using the indices returned by findall() (findfirst() gives a single index, which selects a single DataFrameRow)
wp[findall(==("Tanzania"), wp.country), :]

1×4 DataFrame

Row	id	country	pop2024	growth_rate
	Int64	String	Int64	Float64
1	21	Tanzania	69419073	0.0294

# or using broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]

1×4 DataFrame

Row	id	country	pop2024	growth_rate
	Int64	String	Int64	Float64
1	21	Tanzania	69419073	0.0294

The wp.country .== "Tanzania" expression returns a BitVector of Boolean values (displayed as 0s and 1s) with one element per row; the rows where it is true are selected.
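The mask can be inspected on its own, and several masks can be combined element by element. The sketch below uses a small made-up table; the df name and the Kenya growth rate are illustrative assumptions, not values from the dataset.

```julia
using DataFrames

# Hypothetical toy table; the Kenya value is made up for illustration
df = DataFrame(country = ["India", "Tanzania", "Kenya"],
               growth_rate = [0.0092, 0.0294, 0.02])

# Broadcasting == over the column gives one Bool per row
mask = df.country .== "Tanzania"
println(typeof(mask))   # BitVector
println(mask)           # Bool[0, 1, 0]

# The mask selects the rows where it is true
println(df[mask, :])

# Combine conditions with .& (and) or .| (or); the parentheses are needed
# because & binds more tightly than the comparison operators
fast = df[(df.growth_rate .> 0.01) .& (df.country .!= "Kenya"), :]
println(fast.country)   # ["Tanzania"]
```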

Footnotes:

[1] Based on a YouTube video.