In this example, I will work with a dataset of world population by country, downloaded from the internet. This lesson is based on a YouTube video [1].
Importing data
using DataFrames
using CSV
wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)

| Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank |
|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 |
| 1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 |
| 2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 |
| 3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 |
| 4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 |
| 5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 |
describe(wp)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | Country | | Afghanistan | | Zimbabwe | 0 | String |
| 2 | Population 2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 3 | Population 2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
| 4 | Area (km2) | | 1.1K | | < 1 | 0 | String7 |
| 5 | Density (/km2) | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
| 6 | Growth Rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
| 7 | World % | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
| 8 | World Rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
names(wp)
8-element Vector{String}:
"Country"
"Population 2024"
"Population 2023"
"Area (km2)"
"Density (/km2)"
"Growth Rate"
"World %"
"World Rank"
Data wrangling
wp.id = 1:nrow(wp)
first(wp, 5)

| Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank | id |
|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 | Int64 |
| 1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 | 1 |
| 2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 | 2 |
| 3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 | 3 |
| 4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 | 4 |
| 5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 | 5 |
colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 2 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 3 | pop2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
| 4 | area | | 1.1K | | < 1 | 0 | String7 |
| 5 | density | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
| 6 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
| 7 | world_perc | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
| 8 | world_rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 9 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);
Be careful! By convention, a function whose name ends in an exclamation mark (!) modifies its argument, so select! also changes the original table. Remember that we are not making a copy here: wp_clean is a new reference to the same object in memory as wp.
describe(wp_clean)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 2 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
describe(wp)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 2 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
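To make the warning about select! concrete, here is a minimal sketch on a toy table. The columns a and b and the variable names are made up for illustration and are not part of the population dataset. It contrasts select, which returns a new DataFrame, with select!, which modifies its argument and returns that same object.

```julia
using DataFrames

# Toy table; the columns `a` and `b` are hypothetical.
df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# `select` (no !) returns a new DataFrame and leaves `df` untouched.
df_new = select(df, :a)
println(names(df))        # `df` still has both columns
println(df_new === df)    # false: a different object

# `select!` drops the other columns from `df` itself and returns `df`.
df_same = select!(df, :a)
println(names(df))        # only "a" remains in `df`
println(df_same === df)   # true: two names bound to one object
```

If the original table is still needed afterwards, calling select(wp, ...) without the exclamation mark, or working on copy(wp), are the usual ways to avoid changing it.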
Subsetting
You can check whether a string (e.g., a country name) or any other value is present using the in operator.
"Tanzania" in wp.country
true
We can get the index at which a specific country is located using the findall() or findfirst() functions.
# with an anonymous function
findall(x -> x == "Tanzania", wp.country)
1-element Vector{Int64}:
21
# or using the partially applied == function
findall(==("Tanzania"), wp.country)
1-element Vector{Int64}:
21
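The two functions differ in what they return, which matters when the result is used as an index. The sketch below uses a small hypothetical vector of names (not the real dataset, and with a deliberate duplicate) to show that findall returns a vector of every match, while findfirst returns a single index, or nothing when there is no match.

```julia
# Hypothetical list of names, used only for illustration.
countries = ["India", "China", "Tanzania", "Kenya", "Tanzania"]

# findall returns every matching index as a vector.
all_idx = findall(==("Tanzania"), countries)

# findfirst returns only the first matching index ...
first_idx = findfirst(==("Tanzania"), countries)

# ... or `nothing` when there is no match, so check before indexing.
no_idx = findfirst(==("Atlantis"), countries)

println(all_idx)
println(first_idx)
println(isnothing(no_idx))
```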
This lets us extract subsets of our data frame in several ways:
# using any of the forms shown above with findall() or findfirst()
wp[findall(==("Tanzania"), wp.country), :]

| Row | id | country | pop2024 | growth_rate |
|---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
| 1 | 21 | Tanzania | 69419073 | 0.0294 |
# or using broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]

| Row | id | country | pop2024 | growth_rate |
|---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
| 1 | 21 | Tanzania | 69419073 | 0.0294 |
The expression wp.country .== "Tanzania" returns a vector of Boolean values (displayed as 0s and 1s), with one entry per row, which is then used to select the matching rows.
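The following is a minimal sketch of how such a Boolean mask works, using plain vectors rather than the data frame. The variable names are hypothetical; the three population values are the ones shown in the tables above.

```julia
# Small illustrative vectors; the variable names are hypothetical.
countries = ["India", "China", "Tanzania"]
pops = [1441719852, 1425178782, 69419073]

# Broadcasting == over the vector compares element by element.
mask = countries .== "Tanzania"
println(mask)

# Indexing with the mask keeps only positions where it is true.
println(pops[mask])

# Masks can be combined element-wise with .& (and) or .| (or).
big_not_china = (pops .> 1_000_000_000) .& (countries .!= "China")
println(countries[big_not_china])
```

The same idea carries over to the data frame: the mask must have one entry per row, and only the rows where it is true are kept.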
[1] Based on a YouTube video.


