In this example I will work with population data for the world's countries, downloaded from the internet. This lesson is based on a YouTube video.
Importing the data
using DataFrames
using CSV
wp = DataFrame(CSV.File("./world_pop.csv"))
first(wp, 5)
| Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank |
|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 |
| 1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 |
| 2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 |
| 3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 |
| 4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 |
| 5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 |
describe(wp)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | Country | | Afghanistan | | Zimbabwe | 0 | String |
| 2 | Population 2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 3 | Population 2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
| 4 | Area (km2) | | 1.1K | | < 1 | 0 | String7 |
| 5 | Density (/km2) | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
| 6 | Growth Rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
| 7 | World % | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
| 8 | World Rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
names(wp)
8-element Vector{String}:
"Country"
"Population 2024"
"Population 2023"
"Area (km2)"
"Density (/km2)"
"Growth Rate"
"World %"
"World Rank"
Data wrangling
# add a running row identifier
wp.id = 1:nrow(wp)
first(wp, 5)
| Row | Country | Population 2024 | Population 2023 | Area (km2) | Density (/km2) | Growth Rate | World % | World Rank | id |
|---|---|---|---|---|---|---|---|---|---|
| | String | Int64 | Int64 | String7 | Float64 | Float64 | Float64? | Int64 | Int64 |
| 1 | India | 1441719852 | 1428627663 | 3M | 485.0 | 0.0092 | 0.1801 | 1 | 1 |
| 2 | China | 1425178782 | 1425671352 | 9.4M | 151.0 | -0.0003 | 0.178 | 2 | 2 |
| 3 | United States | 341814420 | 339996563 | 9.1M | 37.0 | 0.0053 | 0.0427 | 3 | 3 |
| 4 | Indonesia | 279798049 | 277534122 | 1.9M | 149.0 | 0.0082 | 0.035 | 4 | 4 |
| 5 | Pakistan | 245209815 | 240485658 | 770.9K | 318.0 | 0.0196 | 0.0306 | 5 | 5 |
colnames = [:country, :pop2024, :pop2023, :area, :density, :growth_rate, :world_perc, :world_rank, :id]
rename!(wp, colnames)
describe(wp)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 2 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 3 | pop2023 | 3.43744e7 | 518 | 5.6439e6 | 1428627663 | 0 | Int64 |
| 4 | area | | 1.1K | | < 1 | 0 | String7 |
| 5 | density | 453.788 | 0.14 | 98.5 | 21674.0 | 0 | Float64 |
| 6 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
| 7 | world_perc | 0.00444649 | 0.0 | 0.00075 | 0.1801 | 6 | Union{Missing, Float64} |
| 8 | world_rank | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 9 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
wp_clean = select!(wp, :id, :country, :pop2024, :growth_rate);
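Be careful: the exclamation mark (!) means that select! also modifies the original table. We are not making a copy here; wp_clean is just a second name pointing to the same object in memory, so wp itself now holds only these four columns. To leave wp unchanged, use select (without the exclamation mark), which returns a new DataFrame.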
describe(wp)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 2 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
describe(wp_clean)
| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | id | 117.5 | 1 | 117.5 | 234 | 0 | Int64 |
| 2 | country | | Afghanistan | | Zimbabwe | 0 | String |
| 3 | pop2024 | 3.46886e7 | 526 | 5.62636e6 | 1441719852 | 0 | Int64 |
| 4 | growth_rate | 0.00920043 | -0.0309 | 0.00795 | 0.0483 | 0 | Float64 |
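The aliasing seen with the two identical summaries is not specific to DataFrames; it is how assignment works for any mutable object in Julia. The following is a minimal sketch using a plain vector in place of the table. The names original, alias, and independent are illustrative and do not come from the lesson.

```julia
# Hypothetical data standing in for a table column.
original = [10, 20, 30]

alias = original              # a second name bound to the same array
independent = copy(original)  # a separate array with the same contents

push!(alias, 40)              # functions ending in `!` mutate their argument

println(original)             # the change is visible through both names
println(independent)          # the copy is unaffected
println(alias === original)   # `===` checks that both names are one object
```

Calling copy (or, for a DataFrame, a non-mutating function) is what breaks the link between the two names.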
Subsetting
To check whether a string (for example, a country's name) or any other value is present in a collection, Julia's in operator can be used.
true
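The call that produced the result above is not shown in the text. As a rough sketch of what such a membership test can look like, assuming a small hypothetical vector named countries in place of wp.country:

```julia
# Hypothetical stand-in for the wp.country column.
countries = ["India", "China", "Tanzania"]

has_tanzania = "Tanzania" in countries   # infix form
has_narnia = in("Narnia", countries)     # function-call form
has_china = "China" ∈ countries          # `∈` is a synonym for `in`

println((has_tanzania, has_narnia, has_china))  # (true, false, true)
```

The comparison is exact, so a lowercase "tanzania" would not match.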
To find the index at which a specific country is located, we can use the findall() or findfirst() functions.
# with an anonymous function
findall(x -> x == "Tanzania", wp.country)
# or with the partially applied == function
findall(==("Tanzania"), wp.country)
1-element Vector{Int64}:
21
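The two functions differ in what they return, which matters when a value appears more than once or not at all. A small sketch, assuming a hypothetical vector with a repeated entry (the data and variable names are made up for illustration):

```julia
# Hypothetical data with a duplicate, to show the difference.
countries = ["India", "China", "Tanzania", "China"]

all_hits = findall(==("China"), countries)     # vector with every matching index
first_hit = findfirst(==("China"), countries)  # only the first matching index
no_hit = findfirst(==("Narnia"), countries)    # `nothing` when there is no match
no_hits = findall(==("Narnia"), countries)     # empty vector when there is no match

println(all_hits)           # [2, 4]
println(first_hit)          # 2
println(isnothing(no_hit))  # true
println(isempty(no_hits))   # true
```

Because findfirst can return nothing, its result should be checked before it is used as an index.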
These indices let us select subsets of our DataFrame in several ways:
# using findall() or findfirst()
wp[findall(==("Tanzania"), wp.country), :]
| Row | id | country | pop2024 | growth_rate |
|---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
| 1 | 21 | Tanzania | 69419073 | 0.0294 |
# or via broadcasting, similar to R syntax
wp[wp.country .== "Tanzania", :]
| Row | id | country | pop2024 | growth_rate |
|---|---|---|---|---|
| | Int64 | String | Int64 | Float64 |
| 1 | 21 | Tanzania | 69419073 | 0.0294 |
The expression wp.country .== "Tanzania" returns a Boolean vector (a BitVector, which Julia displays as 0s and 1s) with one entry per row, and that vector is used as a mask to select the rows where it is true.
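The masking step can be tried without the full dataset. The sketch below uses two plain vectors as hypothetical stand-ins for the country and pop2024 columns, with values copied from the tables above:

```julia
# Hypothetical stand-ins for wp.country and wp.pop2024.
countries = ["India", "China", "Tanzania"]
pop2024 = [1441719852, 1425178782, 69419073]

mask = countries .== "Tanzania"   # element-wise comparison, one Bool per entry
println(mask isa BitVector)       # true
println(pop2024[mask])            # keeps only the positions where mask is true

# Without the dot, == compares the whole vector with the string,
# which yields a single false instead of a mask.
println(countries == "Tanzania")
```

The mask must have the same length as the collection it indexes, which is always the case when it is built from a column of the same table.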