Data analysis in Julia

Urtzi Enriquez-Urzelai
2025-04-29

Importing data

In this first example, we will import the data from Iris, that I downloaded for another course (git course). For that we only need to:

using DataFrames, DataFramesMeta
using RDatasets
using CategoricalArrays

iris = dataset("datasets", "iris")
#=iris[!, :Species] = categorical(iris[!, :Species])=#
first(iris, 5)
5×5 DataFrame
RowSepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa

Basic plotting

And a simple boxplot showing the sepal length per species:

using StatsPlots: boxplot
boxplot(iris.Species, iris.SepalLength)
using CairoMakie
fig, ax, plot = scatter(iris.PetalLength, iris.SepalLength, 
                        color=levelcode.(iris.Species))
ax2 = Axis(fig[1,2])
scatter!(ax2, iris.PetalLength, iris.PetalWidth,
         color = levelcode.(iris.Species))
fig

Principal Component Analysis (PCA)

using MultivariateStats
X = Matrix(iris[:, Not(:Species)])'
y = iris.Species
species = unique(iris.Species)

model = fit(PCA, X; maxoutdim = size(X)[1])
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)

Pattern matrix (unstandardized loadings):
────────────────────────────────────
         PC1         PC2         PC3
────────────────────────────────────
1   0.743108   0.323446   -0.16277
2  -0.173801   0.359689    0.167212
3   1.76155   -0.0854062   0.0213202
4   0.736739  -0.0371832   0.152647
────────────────────────────────────

Importance of components:
─────────────────────────────────────────────────────────
                                PC1        PC2        PC3
─────────────────────────────────────────────────────────
SS Loadings (Eigenvalues)  4.22824   0.242671   0.0782095
Variance explained         0.924619  0.0530665  0.0171026
Cumulative variance        0.924619  0.977685   0.994788
Proportion explained       0.929463  0.0533445  0.0171922
Cumulative proportion      0.929463  0.982808   1.0
─────────────────────────────────────────────────────────
X_transform = MultivariateStats.transform(model, X)'
X_df = DataFrame(Matrix(X_transform), :auto)
rename!(X_df, :x1 => :PC1, :x2 => :PC2, :x3 => :PC3)
X_df.Species = iris.Species;
fig_pca = Figure();
ax_pca = Axis(fig_pca[1, 1],
              xlabel = "PC1",
              ylabel = "PC2")
@with X_df begin
    scatter!(ax_pca, :PC1, :PC2, 
             color = levelcode.(:Species),
             label = levelcode.(:Species))
end
fig_pca

Basic inference: ANOVA on a linear model

using AnovaGLM
aovlm = anova_lm(@formula(SepalLength ~ PetalLength * Species), iris)
Analysis of Variance

Type 1 test / F test

SepalLength ~ 1 + PetalLength + Species + PetalLength & Species

Table:
────────────────────────────────────────────────────────────────────────
                       DOF     Exp.SS  Mean Square     F value  Pr(>|F|)
────────────────────────────────────────────────────────────────────────
(Intercept)              1  5121.68      5121.68    45244.8660    <1e-99
PetalLength              1    77.64        77.64      685.8998    <1e-55
Species                  2     7.8434       3.9217     34.6441    <1e-12
PetalLength & Species    2     0.3810       0.1905      1.6828    0.1895
(Residuals)            144    16.30         0.1132                
────────────────────────────────────────────────────────────────────────

Adding the regression lines

using ColorSchemes
import ColorSchemes.viridis

lmmod = lm(@formula(SepalLength ~ PetalLength), iris)
lmmod_gp = lm(@formula(SepalLength ~ PetalLength * Species), iris)
gp_mod_coef = coef(lmmod_gp)

fig = Figure()
ax = Axis(fig[1, 1],
          xlabel = "Petal Length",
          ylabel = "Petal Width")
scatter!(iris.PetalLength, iris.SepalLength, 
         color=levelcode.(iris.Species));
lines!(ax, 
       iris.PetalLength, 
       predict(lmmod), 
       linewidth = 2,
       color = :black)
ablines!(gp_mod_coef[1],
       gp_mod_coef[2],
       linewidth = 2,
       linestyle = :dash,
         color = viridis[0.0])
ablines!(gp_mod_coef[1] + gp_mod_coef[3],
       gp_mod_coef[2] + gp_mod_coef[5],
       linewidth = 2,
       linestyle = :dash,
         color = viridis[0.5])
ablines!(gp_mod_coef[1] + gp_mod_coef[4],
       gp_mod_coef[2] + gp_mod_coef[6],
       linewidth = 2,
       linestyle = :dash,
         color = viridis[1.0])
fig