Machine learning in Julia: Titanic
Introduction
In this article I describe my first experiment using Julia for machine learning, with the Titanic dataset from Kaggle as the example.
Which IDE
I experimented with three different environments for doing machine learning in Julia:
- VS Code
- Pluto.jl
- JupyterLab
VS Code works fine with Julia, but machine learning involves a lot of experimenting with settings, so a solution where you can rerun individual pieces of code is preferable. I then tried Pluto.jl: it works great, and the feedback on what is happening in the background is excellent. The only downside is that by default you can only have one statement per cell. JupyterLab does allow multiple statements per cell, but it requires installing Python even though I want to program in Julia! Also, its feedback on what is happening is not as detailed as in Pluto.jl. In the end I settled on using VS Code with .ipynb files (Jupyter notebooks). Basically you get the benefits of Jupyter without installing Jupyter.
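To reproduce that setup: besides the Julia extension for VS Code, a common route (an assumption on my side; newer versions of the extension may ship their own notebook kernel) is to install IJulia once, which registers a Julia kernel that notebook front-ends can pick up:
using Pkg
Pkg.add("IJulia")   # registers a Julia kernel for .ipynb notebooks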
Load data
Let’s start by reading the data from the CSV files and loading them into DataFrames:
using CSV, DataFrames, CategoricalArrays, Statistics, MLJ, MLJDecisionTreeInterface, TableTransforms, Flux, CUDA, Plots
train = "C:\Git\juliacode\Data\titanic-data\train.csv"
test = "C:\Git\juliacode\Data\\titanic-data\test.csv"
df_train = DataFrame(CSV.File(train))
df_test = DataFrame(CSV.File(test))
describe(df_train)
12×7 DataFrame
Row │ variable mean min median max ⋯
│ Symbol Union… Any Union… Any ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ PassengerId 446.0 1 446.0 891 ⋯
2 │ Survived 0.383838 0 0.0 1
3 │ Pclass 2.30864 1 3.0 3
4 │ Name Abbing, Mr. Anthony van Melkebeke, Mr.
5 │ Sex female male ⋯
6 │ Age 29.6991 0.42 28.0 80.0
7 │ SibSp 0.523008 0 0.0 8
8 │ Parch 0.381594 0 0.0 6
9 │ Ticket 110152 WE/P 5735 ⋯
10 │ Fare 32.2042 0.0 14.4542 512.329
11 │ Cabin A10 T
12 │ Embarked C S
3 columns omitted
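The describe output above omits some columns. Since the preprocessing step below has to deal with missing values, it is worth checking where they occur; the :nmissing statistic of describe gives a quick overview:
describe(df_train, :nmissing)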
Let’s check the types of the data in each of the columns:
schema(df_train)
┌─────────────┬────────────────────────────┬──────────────────────────┐
│ names │ scitypes │ types │
├─────────────┼────────────────────────────┼──────────────────────────┤
│ PassengerId │ Count │ Int64 │
│ Survived │ Count │ Int64 │
│ Pclass │ Count │ Int64 │
│ Name │ Textual │ String │
│ Sex │ Textual │ String7 │
│ Age │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ SibSp │ Count │ Int64 │
│ Parch │ Count │ Int64 │
│ Ticket │ Textual │ String31 │
│ Fare │ Continuous │ Float64 │
│ Cabin │ Union{Missing, Textual} │ Union{Missing, String15} │
│ Embarked │ Union{Missing, Textual} │ Union{Missing, String1} │
└─────────────┴────────────────────────────┴──────────────────────────┘
schema(df_test)
┌─────────────┬────────────────────────────┬──────────────────────────┐
│ names │ scitypes │ types │
├─────────────┼────────────────────────────┼──────────────────────────┤
│ PassengerId │ Count │ Int64 │
│ Pclass │ Count │ Int64 │
│ Name │ Textual │ String │
│ Sex │ Textual │ String7 │
│ Age │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ SibSp │ Count │ Int64 │
│ Parch │ Count │ Int64 │
│ Ticket │ Textual │ String31 │
│ Fare │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ Cabin │ Union{Missing, Textual} │ Union{Missing, String15} │
│ Embarked │ Textual │ String1 │
└─────────────┴────────────────────────────┴──────────────────────────┘
I created a function to prepare the data and select the features to be used in the machine learning step. It keeps the columns I expect to be most relevant, recodes them so that they are all numeric, and imputes missing values along the way. Finally it applies a ZScore transform to scale the Fare and Age features.
function preprocess(df)
    # Drop columns that are unlikely to help the model
    df2 = select(df, Not(["PassengerId", "Name", "Ticket", "Cabin"]))
    # Recode Embarked as integers, mapping missing to its own category 0
    df2.Embarked = replace(df2.Embarked, "S" => 1, "C" => 2, "Q" => 3)
    recode!(df2.Embarked, missing => 0)
    df2.Embarked = disallowmissing(convert.(Int64, df2.Embarked))
    # Impute missing Age values with the column mean
    age = df.Age
    recode!(age, missing => mean(skipmissing(age)))
    df2.Age = disallowmissing(age)
    # Impute missing Fare values with the column mean
    fare = df.Fare
    recode!(fare, missing => mean(skipmissing(fare)))
    df2.Fare = disallowmissing(fare)
    # Recode Sex from String to 0 (male) / 1 (female)
    df2.Sex = ifelse.(df2.Sex .== "male", 0, 1)
    # Scale the continuous features to zero mean and unit variance
    df2 = df2 |> ZScore(:Fare, :Age)
    passengerId = df[:, "PassengerId"]
    passengerId, df2
end
preprocess (generic function with 1 method)
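To make the ZScore step concrete: for each selected column it subtracts the column mean and divides by the standard deviation, i.e. z = (x - mean) / std. A toy example with made-up numbers (using the packages already loaded above):
toy = DataFrame(Fare = [10.0, 20.0, 30.0])
toy |> ZScore(:Fare)   # Fare becomes roughly [-1.0, 0.0, 1.0]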
Then I process the training set and the test set. I add the Survived column back as a categorical target and use unpack to split the training DataFrame into the target y and the feature table X (with the rows shuffled via rng=123).
passengerIdTrain, df_train2 = preprocess(df_train);
passengerIdTest, df_test2 = preprocess(df_test);
df_train2[:, :Survived] = df_train.Survived;
coerce!(df_train2, :Survived => Multiclass);
y, X = unpack(df_train2, ==(:Survived); rng=123)
(CategoricalArrays.CategoricalValue{Int64, UInt32}[1, 0, 0, 0, 1, 0, 0, 1, 1, 1 … 1, 0, 0, 0, 0, 1, 0, 0, 1, 0], 891×7 DataFrame
Row │ Pclass Sex Age SibSp Parch Fare Embarked
│ Int64 Int64 Float64 Int64 Int64 Float64 Int64
─────┼───────────────────────────────────────────────────────────────
1 │ 3 1 0.100052 0 0 -0.47332 1
2 │ 3 0 -2.13037 4 1 -0.0619641 3
3 │ 3 0 0.0 0 0 -0.489167 1
4 │ 2 0 0.330786 1 0 -0.12485 1
5 │ 1 1 1.86901 1 0 0.926933 2
6 │ 3 0 0.0615968 0 0 -0.486064 1
7 │ 3 0 0.176964 0 0 -0.479776 1
8 │ 3 0 0.176964 0 0 -0.486064 1
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
885 │ 3 0 -0.476781 0 0 -0.502582 2
886 │ 3 0 -0.130681 0 0 -0.489167 1
887 │ 1 1 -1.13053 0 1 3.60477 1
888 │ 2 0 -0.515237 0 0 -0.386454 1
889 │ 3 0 -0.822881 0 0 -0.648058 1
890 │ 3 0 1.0999 0 0 -0.48858 1
891 │ 3 0 0.0 0 0 -0.492101 3
876 rows omitted)
schema(df_train2)
┌──────────┬───────────────┬─────────────────────────────────┐
│ names │ scitypes │ types │
├──────────┼───────────────┼─────────────────────────────────┤
│ Survived │ Multiclass{2} │ CategoricalValue{Int64, UInt32} │
│ Pclass │ Count │ Int64 │
│ Sex │ Count │ Int64 │
│ Age │ Continuous │ Float64 │
│ SibSp │ Count │ Int64 │
│ Parch │ Count │ Int64 │
│ Fare │ Continuous │ Float64 │
│ Embarked │ Count │ Int64 │
└──────────┴───────────────┴─────────────────────────────────┘
For this experiment I will use a Random Forest classifier. Using ‘evaluate’ I can check via cross-validation how well it does on the training set. Better results might be achievable with a different classifier, or by tuning this one (see the sketch after the evaluation output).
RandomForest = @load RandomForestClassifier pkg = DecisionTree
randomForest = RandomForest()
evaluate(randomForest, X, y,
resampling=CV(shuffle=true),
measures=[log_loss, MLJ.accuracy],
verbosity=0)
import MLJDecisionTreeInterface ✔
PerformanceEvaluation object with these fields:
measure, operation, measurement, per_fold,
per_observation, fitted_params_per_fold,
report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬──────────────┬─────────────┬─────────┬───────
│ measure │ operation │ measurement │ 1.96*SE │ per_ ⋯
├────────────────────────────────┼──────────────┼─────────────┼─────────┼───────
│ LogLoss( │ predict │ 0.661 │ 0.283 │ [0.3 ⋯
│ tol = 2.220446049250313e-16) │ │ │ │ ⋯
│ Accuracy() │ predict_mode │ 0.82 │ N/A │ [0.8 ⋯
└────────────────────────────────┴──────────────┴─────────────┴─────────┴───────
1 column omitted
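As a follow-up on the remark that better results might be achievable: before switching classifiers, one option is to tune the Random Forest itself with MLJ’s TunedModel. A minimal sketch, where the n_trees range and the grid resolution are arbitrary choices I have not validated:
r = range(randomForest, :n_trees, lower=50, upper=500)
tunedForest = TunedModel(model=randomForest, tuning=Grid(resolution=5),
                         resampling=CV(nfolds=6, shuffle=true),
                         range=r, measure=log_loss)
machTuned = machine(tunedForest, X, y)
MLJ.fit!(machTuned);
fitted_params(machTuned).best_model   # inspect the best hyperparameters found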
Finally I create a machine using the Random Forest classifier and train it on the full training set. Then I can use that machine to predict the survivors in the test set. The score on Kaggle was 0.77272.
mach = machine(randomForest, X, y)
MLJ.fit!(mach);
yhat = predict_mode(mach, df_test2)
dfOut = DataFrame();
dfOut[:, :PassengerId] = passengerIdTest;
dfOut[:, :Survived] = yhat;
dfOut
output = "C:\Git\juliacode\Data\titanic-data\submissionRandomForest.csv"
CSV.write(output, dfOut)
"C:\Git\juliacode\Data\titanic-data\submissionRandomForest.csv"
Neural network
Neural networks are not particularly well suited to small datasets like the Titanic dataset, but let’s check how one compares against the RandomForest classifier.
model = Chain(
Dense(size(X,2) => size(X,2)),
Dense(size(X,2) => size(X,2)),
Dense(size(X,2) => 2),
softmax
) |> gpu
Chain(
Dense(7 => 7), # 56 parameters
Dense(7 => 7), # 56 parameters
Dense(7 => 2), # 16 parameters
NNlib.softmax,
) # Total: 6 arrays, 128 parameters, 856 bytes.
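Note that the Dense layers above have no activation functions, so everything before the softmax collapses into a single affine map. A variant with relu nonlinearities (an untested assumption on my part that it helps on a dataset this small) would look like this:
model_relu = Chain(
    Dense(size(X,2) => size(X,2), relu),   # hidden layer with nonlinearity
    Dense(size(X,2) => size(X,2), relu),
    Dense(size(X,2) => 2),
    softmax
) |> gpu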
target = Flux.onehotbatch(y, [0,1])
2×891 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
⋅ 1 1 1 ⋅ 1 1 ⋅ ⋅ ⋅ 1 1 1 … 1 1 ⋅ 1 1 1 1 ⋅ 1 1 ⋅ 1
1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅
loader = Flux.DataLoader((collect(Matrix(X)'), target) |> gpu, batchsize=32, shuffle=true)
28-element DataLoader(::Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, OneHotArrays.OneHotMatrix{UInt32, CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}}, shuffle=true, batchsize=32)
with first element:
(7×32 CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 2×32 OneHotMatrix(::CUDA.CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}) with eltype Bool,)
optim = Flux.setup(Flux.Adam(), model)
(layers = ((weight = Leaf(Adam{Float64}(0.001, (0.9, 0.999), 1.0e-8), (Float32[0.0 …], Float32[0.0 …], (0.9, 0.999))), bias = Leaf(…), σ = ()), (weight = Leaf(…), bias = Leaf(…), σ = ()), (weight = Leaf(…), bias = Leaf(…), σ = ()), ()),)
losses = []
for epoch in 1:100
for (x, y) in loader
loss, grads = Flux.withgradient(model) do m
# Evaluate model and loss inside gradient context:
y_hat = m(x)
Flux.crossentropy(y_hat, y)
end
Flux.update!(optim, model, grads[1])
push!(losses, loss) # logging, outside gradient context
end
end
plot(losses)
target_hat = model(collect(Matrix(X)') |> gpu) |> cpu
(Flux.onecold(target_hat) .- 1)'
1×891 adjoint(::Vector{Int64}) with eltype Int64:
1 0 0 0 1 0 0 0 1 1 0 1 0 … 0 0 1 0 0 0 0 1 0 0 0 0
function accuracy(y_hat, y_raw)
y_hat_raw = Flux.onecold(y_hat) .- 1
count(y_hat_raw .== y_raw) / length(y_raw)
end
y_hat2 = Flux.onecold(target_hat) .- 1
accuracy(target_hat, y)
0.8047138047138047
y_test = Flux.onecold( model(collect(Matrix(df_test2)') |> gpu) |> cpu ) .- 1
418-element Vector{Int64}:
0
0
0
0
1
0
1
0
1
0
⋮
1
1
1
1
0
1
0
0
0
dfOut[:, :PassengerId] = passengerIdTest;
dfOut[:, :Survived] = y_test;
output = "C:\Git\juliacode\Data\titanic-data\submissionNeuralNetwork.csv"
CSV.write(output, dfOut)
"C:\Git\juliacode\Data\titanic-data\submissionNeuralNetwork.csv"
Conclusion
Machine learning in Julia is as easy as in Python. The RandomForest classifier works quite well on the Titanic dataset. The neural network classifier I designed doesn’t give better results than the RandomForest classifier.