5: Other tools for Data Science
ISMB 2022 Madison
So far you have learned
- Data tools with Arrow.jl and Tables.jl
- Model fitting with MixedModels.jl
Other Data Science tools in Julia
- Communication with other systems: R and Python
- Package system
- Plotting
- Tuning performance
- Literate programming
Communication with other systems: Julia interoperability
Note: Both RCall and PyCall are written 100% in Julia.
RCall
Switching between Julia and R using $:
julia> using RCall
julia> foo = 1
1
R> x <- $foo
R> x
[1] 1
Macros @rget and @rput:
julia> z = 1
1
julia> @rput z
1
R> z
[1] 1
R> r = 2
julia> @rget r
2.0
julia> r
2.0
R"" string macro:
julia> R"rnorm(10)"
RObject{RealSxp}
 [1]  0.9515526 -2.1268329 -1.1197652 -1.3737837 -0.5308834 -0.1053615
 [7]  1.0949319 -0.8180752  0.7316163 -1.3735100
Large chunk of code:
julia> y=1
1
julia> R"""
       f<-function(x,y) x+y
       ret<- f(1,$y)
       """
RObject{RealSxp}
[1] 2
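The R"" macro returns an RObject that still lives on the R side. A minimal sketch of bringing results back as native Julia values with rcopy, assuming RCall.jl and a working R installation are available; the data and the names xs, ys and fit are made up for illustration:

```julia
using RCall   # assumes RCall.jl and an R installation are available

# Hypothetical data, pushed into the R session with @rput.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
@rput xs
@rput ys

# Fit in R, then copy the unnamed coefficient vector back into Julia.
R"fit <- lm(ys ~ xs)"
coefs = rcopy(R"unname(coef(fit))")   # intercept first, then slope

# Length-one R vectors come back as Julia scalars.
n = rcopy(R"length(xs)")
```

Copying with rcopy is useful when the R result feeds further Julia computation; for display only, the RObject printed by R"" is usually enough.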
A small example from this blog
Simulate data
julia> using Random
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> X = randn(3,2)
3×2 Matrix{Float64}:
  0.867347  -0.902914
 -0.901744   0.864401
 -0.494479   2.21188
julia> b = reshape([2.0, 3.0], 2,1)
2×1 Matrix{Float64}:
 2.0
 3.0
julia> y = X * b + randn(3,1)
3×1 Matrix{Float64}:
 -0.4412351955236954
  0.5179809120122916
  6.149009488103242
Fit a model
julia> @rput y
3×1 Matrix{Float64}:
 -0.4412351955236954
  0.5179809120122916
  6.149009488103242
julia> @rput X
3×2 Matrix{Float64}:
  0.867347  -0.902914
 -0.901744   0.864401
 -0.494479   2.21188
julia> R"mod <- lm(y ~ X-1)"
RObject{VecSxp}

Call:
lm(formula = y ~ X - 1)

Coefficients:
   X1     X2
2.867  3.418

julia> R"summary(mod)"
RObject{VecSxp}

Call:
lm(formula = y ~ X - 1)

Residuals:
       1        2        3
0.158301 0.148692 0.006511

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
X1   2.8669     0.2566   11.17   0.0568 .
X2   3.4180     0.1359   25.15   0.0253 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2173 on 1 degrees of freedom
Multiple R-squared: 0.9988, Adjusted R-squared: 0.9963
F-statistic: 404.8 on 2 and 1 DF, p-value: 0.03512

julia> R"plot(X[,1],y)"
PyCall
Note that (@v1.8) pkg> add PyCall will use the Conda.jl package to install a minimal Python distribution (via Miniforge) that is private to Julia (not in your PATH).
We need to make sure that the command which conda points at the conda folder inside .julia, so we need to put ~/.julia/conda/3/bin early on the PATH. In Mac zsh, we need to add export PATH=~/.julia/conda/3/bin:$PATH in the ~/.zshrc file. (Those who prefer not to conda-ize their entire environment may instead choose just to link ~/.julia/conda/3/bin/{conda,jupyter,python,python3} somewhere on their existing path, such as ~/bin.)
Simple example:
using PyCall
math = pyimport("math")
math.sin(math.pi / 4)
py"..." evaluates "..." as Python code:
py"""
import numpy as np
def sinpi(x):
    return np.sin(np.pi * x)
"""
py"sinpi"(1)
More on Julia/python connectivity
Package system
- Starting with Julia 1.6, precompilation is much faster
- Many changes under the hood that allow things to work faster and more smoothly
- A local environment can be established and preserved with Project.toml and Manifest.toml files.
- Use of Artifacts.toml allows for binary dependencies
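A minimal sketch of setting up such a local environment with the Pkg standard library. The temporary directory is an assumption made only so that the sketch leaves no files behind; in practice the environment would usually live in the project folder. The steps that need network access are left as comments.

```julia
using Pkg

# Create a scratch directory and make it the active environment.
# Note: this changes the active environment for the rest of the session.
env_dir = mktempdir()
Pkg.activate(env_dir)

# Path of the Project.toml that belongs to the active environment.
project_file = Base.active_project()

# Adding a package records it in Project.toml and pins exact versions
# in Manifest.toml (requires network access):
# Pkg.add("Arrow")

# On another machine, with both files copied over, this installs the
# versions recorded in Manifest.toml:
# Pkg.instantiate()
```

Keeping both files under version control is what makes the environment reproducible: Project.toml lists the direct dependencies and Manifest.toml records the full resolved set.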
Landscape of Julia packages for biology
- BioJulia is a combination of Julia packages for biology applications.
- Julia for Biologists is an arXiv paper describing the features that make Julia a perfect language for bioinformatics and computational biology.
- List of useful packages from another workshop, SMLP2022
Plotting
Performance tips
See more in Julia docs
@time to measure performance
julia> x = rand(1000);
julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_global() ## function gets compiled
  0.017705 seconds (15.28 k allocations: 694.484 KiB)
496.84883432553846
julia> @time sum_global()
  0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
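Even after compilation, the second call still allocates because x is an untyped global variable. A minimal sketch of the usual remedy from the Julia performance tips, passing the data as a function argument; the name sum_arg is chosen here for illustration:

```julia
x = rand(1000)

# Same loop as sum_global, but the data arrives as an argument,
# so the compiler knows its concrete type inside the function.
function sum_arg(v)
    s = 0.0
    for i in v
        s += i
    end
    return s
end

sum_arg(x)          # first call compiles the method
@time sum_arg(x)    # later calls should report few or no allocations
```

The exact timings depend on the machine and Julia version, so none are quoted here.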
Break functions into multiple definitions
The function
using LinearAlgebra
function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end
should really be written as
mynorm(x::Vector) = sqrt(real(dot(x, x)))
mynorm(A::Matrix) = maximum(svdvals(A))
to allow the compiler to directly call the most applicable code.
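A short usage sketch of the method-based version, with made-up inputs. It also shows that the explicit error branch is no longer needed, since an unsupported argument type raises a MethodError on its own:

```julia
using LinearAlgebra

# One method per argument type; dispatch selects the right one at the call site.
mynorm(x::Vector) = sqrt(real(dot(x, x)))
mynorm(A::Matrix) = maximum(svdvals(A))

v = mynorm([3.0, 4.0])            # Euclidean norm of a vector
m = mynorm([3.0 0.0; 0.0 4.0])    # largest singular value of a matrix

# Calling with an unsupported type fails with a MethodError.
unsupported = try
    mynorm("text")
    false
catch err
    err isa MethodError
end
```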
Multiple dispatch
- The choice of which method to execute when a function is applied is called dispatch
- Julia allows the dispatch process to choose based on the number of arguments given, and on the types of all of the function’s arguments
- This is called multiple dispatch
- This is different from traditional object-oriented languages, where dispatch occurs based only on the first argument
julia> f(x::Float64, y::Float64) = 2x + y
f (generic function with 1 method)
julia> f(2.0, 3.0)
7.0
julia> f(2.0, 3)
ERROR: MethodError: no method matching f(::Float64, ::Int64)
Closest candidates are:
  f(::Float64, !Matched::Float64) at none:1
Compare to
julia> f(x::Number, y::Number) = 2x + y
f (generic function with 2 methods)
julia> f(2.0, 3.0)
7.0
julia> f(2, 3.0)
7.0
julia> f(2.0, 3)
7.0
julia> f(2, 3)
7
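The examples above vary only numeric types. A minimal sketch with user-defined types makes it easier to see that the method is selected from the types of all arguments and from the number of arguments; the types Shape, Circle and Square and the function overlap are hypothetical names used only for illustration:

```julia
abstract type Shape end
struct Circle <: Shape end
struct Square <: Shape end

# The method is chosen from the types of both arguments, not just the first.
overlap(a::Circle, b::Circle) = "circle-circle"
overlap(a::Circle, b::Square) = "circle-square"
overlap(a::Square, b::Circle) = "square-circle"
overlap(a::Shape,  b::Shape)  = "generic fallback"

# Dispatch also depends on the number of arguments.
overlap(a::Shape) = "single shape"

r1 = overlap(Circle(), Square())   # most specific two-argument method
r2 = overlap(Square(), Square())   # no specific method, so the fallback is used
r3 = overlap(Circle())             # one-argument method
```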
Profiling
Read more in Julia docs.
julia> function myfunc()
           A = rand(200, 200, 400)
           maximum(A)
       end
julia> myfunc() # run once to force compilation
julia> using Profile
julia> @profile myfunc()
julia> Profile.print()
To see the profiling results, there are several graphical browsers (see Julia docs).
Other packages for performance
- BenchmarkTools.jl: performance tracking of Julia code
- Traceur.jl: You run your code, it tells you about any obvious performance traps
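A minimal sketch of BenchmarkTools.jl in use, assuming the package has been installed with pkg> add BenchmarkTools. Unlike a single @time call, it runs the expression many times and reports statistics over those runs:

```julia
using BenchmarkTools   # assumes the package is installed

x = rand(1000)

# @btime runs the expression repeatedly and prints the minimum time.
# Interpolating with $x avoids measuring access to a global variable.
@btime sum($x)

# @benchmark returns a Trial object holding the full set of timings.
trial = @benchmark sum($x)
fastest = minimum(trial)   # the fastest recorded sample
```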
Literate programming
- quarto.org. These notes are rendered with quarto!
- Jupyter
- Pluto.jl
- Weave.jl package provides “Julia markdown” and also provides support for converting between jmd files and Jupyter notebooks.
- Literate.jl is a simple package for literate programming (i.e. programming where documentation and code are “woven” together) and can generate Markdown, plain code and Jupyter notebook output.
- Documenter.jl is the standard tool for building webpages from Julia documentation
- Books.jl is a package designed to offer somewhat similar functionality to the bookdown package in R.