```julia
using Arrow              # Arrow storage and file format
using CategoricalArrays  # similar to the factor type in R
using CSV                # read/write CSV and similar formats
using Downloads          # file downloads
using DataFrames         # versatile tabular data format
using GZip               # utilities for compressed files
using RCall              # run R within Julia
using Tar                # tar archive utilities
```
Downloads, GZip and Tar are included just to demonstrate downloading and extracting tar files within Julia.
You may find it easier to simply click on the download link and extract the files by hand.
Downloading and extracting within Julia, as shown here, has a greater chance of working across various operating systems and environments in a workshop like this.
RCall is included to show the use of other systems running within Julia.
You can instead use your favorite environment, such as JupyterLab or RStudio, to run Python or R.
Note that the Quarto notebooks for these notes are easily converted to Jupyter notebooks, e.g. quarto convert notebookname.qmd.
Notes on Julia syntax
Boxes like this contain comments on Julia syntax and semantics in code examples. The character in the upper right corner of the box is a toggle to expand or collapse the contents of the box.
Objectives
- Use an example of computing interval overlaps to introduce Julia facilities for working with tabular data.
- Introduce the Arrow format for tabular data and demonstrate its use in Julia, Python/Pandas, and R.
- Show a naive approach to computing overlaps.
- Modify the approach to use a RangeTrees.jl or IntervalTrees.jl representation and to take advantage of multiple threads.
This notebook covers the first two objectives.
Task and sample data
Heng Li provides benchmark code and sample data for comparing programming languages on Bioinformatics tasks in his biofast repository.
One file, ex-anno.bed, contains a reference set of intervals; the other, ex-rna.bed, contains target intervals.
For each target interval, determine which reference intervals overlap with it.
In the benchmark, both the number of reference intervals that overlap with a target and the proportion of the target covered by the overlaps are computed.
Note that the calculation of the proportion of overlap must allow for overlapping intervals in the reference set, as shown in this figure
Downloading the sample data
The data sets for the benchmark are available in the compressed archive biofast-data-v1.tar.gz at biofast-data-v1.
The following code chunks download the tarball, if necessary, and extract the two .bed files to a directory biofast-data-v1, if necessary. (Alternatively, you can just click on the link to the tarball on the GitHub tag page and extract the files by hand.)
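A sketch of the download step; the asset URL is an assumption based on the release tag, since only the tag page is linked above:

```julia
# Hedged sketch: download the tarball only if it is not already present.
url = string("https://github.com/lh3/biofast/releases/download/",
    "biofast-data-v1/biofast-data-v1.tar.gz")   # assumed asset URL
tarball = "biofast-data-v1.tar.gz"
isfile(tarball) || Downloads.download(url, tarball)
```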
The ‘do/end’ block is yet another way of writing an anonymous function passed as the first argument in the call to gzopen, even though it occurs after that call in the code.
The effect is to uncompress the file into a stream, process the stream in this anonymous function, then close the stream.
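A minimal sketch of the pattern, using the tarball downloaded above:

```julia
# The do/end block is an anonymous function passed as gzopen's first
# argument; `stream` receives the uncompressed contents of the tarball.
tmpdir = GZip.gzopen(tarball, "r") do stream
    Tar.extract(stream)   # extract into a freshly created temporary directory
end                       # gzopen closes the stream when the block returns
```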
Extract to temporary directory
Because Tar.extract is conservative about overwriting files and requires that the directory into which the files are extracted be empty, we extract to a freshly-created temporary directory then move the files to the desired location.
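Continuing the sketch (the paths of the extracted files are assumptions about the tarball's layout):

```julia
datadir = "biofast-data-v1"
isdir(datadir) || mkdir(datadir)
for f in ("ex-anno.bed", "ex-rna.bed")   # move the files to the desired location
    mv(joinpath(tmpdir, f), joinpath(datadir, f); force = true)
end
```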
Fully qualified names for many packages of utilities
It is common for packages providing utilities to avoid name conflicts by not exporting any names from their namespace (or Module).
The fully qualified name, Tar.extract, can always be used - similar to Python naming conventions.
If a package exports a name, say foo, then after the using FooPackage directive, the unqualified name foo can be used.
The varinfo function provides a listing of the names exported by a Package (formally the package’s Module).
Compare the result below with that of, say, varinfo(DataFrames).
```julia
varinfo(Tar)
```

| name | size        | summary |
|:-----|------------:|:--------|
| Tar  | 873.450 KiB | Module  |
Initial processing
The .bed files, in this case, are simple tab-separated-value files.
Each of the language implementations in the biofast benchmark contains code to parse lines of the .bed files producing a String and two Int32 values.
Furthermore, the results of these benchmarks are written out as very large text files. I don't plan to read a several-million-line text file to check if a program is working properly.
Why go through this parsing of text files to create a numeric representation in each language?
Writing code to parse a CSV or TSV file is tedious and error-prone.
But skilled people have put a lot of work into creating packages to do just that.
More importantly they have tested, debugged, and documented their packages.
Within the CSV.jl package, the CSV.read function reads and parses a file and converts it to a type specified by the second argument.
As is common for such functions, there are many, many optional named arguments.
Positional arguments must come before named arguments.
Optionally, the comma after the last positional argument can be replaced by a semicolon, as in the sketch below.
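A hedged sketch of such a call; the column names and types are assumptions consistent with the description above:

```julia
# Read ex-anno.bed into a DataFrame; the file has no header row, so
# the names are supplied via the `header` named argument.
annodf = CSV.read(
    joinpath(datadir, "ex-anno.bed"),    # positional: the file to read
    DataFrame;                           # positional: the sink type
    header = ["chromo", "start", "stop"],
    types = [String, Int32, Int32],
    delim = '\t',
)
```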
It turns out that both of the .bed files contain many duplicate rows - probably not intentionally. We eliminate the duplicates with unique!.
Names of mutating functions
A function’s name ending in ! is a hint that it is a mutating function, which can modify one or more of its arguments.
We also change the tags "chr1" through "chr9" to "chr01" through "chr09" so that later sorting by these strings will also sort by chromosome number.
This is done with a replace! method, which takes the object to modify and one or more pairs of values of the form from => to. These pairs can be generated using string interpolation, as in this sketch (the data frame name annodf carries over from the reading sketch above):
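```julia
replacements = ["chr$i" => "chr0$i" for i in 1:9]  # string interpolation builds the pairs
replace!(annodf.chromo, replacements...)           # splatted into nine pair arguments
```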
The ellipsis (...) after replacements in this call “splats” the single name into, in this case, 9 different arguments of pairs. That is, this call is the equivalent of a call to replace! with 10 arguments: annodf.chromo, "chr1" => "chr01", "chr2" => "chr02", and so on.
Representing the chromo column as a CategoricalArray converts the individual values to indices into a vector of Strings, like a factor in R. The set of possible levels of chromo is given the natural ordering.
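A sketch, continuing with annodf:

```julia
annodf.chromo = categorical(annodf.chromo; ordered = true)  # like an R factor
levels(annodf.chromo)   # "chr01", "chr02", ..., "chrX", "chrY" in natural order
```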
Arrow file format
The Arrow project defines a memory layout and file format for storing and manipulating column-oriented, static tables (i.e. like data frames in R, Python/Pandas, and Julia).
Either ‘lz4’ or ‘zstd’ compression can be used when creating an Arrow file.
Metadata on the names and types of columns is automatically stored. Additional column or table metadata can be specified.
bed2arrow encapsulates reading the .bed file, replacing the strings in the chromo column, reducing the data frame to unique, sorted rows, and writing the Arrow file.
Named arguments with defaults are used to configure the “usual” call but allow for variations.
The Boolean overwrite named argument, which defaults to false, controls overwriting of existing files.
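A hedged sketch of such a function; the defaults and helper names are assumptions consistent with the description above:

```julia
# Sketch of a bed2arrow-like function: read the .bed file, fix the
# chromosome tags, reduce to unique sorted rows, and write an Arrow file.
const metadata = ["url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"]

function bed2arrow(
    bednm::AbstractString;                           # positional, then named arguments
    arrownm::AbstractString = replace(bednm, r"\.bed$" => ".arrow"),
    compress::Symbol = :lz4,                         # or :zstd
    replacements = ["chr$i" => "chr0$i" for i in 1:9],
    overwrite::Bool = false,                         # don't overwrite existing files
)
    if overwrite || !isfile(arrownm)
        df = CSV.read(bednm, DataFrame;
            header = ["chromo", "start", "stop"],
            types = [String, Int32, Int32],
            delim = '\t')
        replace!(df.chromo, replacements...)         # "chr1" => "chr01", etc.
        sort!(unique!(df))                           # unique, sorted rows
        df.chromo = categorical(df.chromo; ordered = true)  # dictionary-encoded in Arrow
        Arrow.write(arrownm, df; compress, metadata) # named arguments passed by name only
    end
    return arrownm
end

annoarrownm = bed2arrow(joinpath(datadir, "ex-anno.bed"))
rnaarrownm = bed2arrow(joinpath(datadir, "ex-rna.bed"))
```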
Semicolon separating positional and named arguments
In the calls to CSV.read and Arrow.write, a named argument appearing after the semicolon by name only indicates that the value passed is the object of the same name in the current namespace. That is, Arrow.write(arrownm, df; compress, metadata) is equivalent to Arrow.write(arrownm, df; compress=compress, metadata=metadata).
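Reading a file back produces an Arrow.Table (annoarrownm as returned by the sketch above):

```julia
annotbl = Arrow.Table(annoarrownm)
```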
```
Arrow.Table with 573806 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
```
Although the schema describes the chromo column as String, the values are dictionary-encoded so that each value is represented by a single byte.
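The target intervals are read in the same way:

```julia
rnatbl = Arrow.Table(rnaarrownm)
```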
```
Arrow.Table with 4685080 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
```
We can use operations like split-apply-combine on these tables to summarize their properties.
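For example, a sketch counting the intervals per chromosome in the target table:

```julia
# Wrap the Arrow.Table in a DataFrame, then group by chromosome and count rows.
combine(groupby(DataFrame(rnatbl), :chromo), nrow => :count)
```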
In the next section we will use data from chr01 for comparative timings. It has the greatest number of intervals in both the reference and target groups.
Reading Arrow files in R
In R (and in Python) the Arrow file format is confounded with an earlier file format called Feather and referred to as Feather V2.
In R the arrow::read_feather function returns a tibble. A sketch of such a session, here driven from Julia through RCall (the R arrow package is assumed to be installed):
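```julia
R"""
library(arrow)
rnatbl <- read_feather($rnaarrownm)   # rnaarrownm is interpolated from Julia
print(rnatbl)
"""
```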
In Python, pyarrow.feather.read_table returns a pyarrow Table and pyarrow.feather.read_feather returns a Pandas dataframe.
The PyCall package for Julia starts a Python process and allows communication with it, including data transfer.
I use this instead of the Python REPL when working with both Julia and Python.
Configuring Python, Conda, pyarrow, pandas, and PyCall across platforms is sufficiently complicated to almost surely cause failures for some workshop participants. Instead of evaluating this code chunk we quote the results.
```julia
julia> using PyCall

julia> fea = pyimport("pyarrow.feather");

julia> fea.read_table(rnaarrownm)
PyObject pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [  -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"]  -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]
```
Conclusions
- Julia provides an impressive array of tools for Bioinformatics and similar tasks.
- A general framework for working with tabular data is provided in Tables.jl (not shown here). Alternatively, TypedTables.jl provides a "lean and mean" implementation of both row- and column-oriented tabular data structures.
- We have shown the use of PyCall.jl and RCall.jl for running and communicating with other language systems within Julia.
- We have shown the use of the utility packages Downloads.jl, GZip.jl and Tar.jl for scripting, say within Quarto documents like these.
- Take a moment to look at the repositories for some of these Julia packages. Most (all?) of them are 100% Julia code.
Version information
```julia
versioninfo()
```
```
Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
```