```julia
using Arrow              # Arrow storage and file format
using CategoricalArrays  # similar to the factor type in R
using CSV                # read/write CSV and similar formats
using Downloads          # file downloads
using DataFrames         # versatile tabular data format
using GZip               # utilities for compressed files
using RCall              # run R within Julia
using Tar                # tar archive utilities
```
Downloads, GZip and Tar are included just to demonstrate downloading and extracting tar files within Julia.
You may find it easier to simply click on the download link and extract the files by hand.
Downloading and extracting within Julia, as shown here, has a greater chance of working across various operating systems and environments in a workshop like this.
RCall is included to show the use of other systems running within Julia.
You can instead use your favorite environment, such as JupyterLab or RStudio, to run Python or R.
Note that the Quarto notebooks for these notes are easily converted to Jupyter notebooks, e.g. quarto convert notebookname.qmd.
Notes on Julia syntax
Boxes like this contain comments on Julia syntax and semantics in code examples. The character in the upper right corner of the box is a toggle to expand or collapse the contents of the box.
Objectives
- Use an example of computing interval overlaps to introduce Julia facilities for working with tabular data.
- Introduce the Arrow format for tabular data and demonstrate its use in Julia, Python/Pandas, and R.
- Show a naive approach to computing overlaps.
- Modify the approach to use a RangeTrees.jl or IntervalTrees.jl representation and to take advantage of multiple threads.
This notebook covers the first two objectives.
Task and sample data
Heng Li provides benchmark code and sample data for comparing programming languages on Bioinformatics tasks in his biofast repository.
One file, ex-anno.bed, contains a reference set of intervals; the other, ex-rna.bed, contains target intervals.
For each target interval, determine which reference intervals overlap with it.
In the benchmark, both the number of reference intervals that overlap with a target and the proportion of the target covered by the overlaps are computed.
Note that the calculation of the proportion of overlap must allow for overlapping intervals in the reference set, as shown in this figure
Downloading the sample data
The data sets for the benchmark are available in the compressed archive biofast-data-v1.tar.gz at biofast-data-v1.
The following code chunks download the tarball, if necessary, and extract the two .bed files to a directory biofast-data-v1, if necessary. (Alternatively, you can just click on the link to the tarball on the GitHub tag page and extract the files by hand.)
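A sketch of the download step; the asset URL is an assumption based on the release tag, since only the tag page is linked above:

```julia
# Hedged sketch: download the tarball only if it is not already present.
url = string("https://github.com/lh3/biofast/releases/download/",
    "biofast-data-v1/biofast-data-v1.tar.gz")   # assumed asset URL
tarball = "biofast-data-v1.tar.gz"
isfile(tarball) || Downloads.download(url, tarball)
```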
The ‘do/end’ block is yet another way of writing an anonymous function passed as the first argument in the call to gzopen, even though it occurs after that call in the code.
The effect is to uncompress the file into a stream, process the stream in this anonymous function, then close the stream.
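A minimal sketch of the pattern, using the tarball downloaded above:

```julia
# The do/end block is an anonymous function passed as gzopen's first
# argument; `stream` receives the uncompressed contents of the tarball.
tmpdir = GZip.gzopen(tarball, "r") do stream
    Tar.extract(stream)   # extract into a freshly created temporary directory
end                       # gzopen closes the stream when the block returns
```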
Extract to temporary directory
Because Tar.extract is conservative about overwriting files and requires that the directory into which the files are extracted be empty, we extract to a freshly-created temporary directory then move the files to the desired location.
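Continuing the sketch (the paths of the extracted files are assumptions about the tarball's layout):

```julia
datadir = "biofast-data-v1"
isdir(datadir) || mkdir(datadir)
for f in ("ex-anno.bed", "ex-rna.bed")   # move the files to the desired location
    mv(joinpath(tmpdir, f), joinpath(datadir, f); force = true)
end
```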
Fully qualified names for many packages of utilities
It is common for packages providing utilities to avoid name conflicts by not exporting any names from their namespace (or Module).
The fully qualified name, Tar.extract, can always be used - similar to Python naming conventions.
If a package exports a name, say foo, then after the using FooPackage directive, the unqualified name foo can be used.
The varinfo function provides a listing of the names exported by a Package (formally the package’s Module).
Compare the result below with that of, say, varinfo(DataFrames).
```julia
varinfo(Tar)
```

| name | size        | summary |
|:-----|------------:|:--------|
| Tar  | 873.450 KiB | Module  |
Initial processing
The .bed files, in this case, are simple tab-separated-value files.
Each of the language implementations in the biofast benchmark contains code to parse lines of the .bed files producing a String and two Int32 values.
Furthermore, the results of these benchmarks are written out as very large text files. I don't plan to read a several-million-line text file to check if a program is working properly.
Why go through this parsing of text files to create a numeric representation in each language?
Writing code to parse a CSV or TSV file is tedious and error-prone.
But skilled people have put a lot of work into creating packages to do just that.
More importantly they have tested, debugged, and documented their packages.
Within the CSV.jl package, the CSV.read function reads and parses a file and converts it to a type specified by the second argument.
As is common for such functions, there are many, many optional named arguments.
Positional arguments must come before named arguments.
Optionally, the comma after the last positional argument can be replaced by a semicolon, as in the sketch below.
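A hedged sketch of such a call; the column names and types are assumptions consistent with the description above:

```julia
# Read ex-anno.bed into a DataFrame; the file has no header row, so
# the names are supplied via the `header` named argument.
annodf = CSV.read(
    joinpath(datadir, "ex-anno.bed"),    # positional: the file to read
    DataFrame;                           # positional: the sink type
    header = ["chromo", "start", "stop"],
    types = [String, Int32, Int32],
    delim = '\t',
)
```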
It turns out that both of the .bed files contain many duplicate rows - probably not intentionally. We eliminate the duplicates with unique!.
Names of mutating functions
A function’s name ending in ! is a hint that it is a mutating function, which can modify one or more of its arguments.
We also change the tags "chr1" through "chr9" to "chr01" through "chr09" so that later sorting by these strings will also sort by chromosome number.
This is done with a replace! method, which takes the object to modify and one or more pairs of values of the form from => to. These pairs can be generated using string interpolation, as in this sketch (the data frame name annodf carries over from the reading sketch above):
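```julia
replacements = ["chr$i" => "chr0$i" for i in 1:9]  # string interpolation builds the pairs
replace!(annodf.chromo, replacements...)           # splatted into nine pair arguments
```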
The ellipsis (...) after replacements in this call “splats” the single name into, in this case, 9 different arguments of pairs. That is, this call is the equivalent of a call to replace! with 10 arguments: annodf.chromo, "chr1" => "chr01", "chr2" => "chr02", and so on.
Representing the chromo column as a CategoricalArray converts the individual values to indices into a vector of Strings, like a factor in R. The set of possible levels of chromo is given the natural ordering.
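A sketch, continuing with annodf:

```julia
annodf.chromo = categorical(annodf.chromo; ordered = true)  # like an R factor
levels(annodf.chromo)   # "chr01", "chr02", ..., "chrX", "chrY" in natural order
```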
Arrow file format
The Arrow project defines a memory layout and file format for storing and manipulating column-oriented, static tables (i.e. like data frames in R, Python/Pandas, and Julia).
Either ‘lz4’ or ‘zstd’ compression can be used when creating an Arrow file.
Metadata on the names and types of columns is automatically stored. Additional column or table metadata can be specified.
bed2arrow encapsulates reading the .bed file, replacing the strings in the chromo column, reducing the data frame to unique, sorted rows, and writing the Arrow file.
Named arguments with defaults are used to configure the “usual” call but allow for variations.
The Boolean overwrite named argument, which defaults to false, controls overwriting of existing files.
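A hedged sketch of such a function; the defaults and helper names are assumptions consistent with the description above:

```julia
# Sketch of a bed2arrow-like function: read the .bed file, fix the
# chromosome tags, reduce to unique sorted rows, and write an Arrow file.
const metadata = ["url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"]

function bed2arrow(
    bednm::AbstractString;                           # positional, then named arguments
    arrownm::AbstractString = replace(bednm, r"\.bed$" => ".arrow"),
    compress::Symbol = :lz4,                         # or :zstd
    replacements = ["chr$i" => "chr0$i" for i in 1:9],
    overwrite::Bool = false,                         # don't overwrite existing files
)
    if overwrite || !isfile(arrownm)
        df = CSV.read(bednm, DataFrame;
            header = ["chromo", "start", "stop"],
            types = [String, Int32, Int32],
            delim = '\t')
        replace!(df.chromo, replacements...)         # "chr1" => "chr01", etc.
        sort!(unique!(df))                           # unique, sorted rows
        df.chromo = categorical(df.chromo; ordered = true)  # dictionary-encoded in Arrow
        Arrow.write(arrownm, df; compress, metadata) # named arguments passed by name only
    end
    return arrownm
end

annoarrownm = bed2arrow(joinpath(datadir, "ex-anno.bed"))
rnaarrownm = bed2arrow(joinpath(datadir, "ex-rna.bed"))
```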
Semicolon separating positional and named arguments
In the calls to CSV.read and Arrow.write, a named argument appearing after the semicolon by name only indicates that the value passed is the object of the same name in the current namespace. That is, Arrow.write(arrownm, df; compress, metadata) is equivalent to Arrow.write(arrownm, df; compress=compress, metadata=metadata).
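Reading a file back produces an Arrow.Table (annoarrownm as returned by the sketch above):

```julia
annotbl = Arrow.Table(annoarrownm)
```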
```
Arrow.Table with 573806 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
```
Although the schema describes the chromo column as String, the values are dictionary-encoded so that each value is represented by a single byte.
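The target intervals are read in the same way:

```julia
rnatbl = Arrow.Table(rnaarrownm)
```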
```
Arrow.Table with 4685080 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
```
We can use operations like split-apply-combine on these tables to summarize their properties.
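For example, a sketch counting the intervals per chromosome in the target table:

```julia
# Wrap the Arrow.Table in a DataFrame, then group by chromosome and count rows.
combine(groupby(DataFrame(rnatbl), :chromo), nrow => :count)
```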
In the next section we will use data from chr01 for comparative timings. It has the greatest number of intervals in both the reference and target groups.
Reading Arrow files in R
In R (and in Python) the Arrow file format is confounded with an earlier file format called Feather and referred to as Feather V2.
In R the arrow::read_feather function returns a tibble. A sketch of such a session, here driven from Julia through RCall (the R arrow package is assumed to be installed):
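```julia
R"""
library(arrow)
rnatbl <- read_feather($rnaarrownm)   # rnaarrownm is interpolated from Julia
print(rnatbl)
"""
```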
In Python, pyarrow.feather.read_table returns a pyarrow Table and pyarrow.feather.read_feather returns a Pandas dataframe.
The PyCall package for Julia starts a Python process and allows communication with it, including data transfer.
I use this instead of the Python REPL when working with both Julia and Python.
Configuring Python, Conda, pyarrow, pandas, and PyCall across platforms is sufficiently complicated to almost surely cause failures for some workshop participants. Instead of evaluating this code chunk we quote the results.
```julia
julia> using PyCall

julia> fea = pyimport("pyarrow.feather");

julia> fea.read_table(rnaarrownm)
PyObject pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [  -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"]  -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]
```
Conclusions
- Julia provides an impressive array of tools for Bioinformatics and similar tasks.
- A general framework for working with tabular data is provided in Tables.jl (not shown here). Alternatively, TypedTables.jl provides a "lean and mean" implementation of both row- and column-oriented tabular data structures.
- We have shown the use of PyCall.jl and RCall.jl for running and communicating with other language systems within Julia.
- We have shown the use of the utility packages Downloads.jl, GZip.jl and Tar.jl for scripting, say within Quarto documents like these.
- Take a moment to look at the repositories for some of these Julia packages. Most (all?) of them are 100% Julia code.
Version information
```julia
versioninfo()
```
```
Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
```