2a: Data Tables and Arrow files

Author

Douglas Bates and Claudia Solis-Lemus

Published

2022-07-30

Load packages to be used

Code
using Arrow             # Arrow storage and file format
using CategoricalArrays # similar to the factor type in R
using CSV               # read/write CSV and similar formats
using Downloads         # file downloads
using DataFrames        # versatile tabular data format
using GZip              # utilities for compressed files
using RCall             # run R within Julia
using Tar               # tar archive utilities
  • Downloads, GZip, and Tar are included just to demonstrate downloading and extracting tar files within Julia.
  • You may find it easier to simply click on the download link and extract the files by hand.
  • Downloading and extracting within Julia, as shown here, has a greater chance of working across various operating systems and environments in a workshop like this.
  • RCall is included to show the use of other systems running within Julia.
  • You can instead use your favorite environment, such as JupyterLab or RStudio, to run Python or R.
  • Note that the Quarto notebooks for these notes are easily converted to Jupyter notebooks, e.g. quarto convert notebookname.qmd.
Notes on Julia syntax

Boxes like this contain comments on Julia syntax and semantics in code examples.

Objectives

  • Use an example of computing interval overlaps to introduce Julia facilities for working with tabular data.
  • Introduce the Arrow format for tabular data and demonstrate its use in Julia, Python/Pandas and R.
  • Show a naive approach to computing overlaps.
  • Modify the approach to use RangeTrees.jl or IntervalTrees.jl representation and to take advantage of multiple threads.

This notebook covers the first two objectives.

Task and sample data

  • Heng Li provides benchmark code and sample data for comparing programming languages on bioinformatics tasks in his biofast repository.
  • One of these tasks, an interval query, takes two .bed files to compare.
  • One file, ex-anno.bed, contains a reference set of intervals; the other, ex-rna.bed, contains target intervals.
  • For each target interval, determine which reference intervals overlap with it.
  • In the benchmark, both the number of reference intervals that overlap with a target and the proportion of the target covered by the overlap are computed.
  • Note that the calculation of the proportion of overlap must allow for overlapping intervals in the reference set, as shown in this figure.
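As a preview of the core computation (a minimal sketch with made-up coordinates; overlapwidth is a name introduced here, not part of the benchmark code), the width of the overlap of two half-open intervals [start, stop), as used in .bed files, can be computed as

overlapwidth(s1, e1, s2, e2) = max(0, min(e1, e2) - max(s1, s2))

overlapwidth(100, 200, 150, 250)  # 50: the shared span is [150, 200)
overlapwidth(100, 200, 300, 400)  # 0: disjoint intervals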

Downloading the sample data

  • The data sets for the benchmark are available in the compressed archive biofast-data-v1.tar.gz at biofast-data-v1.
  • The following code chunks download the tarball and extract the two .bed files to a directory biofast-data-v1, skipping either step if the files are already present. (Alternatively, you can just click on the link to the tarball on the GitHub tag page and extract the files by hand.)
datadir = "biofast-data-v1"
tarball = "$datadir.tar.gz"
if !isfile(tarball)
  dataurl = join(
    ["https://github.com/lh3/biofast/releases/download", datadir, tarball],
    '/',
  )
  Downloads.download(dataurl, tarball)
end
run(`ls -lh $tarball`);
-rw-rw-r-- 1 bates bates 541M Jul 30 10:51 biofast-data-v1.tar.gz

An expression like "$datadir.tar.gz" interpolates the value of datadir into the string, producing the name shown in the output.

A redundant trailing comma is allowed at the end of the list of arguments, before the closing ), in a function call.

  • The .tar.gz file is about 0.5 GB, but most of that is the data for the FASTQ parsing test.

Extract the files of interest (if not already present)

isdir(datadir) || mkdir(datadir)
bedfnms = joinpath.(datadir, ["ex-anno.bed", "ex-rna.bed"])
toextract = filter(!isfile, bedfnms)  # don't overwrite existing files
if !isempty(toextract)
  tmpdir = gzopen(tarball, "r") do io
    Tar.extract(h -> in(h.path, toextract), io)
  end
  for pathnm in toextract
    mv(joinpath(tmpdir, pathnm), pathnm; force = true)
  end
end
filter(endswith(".bed"), readdir(datadir))
2-element Vector{String}:
 "ex-anno.bed"
 "ex-rna.bed"

The call joinpath.(datadir, ["ex-anno.bed", "ex-rna.bed"]) is an example of dot vectorization.
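Any Julia function can be applied elementwise in this way; for example (toy values, not part of the extraction code):

sqrt.([1, 4, 9])                   # [1.0, 2.0, 3.0]
uppercase.(["ex-anno", "ex-rna"])  # ["EX-ANNO", "EX-RNA"]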

The expression h -> in(h.path, toextract) defines an anonymous function, in the “stabby lambda” syntax, to be used as a predicate in Tar.extract.
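The same syntax can define any small throwaway function (a toy sketch for illustration):

square = x -> x^2        # anonymous function bound to a name
square(3)                # 9
filter(x -> x > 2, 1:5)  # [3, 4, 5]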

methods(Tar.extract)
# 4 methods for generic function extract:

The ‘do/end’ block is yet another way of writing an anonymous function passed as the first argument in the call to gzopen, even though it occurs after that call in the code.

methods(gzopen)
# 4 methods for generic function gzopen:

The effect is to uncompress the file into a stream, process the stream in this anonymous function, then close the stream.
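Schematically, the two forms below are equivalent (a sketch using Base open, not the extraction code itself):

# do/end form: the block becomes the first (function) argument
open("file.txt", "r") do io
    readline(io)
end

# equivalent explicit form with an anonymous function
open(io -> readline(io), "file.txt", "r")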

Because Tar.extract is conservative about overwriting files and requires that the directory into which the files are extracted be empty, we extract to a freshly-created temporary directory then move the files to the desired location.

  • It is common for packages providing utilities to avoid name conflicts by not exporting any names from their namespace (or Module).
  • The fully qualified name, Tar.extract, can always be used, similar to Python naming conventions.
  • If a package exports a name, say foo, then after the using FooPackage directive, the unqualified name foo can be used.
  • The varinfo function provides a listing of the names exported by a Package (formally the package’s Module).
  • Compare the result below with that of, say, varinfo(DataFrames).
varinfo(Tar)
name        size summary
–––– ––––––––––– –––––––
Tar  873.450 KiB Module

Initial processing

  • The .bed files, in this case, are simple tab-separated-value files.
  • Each of the language implementations in the biofast benchmark contains code to parse lines of the .bed files producing a String and two Int32 values.
  • Furthermore, the results of these benchmarks are written out as very large text files. I don’t plan to read a several-million-line text file to check if a program is working properly.
  • Why go through this parsing of text files to create a numeric representation in each language?
  • Writing code to parse a CSV or TSV file is tedious and error-prone.
  • But skilled people have put a lot of work into creating packages to do just that.
  • More importantly they have tested, debugged, and documented their packages.
  • Within the CSV.jl package, the CSV.read function reads and parses a file and converts it to a type specified by the second argument.
  • As is common for such functions, there are many, many optional named arguments.
  • We read ex-anno.bed and create a DataFrame as
annodf = CSV.read(
  joinpath(datadir, "ex-anno.bed"),
  DataFrame;
  delim = '\t',
  types = [String, Int32, Int32],
  header = ["chromo", "start", "stop"],
)

1,193,657 rows × 3 columns (first 30 rows shown)

 Row │ chromo  start  stop
     │ String  Int32  Int32
─────┼──────────────────────
   1 │ chr1    11868  12227
   2 │ chr1    12612  12721
   3 │ chr1    13220  14409
   4 │ chr1    12009  12057
   5 │ chr1    12178  12227
   6 │ chr1    12612  12697
   7 │ chr1    12974  13052
   8 │ chr1    13220  13374
   9 │ chr1    13452  13670
  10 │ chr1    29533  29570
  11 │ chr1    24737  24891
  12 │ chr1    18267  18366
  13 │ chr1    17914  18061
  14 │ chr1    17605  17742
  15 │ chr1    17232  17368
  16 │ chr1    16857  17055
  17 │ chr1    16606  16765
  18 │ chr1    15795  15947
  19 │ chr1    15004  15038
  20 │ chr1    14403  14501
  21 │ chr1    17368  17436
  22 │ chr1    29553  30039
  23 │ chr1    30563  30667
  24 │ chr1    30975  31097
  25 │ chr1    30266  30667
  26 │ chr1    30975  31109
  27 │ chr1    30365  30503
  28 │ chr1    35720  36081
  29 │ chr1    35276  35481
  30 │ chr1    34553  35174
   ⋮ │   ⋮       ⋮      ⋮
  • Positional arguments must come before named arguments.

  • Optionally, the comma after the last positional argument can be replaced by a semicolon, as shown above.

  • It turns out that both of the .bed files contain many duplicate rows, probably not intentionally. We eliminate the duplicates with unique!.

A function’s name ending in ! is a hint that it is a mutating function, which can modify one or more of its arguments.
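For example (a toy illustration, not from the benchmark code):

v = [3, 1, 2]
sort!(v)   # sorts v in place, unlike sort(v), which returns a new vector
v          # [1, 2, 3]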

  • We also change the tags "chr1" through "chr9" to "chr01" through "chr09" so that later sorting by these strings will also sort by chromosome number.

  • This is done with a replace! method, which takes the object to modify and one or more pairs of values of the form from => to. These pairs can be generated using string interpolation as

replacements = ["chr$i" => "chr0$i" for i = 1:9]
9-element Vector{Pair{String, String}}:
 "chr1" => "chr01"
 "chr2" => "chr02"
 "chr3" => "chr03"
 "chr4" => "chr04"
 "chr5" => "chr05"
 "chr6" => "chr06"
 "chr7" => "chr07"
 "chr8" => "chr08"
 "chr9" => "chr09"

The expression on the right of the = is called a “comprehension”. I think of it as an inside-out loop.
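The comprehension above is equivalent to this explicit loop (shown only for illustration):

replacements = Pair{String,String}[]   # empty vector of Pairs
for i in 1:9
    push!(replacements, "chr$i" => "chr0$i")
end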

show(unique(replace!(annodf.chromo, replacements...)))
["chr01", "chr02", "chr03", "chr04", "chr05", "chr06", "chr07", "chrX", "chr08", "chr09", "chr11", "chr10", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr20", "chr19", "chrY", "chr22", "chr21"]

The ellipsis (...) after replacements in this call “splats” the vector into, in this case, 9 separate Pair arguments. That is, the call is equivalent to a call to replace! with 10 arguments: annodf.chromo, "chr1" => "chr01", "chr2" => "chr02", and so on.
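Splatting works in any function call; for example (toy values):

args = [3, 1, 2]
max(args...)   # equivalent to max(3, 1, 2); returns 3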

unique!(annodf)
annodf.chromo = categorical(
  annodf.chromo;
  ordered = true,
  levels = sort(levels(annodf.chromo)),
)
annodf

573,806 rows × 3 columns (first 30 rows shown)

 Row │ chromo  start  stop
     │ Cat…    Int32  Int32
─────┼──────────────────────
   1 │ chr01   11868  12227
   2 │ chr01   12612  12721
   3 │ chr01   13220  14409
   4 │ chr01   12009  12057
   5 │ chr01   12178  12227
   6 │ chr01   12612  12697
   7 │ chr01   12974  13052
   8 │ chr01   13220  13374
   9 │ chr01   13452  13670
  10 │ chr01   29533  29570
  11 │ chr01   24737  24891
  12 │ chr01   18267  18366
  13 │ chr01   17914  18061
  14 │ chr01   17605  17742
  15 │ chr01   17232  17368
  16 │ chr01   16857  17055
  17 │ chr01   16606  16765
  18 │ chr01   15795  15947
  19 │ chr01   15004  15038
  20 │ chr01   14403  14501
  21 │ chr01   17368  17436
  22 │ chr01   29553  30039
  23 │ chr01   30563  30667
  24 │ chr01   30975  31097
  25 │ chr01   30266  30667
  26 │ chr01   30975  31109
  27 │ chr01   30365  30503
  28 │ chr01   35720  36081
  29 │ chr01   35276  35481
  30 │ chr01   34553  35174
   ⋮ │   ⋮       ⋮      ⋮
show(levels(annodf.chromo))
["chr01", "chr02", "chr03", "chr04", "chr05", "chr06", "chr07", "chr08", "chr09", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"]

Representing the chromo column as a CategoricalArray converts the individual values to indices into a vector of Strings, like a factor in R. The set of possible levels of chromo is given the natural ordering.
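A small illustration of the encoding (toy data, not the annotation table):

v = categorical(["chr2", "chr1", "chr2"]; ordered = true)
levels(v)       # ["chr1", "chr2"]
levelcode.(v)   # [2, 1, 2] -- one small-integer code per element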

Arrow file format

  • The Arrow project defines a memory and file format for storing and manipulating column-oriented, static tables (i.e. like data frames in R, Python/Pandas, and Julia).
  • Either ‘lz4’ or ‘zstd’ compression can be used when creating an Arrow file.
  • Metadata on the names and types of columns is automatically stored. Additional column or table metadata can be specified.
Arrow.write(
  "./biofast-data-v1/ex-anno.arrow",
  annodf;
  compress = :lz4,
  metadata = [
    "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1",
  ],
)
"./biofast-data-v1/ex-anno.arrow"
  • bed2arrow encapsulates reading the .bed file, replacing the strings in the chromo column, reducing the data frame to unique rows, converting chromo to a categorical array with sorted levels, and writing the Arrow file.

  • Named arguments with defaults are used to configure the “usual” call but allow for variations.

  • The Boolean overwrite named argument, which defaults to false, controls overwriting of existing files.

metadata =
  ["url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"]
function bed2arrow(
  fnroot::AbstractString;
  datadir = datadir,
  delim = '\t',
  header = [:chromo, :start, :stop],
  types = [String, Int32, Int32],
  compress = :lz4,
  metadata = metadata,
  replacements = replacements,
  ordered = true,
  overwrite::Bool = false,
)
  bednm = joinpath(datadir, "$fnroot.bed")
  arrownm = joinpath(datadir, "$fnroot.arrow")
  if overwrite || !isfile(arrownm)
    df = unique!(CSV.read(bednm, DataFrame; header, delim, types))
    replace!(df.chromo, replacements...)
    df.chromo =
      categorical(df.chromo; ordered, levels = sort(unique(df.chromo)))
    Arrow.write(arrownm, df; compress, metadata)
  end
  return arrownm
end
rnaarrownm = bed2arrow("ex-rna")
"biofast-data-v1/ex-rna.arrow"
run(`ls -lh $datadir`);
total 273M
-rw-rw-r-- 1 bates bates 3.9M Jul 30 10:53 ex-anno.arrow
-rw-r--r-- 1 bates bates  28M Jul 30 10:53 ex-anno.bed
-rw-rw-r-- 1 bates bates  39M Jul 30 10:54 ex-rna.arrow
-rw-r--r-- 1 bates bates 203M Jul 30 10:53 ex-rna.bed

In the calls to CSV.read and Arrow.write, a name appearing alone after the semicolon (e.g. compress rather than compress = compress) indicates that the value passed for that named argument is the object of the same name in the current namespace. That is, Arrow.write(arrownm, df; compress, metadata) is equivalent to Arrow.write(arrownm, df; compress=compress, metadata=metadata).

Reading Arrow files in Julia

annotbl = Arrow.Table(joinpath(datadir, "ex-anno.arrow"))
Arrow.Table with 573806 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
  • Although the schema describes the chromo column as String, the values are dictionary-encoded such that each value is represented by a one-byte index.
typeof(annotbl.chromo)
Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}
@time rnatbl = Arrow.Table(rnaarrownm)
  0.031670 seconds (317 allocations: 77.842 MiB)
Arrow.Table with 4685080 rows, 3 columns, and schema:
 :chromo  String
 :start   Int32
 :stop    Int32

with metadata given by a Base.ImmutableDict{String, String} with 1 entry:
  "url" => "https://github.com/lh3/biofast/releases/tag/biofast-data-v1"
  • We can use operations like split-apply-combine on these tables to summarize their properties.
annogdf = groupby(DataFrame(annotbl), :chromo)
rnagdf = groupby(DataFrame(rnatbl), :chromo)
innerjoin(
  combine(rnagdf, nrow => :nrna),
  combine(annogdf, nrow => :nanno);
  on = :chromo,
)

24 rows × 3 columns

 Row │ chromo  nrna    nanno
     │ String  Int64   Int64
─────┼───────────────────────
   1 │ chr01   453114  52789
   2 │ chr02   298278  42563
   3 │ chr03   251858  34769
   4 │ chr04   132781  22500
   5 │ chr05   225898  26085
   6 │ chr06   307747  24935
   7 │ chr07   217038  27613
   8 │ chr08   141205  21603
   9 │ chr09   165181  20189
  10 │ chr10   162256  20061
  11 │ chr11   286584  34021
  12 │ chr12   292002  33524
  13 │ chr13    76779   9464
  14 │ chr14   171402  19995
  15 │ chr15   171736  22187
  16 │ chr16   192595  28276
  17 │ chr17   288348  35888
  18 │ chr18    60429  10179
  19 │ chr19   344775  35451
  20 │ chr20   109979  12387
  21 │ chr21    51720   6518
  22 │ chr22   132956  12724
  23 │ chrX    150330  17618
  24 │ chrY        89   2467
  • In the next section we will use data from chr01 for comparative timings. It has the greatest number of intervals in both the reference and target groups.

Reading Arrow files in R

  • In R (and in Python) the Arrow file format is conflated with an earlier file format called Feather, and is referred to as Feather V2.

  • In R the arrow::read_feather function returns a tibble. In an R session it looks like

> library(tibble)
> arrow::read_feather("biofast-data-v1/ex-rna.arrow")
# A tibble: 4,685,080 × 3
   chromo     start      stop
   <fct>      <int>     <int>
 1 chr02  216499331 216501458
 2 chr07  101239611 101245071
 3 chr19   49487626  49491841
 4 chr10   80155590  80169336
 5 chr17   76270411  76271290
 6 chr06   31268756  31272069
 7 chr05  170083214 170083368
 8 chr19   51989731  51989996
 9 chr18   55225980  55226732
10 chr16   84565611  84566066
# … with 4,685,070 more rows
  • The RCall package in Julia allows for running an R process within a Julia session.
  • One way of executing R code with RCall is to prefix a string with R, as in R"...". This causes the string to be evaluated in R.
  • $-interpolation in the string causes a Julia object to be copied into the R environment and its name in R interpolated.
R"""
library(tibble)
glimpse(rnatbl <- arrow::read_feather($rnaarrownm))
""";
Rows: 4,685,080
Columns: 3
$ chromo <fct> chr02, chr07, chr19, chr10, chr17, chr06, chr05, chr19, chr18, …
$ start  <int> 216499331, 101239611, 49487626, 80155590, 76270411, 31268756, 1…
$ stop   <int> 216501458, 101245071, 49491841, 80169336, 76271290, 31272069, 1…

Reading Arrow files in Python

  • The pyarrow package includes pyarrow.feather. Its use in a Python session looks like
>>> import pyarrow.feather as fea
>>> fea.read_table('./biofast-data-v1/ex-rna.arrow')
pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [  -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"]  -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]
>>> fea.read_feather('./biofast-data-v1/ex-rna.arrow')
        chromo      start       stop
0        chr02  216499331  216501458
1        chr07  101239611  101245071
2        chr19   49487626   49491841
3        chr10   80155590   80169336
4        chr17   76270411   76271290
...        ...        ...        ...
4685075  chr17    7014179    7014515
4685076  chr16   75627747   75631483
4685077  chr11   59636724   59666963
4685078   chrX  153785767  153787586
4685079  chr01  182839364  182887745

[4685080 rows x 3 columns]
  • read_table returns a Table object, whereas read_feather returns a Pandas dataframe.

  • The PyCall package for Julia starts a Python process and allows communication with it, including data transfer.

  • I use this instead of the Python REPL when working with both Julia and Python.

  • Configuring Python, Conda, pyarrow, pandas, and PyCall across platforms is sufficiently complicated that it would almost surely cause failures for some workshop participants. Instead of evaluating this code chunk, we quote the results.

julia> using PyCall

julia> fea = pyimport("pyarrow.feather");

julia> fea.read_table(rnaarrownm)
PyObject pyarrow.Table
chromo: dictionary<values=string, indices=int8, ordered=0> not null
start: int32 not null
stop: int32 not null
----
chromo: [  -- dictionary:
["chr01","chr02","chr03","chr04","chr05",...,"chr20","chr21","chr22","chrX","chrY"]  -- indices:
[1,6,18,9,16,...,16,15,10,22,0]]
start: [[216499331,101239611,49487626,80155590,76270411,...,7014179,75627747,59636724,153785767,182839364]]
stop: [[216501458,101245071,49491841,80169336,76271290,...,7014515,75631483,59666963,153787586,182887745]]

Conclusions

  • Julia provides an impressive array of tools for bioinformatics and similar tasks.
  • We have shown the use of Arrow.jl, CategoricalArrays.jl, CSV.jl, and DataFrames.jl for data input, storage, and manipulation.
  • Although not shown here, DataFrameMacros.jl (or DataFramesMeta.jl) and Chain.jl are worth considering for more sophisticated work with DataFrames.
  • A general framework for working with tabular data is provided in Tables.jl (not shown here). Alternatively, TypedTables.jl provides a “lean and mean” implementation of both row- and column-oriented tabular data structures.
  • We have shown the use of PyCall.jl and RCall.jl for running and communicating with other language systems within Julia.
  • We have shown the use of the utility packages Downloads.jl, GZip.jl, and Tar.jl for scripting, say within Quarto documents like these.
  • Take a moment to look at the repositories for some of these Julia packages. Most (all?) of them are 100% Julia code.

Version information

versioninfo()
Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores