2b: Determining Interval Overlap

Author

Douglas Bates and Claudia Solis-Lemus

Published

2022-07-30

Load packages to be used

Code

using Arrow          # Arrow storage and file format
using BenchmarkTools # tools for benchmarking code
using DataFrames     # versatile tabular data format
using IntervalTrees  # interval trees from BioJulia
using RangeTrees     # a bespoke implementation of interval trees
using Tables         # row- or column-oriented tabular data

using Base: intersect! # not exported from Base

datadir = joinpath(@__DIR__, "biofast-data-v1");

Strategy for computing overlaps

Split the data in the Arrow data tables into a dictionary where the keys are the chromosome tag and values are the start-stop pairs expressed as a Vector{UnitRange}. For the reference ranges, sort the vectors by increasing first value.
For the reference ranges create two other dictionaries with values from RangeTrees.jl and from IntervalTrees.jl, respectively.
Benchmark the intersection with a target UnitRange from each of the representations of the reference ranges. Do this in two ways: intersect, which allocates the storage for the result, and intersect!, which modifies one of its arguments with the result.
Compute the coverage from the vector of intersections.
Apply the methods to the complete set of targets.

Creating dictionaries of Vector{UnitRange}

A UnitRange, such as 2:10, includes the end points, accessed by first and last methods.

typeof(2:10), length(2:10), first(2:10), last(2:10)

(UnitRange{Int64}, 9, 2, 10)

However, the positions in the start and stop columns in a .bed file are not both included in the range they represent. The positions correspond to base pairs in the range start:(stop - 1) as 0-based indices on the chromosome or (start + 1):stop as 1-based indices.
It doesn’t matter which one we use as long as we are consistent in creating ranges from the reference intervals and the target intervals.
We choose to start counting from 1, just as the world’s foremost expert on counting does.
We wrap this conversion in a utility function to help ensure consistency.

asrange(start, stop) = (start+one(start)):stop

asrange (generic function with 1 method)

“One-liner” function (actually method) definitions

This method definition uses the compact “one-liner” form, like the math notation f(x) = x + 1.

one(x) versus literal 1

one(x) is used instead of the literal 1 in asrange to preserve the integer type (see also ?oneunit, which is slighly more general).

st = Int32(2314)
typeof(st + 1)       # type gets promoted to Int64

Int64

typeof(st + one(st)) # type not promoted

Int32

Create a utility, chromodict to read an Arrow.Table and convert it to a Dict{Symbol, Vector{UnitRange{T}}}, assuming that the table contains columns named :chromo, :start, and :stop.
We define two methods, one that takes a “row-table”, which is a vector of named tuples, and one that takes the file name of the arrow file, reads the (column oriented)table, converts it to a row table, and then calls the first method.

function chromodict(rtbl::Vector{<:NamedTuple})
  r1 = first(rtbl)
  itype = promote_type(typeof(r1.start), typeof(r1.stop))
  vtype = Vector{UnitRange{itype}}
  dict = Dict{Symbol,vtype}()
  for (; chromo, start, stop) in rtbl
    push!(get!(dict, Symbol(chromo), vtype()), asrange(start, stop))
  end
  return dict
end
function chromodict(fnm::AbstractString)
  return chromodict(rowtable(Arrow.Table(joinpath(datadir, fnm))))
end
tarrngvecs = chromodict("ex-rna.arrow")
refrngvecs = chromodict("ex-anno.arrow")

Dict{Symbol, Vector{UnitRange{Int32}}} with 24 entries:
  :chr21 => [5011799:5011874, 5012548:5012687, 5014386:5014471, 5016935:5017145…
  :chr15 => [19878555:19878668, 19878831:19879004, 19881201:19881307, 19882277:…
  :chr10 => [14497:14604, 14061:14299, 16502:16544, 14138:14299, 44712:44901, 4…
  :chr17 => [76723:76866, 75814:75878, 71366:71556, 65830:65887, 64099:65736, 1…
  :chr07 => [12704:12822, 26965:27199, 24314:24365, 26965:27234, 31060:31194, 2…
  :chr14 => [16057472:16057622, 18333726:18333900, 18337973:18338078, 18338243:…
  :chr08 => [64269:64320, 64091:64175, 72601:72673, 78905:79775, 72617:72701, 7…
  :chr12 => [12310:12358, 12740:12824, 13102:13201, 13370:13501, 31878:32015, 2…
  :chr18 => [11103:11595, 15617:15822, 11191:11595, 13152:13354, 15617:15928, 1…
  :chrX  => [253743:253846, 254937:255091, 276322:276394, 281482:281684, 284167…
  :chr13 => [18177555:18178465, 18176018:18176170, 18174442:18174512, 18174010:…
  :chr11 => [75780:76143, 86649:87586, 125578:125927, 121258:121426, 113116:113…
  :chr22 => [10736171:10736283, 10961283:10961338, 10959067:10959136, 10950049:…
  :chr03 => [23757:23812, 23968:24501, 54293:54346, 53348:53692, 196607:196859,…
  :chr19 => [70928:70976, 66346:66499, 60951:61894, 62113:66524, 70928:70951, 6…
  :chr05 => [58198:58915, 92151:92276, 113251:113448, 139483:140716, 143047:143…
  :chr06 => [95124:95454, 105919:106856, 144536:144885, 140211:140379, 131910:1…
  :chr20 => [87250:87359, 96005:97094, 87710:87767, 96005:96533, 142369:142686,…
  :chrY  => [2784749:2784853, 2786855:2787699, 2789827:2790328, 2827982:2828218…
  :chr04 => [49096:49956, 49554:50124, 53285:53491, 59430:59556, 60058:60153, 8…
  :chr02 => [45440:46385, 38814:41627, 42809:42952, 41220:41627, 46807:46870, 4…
  :chr01 => [11869:12227, 12613:12721, 13221:14409, 12010:12057, 12179:12227, 1…
  :chr09 => [12134:12190, 12291:12340, 12726:12834, 13088:13157, 13338:13487, 1…
  :chr16 => [11555:11908, 12294:12402, 12902:14090, 11861:11908, 12294:12378, 1…

Mutating get! for a Dict

The call get!(dict, Symbol(chromo), vtype()) in chromodict returns dict[Symbol(chromo)] or the default value, which is an empty Vector{UnitRange{T}}. For the case of the default, it also installs that key/value pair in dict.

Symbols versus Strings for Dict keys

We use Symbols for the keys in these Dicts because they are easier to type and because symbol table lookup is very fast, although that doesn’t really matter when we only have 24 distinct keys.

Destructuring a struct or NamedTuple

The expression for (; chromo, start, stop) in rtbl is equivalent to three local assignments

for r in rtbl
  chromo = r.chromo
  start = r.start
  stop = r.stop
  ...
end

In other words, it is equivalent to taking the named fields of a NamedTuple or a struct and making them available in the local namespace under the same names.

Interfaces to row- and column-oriented tables in Tables.jl

Tables.jl provides interfaces for row- and column-oriented tables, allowing for one orientation to be viewed as the other. In this case an Arrow.Table is column-oriented but rowtable of this table returns a Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}} to iterate over the rows. When a “method instance” of chromodict is compiled for such a table, the schema of the rows will be known.

@code_warntype chromodict(
  rowtable(Arrow.Table("biofast-data-v1/ex-anno.arrow")),
)

MethodInstance for chromodict(::

Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}})
  from chromodict(rtbl::Vector{<:NamedTuple}) in Main at In[7]:1
Arguments
  #self#::Core.Const(chromodict)
  rtbl::Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}}
Locals
  @_3::Union{Nothing, Tuple{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}, Int64}}
  dict::Dict{Symbol, Vector{UnitRange{Int32}}}
  vtype::Type{Vector{UnitRange{Int32}}}
  itype::Type{Int32}
  r1::NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}
  stop::Int32
  start::Int32
  chromo::String
Body::Dict{Symbol, Vector{UnitRange{Int32}}}

1 ─

(r1 = Main.first(rtbl))


│   %2  = Base.getproperty(r1, :start)::Int32
│   %3  = Main.typeof(%2)::Core.Const(Int32)
│   %4  = Base.getproperty(r1, :stop)::Int32
│   %5  = Main.typeof(%4)::Core.Const(Int32)
│         (itype = Main.promote_type(%3, %5))
│   %7  = Core.apply_type(Main.UnitRange, itype

::Core.Const(Int32))::Core.Const(UnitRange{Int32})
│         (vtype = Core.apply_type(Main.Vector, %7))
│   %9  = Core.apply_type(Main.Dict, Main.Symbol, vtype::Core.Const(Vector{UnitRange{Int32}}))::Core.Const(Dict{Symbol, Vector{UnitRange{Int32}}})
│         (dict = (%9)())
│   %11 = rtbl::Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}}
│         (@_3 = Base.iterate(%11))
│   %13 = (@_3 === nothing)::Bool
│   %14 = Base.not_int(%13)::Bool
└──       goto #4 if not %14
2 ┄ %16 = @_3::Tuple{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}, Int64}
│   %17 = Core.getfield(%16, 1)::NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}
│         (chromo = Base.getproperty(%17, :chromo))
│         (start = Base.getproperty(%17, :start))
│         (stop = Base.getproperty(%17, :stop))
│   %21 = Core.getfield(%16, 2)::Int64
│   %22 = dict::Dict{Symbol, Vector{UnitRange{Int32}}}
│   %23 = Main.Symbol(chromo)::Symbol
│   %24 = (vtype::Core.Const(Vector{UnitRange{Int32}}))()::Vector{UnitRange{Int32}}
│   %25 = Main.get!(%22, %23, %24)::Vector{UnitRange{Int32}}
│   %26 = Main.asrange(start, stop)::UnitRange{Int32}
│         Main.push!(%25, %26)
│         (@_3 = Base.iterate(%11, %21))
│   %29 = (@_3 === nothing)::Bool
│   %30 = Base.not_int(%29)::Bool
└──       goto #4 if not %30
3 ─       goto #2
4 ┄       return dict

In refrngvecs, each of the values, a Vector{UnitRange}, should be sorted by the first element of the UnitRange.
Creating an IntervalTree requires the ranges to be sorted by first element and by last element when the first elements are equal.
Define a custom lt comparison for this.

let
  function lt(x::UnitRange{T}, y::UnitRange{T}) where {T}
    fx, fy = first(x), first(y)
    return fx == fy ? last(x) < last(y) : fx < fy
  end
  for v in values(refrngvecs)
    sort!(v; lt)
  end
end
refrngvecs  # note changes in refrngvecs[:chr01]

Dict{Symbol, Vector{UnitRange{Int32}}} with 24 entries:
  :chr21 => [5011799:5011874, 5012548:5012687, 5014386:5014471, 5016935:5017145…
  :chr15 => [19878555:19878668, 19878831:19879004, 19881201:19881307, 19882277:…
  :chr10 => [14061:14299, 14138:14299, 14497:14604, 16502:16544, 44712:44901, 4…
  :chr17 => [64099:65736, 65830:65887, 71366:71556, 75814:75878, 76723:76866, 8…
  :chr07 => [12704:12822, 19018:19172, 19619:19895, 20834:21029, 24314:24365, 2…
  :chr14 => [16057472:16057622, 18333726:18333900, 18333826:18333896, 18337973:…
  :chr08 => [64091:64175, 64269:64320, 72601:72673, 72617:72701, 78905:79244, 7…
  :chr12 => [12310:12358, 12740:12824, 13102:13201, 13370:13501, 14522:14944, 1…
  :chr18 => [11103:11595, 11191:11595, 13152:13354, 14195:14653, 14490:14653, 1…
  :chrX  => [253743:253846, 254937:255091, 276322:276394, 276324:276394, 276353…
  :chr13 => [18174010:18174103, 18174442:18174512, 18176018:18176170, 18177555:…
  :chr11 => [75780:76143, 86649:87586, 112967:113111, 113116:113174, 121258:121…
  :chr22 => [10736171:10736283, 10939388:10939423, 10940597:10940707, 10941691:…
  :chr03 => [23757:23812, 23968:24501, 53348:53692, 54293:54346, 195758:195914,…
  :chr19 => [60951:61894, 62113:66524, 63821:64213, 65051:65226, 65822:66047, 6…
  :chr05 => [58198:58915, 92151:92276, 113251:113448, 139483:140716, 140258:140…
  :chr06 => [95124:95454, 105919:106856, 131910:132117, 140211:140379, 142272:1…
  :chr20 => [87250:87359, 87710:87767, 96005:96533, 96005:97094, 142369:142686,…
  :chrY  => [2784749:2784853, 2786855:2787699, 2789827:2790328, 2827982:2828218…
  :chr04 => [49096:49956, 49554:50124, 53285:53491, 53286:53491, 53295:53491, 5…
  :chr02 => [38814:41627, 41220:41627, 41221:41627, 42809:42952, 45440:46385, 4…
  :chr01 => [11869:12227, 12010:12057, 12179:12227, 12613:12697, 12613:12721, 1…
  :chr09 => [12134:12190, 12291:12340, 12726:12834, 13088:13157, 13338:13487, 1…
  :chr16 => [11555:11908, 11861:11908, 12294:12378, 12294:12402, 12663:12733, 1…

let blocks

A let/end block provides a local namespace. The custom lt comparison method will not be visible outside the scope of the block.

We will use the ranges on chromsome 1 for our timing benchmarks. The target for tests of intersection with a single target range will be the last range on chromosome 1 in “ex-rna.arrow”.

refrngvec01 = refrngvecs[:chr01]
tarrngvec01 = tarrngvecs[:chr01]
target = last(tarrngvec01)

182839365:182887745

Intersecting reference ranges with a target

Create a method to intersect a target range with a Vector{UnitRange} returning the intersections as another Vector{UnitRange}.
A common Julia programming idiom for such cases is to create two methods: one for intersect!, which modifies an argument that will hold the result, and one for intersect, which allocates the result then calls the intersect! method.
For the intersect! method we first empty! the result vector, which zeros the vector length but does not shrink the memory chunk allocated to the vector, then push! intersections onto it. If there are many calls to an intersect! method like this only a few will cause memory allocations.
The elements of our vectors of reference ranges, like refrngvec01, are sorted by their first element but globally the last elements do not have any special relationships. The best we can do to determine all intersections is a sequential scan of the sorted vector of ranges with an early exit when last(target) < first(vectorelement).

function Base.intersect!(
  result::Vector{UnitRange{T}},
  target::AbstractUnitRange{<:Integer},
  refs::Vector{UnitRange{T}},
) where {T}
  empty!(result)
  target = UnitRange{T}(target)  # coerce the type, if necessary
  lastt = last(target)
  for rng in refs
    lastt < first(rng) && break
    isect = intersect(target, rng)
    !isempty(isect) && push!(result, isect)
  end
  return result
end
function Base.intersect(
  target::AbstractUnitRange{<:Integer},
  refs::Vector{UnitRange{T}},
) where {T}
  return intersect!(similar(refs, 0), target, refs)
end
savedresult = intersect(target, refrngvec01)  # will use for comparisons

44-element Vector{UnitRange{Int32}}:
 182839369:182839456
 182839734:182839935
 182842545:182842677
 182843294:182843434
 182852233:182852344
 182853306:182853418
 182854030:182854142
 182854030:182854178
 182854041:182854178
 182855577:182855713
 182856532:182856572
 182856532:182856578
 182858104:182858240
 ⋮
 182880497:182880608
 182881264:182881425
 182881285:182881425
 182881520:182881645
 182881520:182881647
 182883139:182883216
 182883139:182883368
 182883320:182883368
 182883520:182883635
 182884613:182884813
 182887083:182887745
 182887083:182887745

result = similar(refrngvec01, 0)  # 0-length vector of same type
isequal(intersect!(result, target, refrngvec01), savedresult)

true

Dictionaries of RangeTrees and IntervalTrees

RangeTrees

RangeTrees.jl provides an implementation of interval trees using the augmented binary tree formulation.
Because the tree is represented by its root node, there is no RangeTree type or constructor, only a RangeNode.

refrngtrees = Dict(k => RangeNode(v) for (k, v) in refrngvecs)
rangetree01 = refrngtrees[:chr01]
treesize(rangetree01), treeheight(rangetree01), treebreadth(rangetree01)

(52789, 15, 20022)

print_tree(rangetree01; maxdepth = 3)

(110172080:110172489, 248937043)
├─ (37858488:37858539, 110172217)
│  ├─ (19100078:19100126, 37857709)
│  │  ├─ (7745845:7746091, 19099677)
│  │  │  ⋮
│  │  │  
│  │  └─ (27615689:27615844, 37857709)
│  │     ⋮
│  │     
│  └─ (62834038:62834116, 110172217)
│     ├─ (46194582:46194651, 62829176)
│     │  ⋮
│     │  
│     └─ (89633140:89633322, 110172217)
│        ⋮
│        
└─ (169855796:169855957, 248937043)
   ├─ (153545753:153545802, 169854964)
   │  ├─ (145901516:145901664, 153545518)
   │  │  ⋮
   │  │  
   │  └─ (157519743:157519770, 169854964)
   │     ⋮
   │     
   └─ (208028830:208030303, 248937043)
      ├─ (196793317:196793633, 208029042)
      │  ⋮
      │  
      └─ (228147615:228147699, 248937043)
         ⋮

Augmented interval trees

A smaller example may help to understand how this type of interval tree. Consider the first 7 UnitRanges in refrngvecs[:chr01]

smallrngvec = refrngvecs[:chr01][1:15]

15-element Vector{UnitRange{Int32}}:
 11869:12227
 12010:12057
 12179:12227
 12613:12697
 12613:12721
 12975:13052
 13221:13374
 13221:14409
 13453:13670
 14404:14501
 15005:15038
 15796:15947
 16607:16765
 16858:17055
 17233:17368

rn = RangeNode(smallrngvec)
print_tree(rn)

(13221:14409, 17368)
├─ (12613:12697, 13374)
│  ├─ (12010:12057, 12227)
│  │  ├─ (11869:12227, 12227)
│  │  └─ (12179:12227, 12227)
│  └─ (12975:13052, 13374)
│     ├─ (12613:12721, 12721)
│     └─ (13221:13374, 13374)
└─ (15796:15947, 17368)
   ├─ (14404:14501, 15038)
   │  ├─ (13453:13670, 13670)
   │  └─ (15005:15038, 15038)
   └─ (16858:17055, 17368)
      ├─ (16607:16765, 16765)
      └─ (17233:17368, 17368)

The UnitRange in the root node is the 7th out of the 15 sorted ranges from which the tree was constructed.
Each of the nodes in the tree can have 0, 1, or 2 child nodes. Those with 0 children are called the “leaves” of the tree.

collect(Leaves(rn))

8-element Vector{RangeNode{Int32, Int32}}:
 (11869:12227, 12227)
 (12179:12227, 12227)
 (12613:12721, 12721)
 (13221:13374, 13374)
 (13453:13670, 13670)
 (15005:15038, 15038)
 (16607:16765, 16765)
 (17233:17368, 17368)

In addition to the UnitRange it represents, each RangeNode stores maxlast, the maximum value of last(rng) for any UnitRange in the tree rooted at this node. For a leaf maxlast is simply last of its UnitRange.
For other nodes, maxlast can be larger than last of its UnitRange.
In particular, for the root node maxlast is the maximum of all the last values of the UnitRanges that generated the tree.

maximum(last.(smallrngvec))

RangeTrees.jl defines methods for generics like children, getroot, and nodevalue from AbstractTrees.jl and these allow for many other generics to be applied to a RangeNode.

children(rangetree01)

2-element Vector{RangeNode{Int32, Int32}}:
 (37858488:37858539, 110172217)
 (169855796:169855957, 248937043)

The root of the tree can be obtained from any node using getroot. The combination of getroot and children allows traversal of the tree.

getroot(first(children(first(children(rangetree01))))) # get the root from its grandchild

(110172080:110172489, 248937043)

Several methods for generic functions are exported from RangeTrees.jl

varinfo(RangeTrees)

name	size	summary
Leaves	40 bytes	UnionAll
PostOrderDFS	40 bytes	UnionAll
PreOrderDFS	80 bytes	UnionAll
RangeNode	80 bytes	UnionAll
RangeTrees	347.087 KiB	Module
children	0 bytes	children (generic function with 11 methods)
getroot	0 bytes	getroot (generic function with 3 methods)
intersect!	0 bytes	intersect! (generic function with 12 methods)
isroot	0 bytes	isroot (generic function with 3 methods)
midrange	0 bytes	midrange (generic function with 2 methods)
nodetype	0 bytes	nodetype (generic function with 3 methods)
nodevalue	0 bytes	nodevalue (generic function with 6 methods)
print_tree	0 bytes	print_tree (generic function with 3 methods)
splitrange	0 bytes	splitrange (generic function with 2 methods)
treebreadth	0 bytes	treebreadth (generic function with 1 method)
treeheight	0 bytes	treeheight (generic function with 1 method)
treesize	0 bytes	treesize (generic function with 1 method)

but most of these are defined in AbstractTrees.jl.

Evaluating and storing maxlast in the nodes allows for overlap searchs to be truncated at nodes for which maxlast < first(target). The sorting by first allows for skipping the right subtree whenever last(target) < first(thisnode) (as in the intersect! method for Vector{UnitRange}).
intersect and intersect! methods are already defined in RangeTrees.jl.
Check that their results agree with the saved result.

isequal(intersect(target, rangetree01), savedresult)

true

isequal(intersect!(result, target, rangetree01), savedresult)

true

IntervalTrees

Create a dictionary of IntervalTrees. It is somewhat tedious to get the type of the result correct and we create a function to hide the details.

function toitrees(rngdict::Dict{S,Vector{UnitRange{T}}}) where {S,T}
  return Dict(
    k => IntervalTree{T,Interval{T}}(Interval.(v)) for (k, v) in rngdict
  )
end
refintvltrees = toitrees(refrngvecs)
intvltree01 = refintvltrees[:chr01]
show(intvltree01)

IntervalTree{Int32, Interval{Int32}}
(11869,12227)
(12010,12057)
(12179,12227)
⋮
(248917279,248917401)
(248917279,248919946)
(248936581,248937043)

Creating an intersect! method is also tedious because the package has its own Interval data type and defines intersect(itr::IntervalTree, (frst, lst)) to return an iterator of Intervals in the tree, not the intersection

Interval(target)

Interval{Int32}
(182839365,182887745)

function Base.intersect!(
  res::Vector{UnitRange{T}},
  target::AbstractUnitRange,
  refs::IntervalTree{T},
) where {T}
  empty!(res)
  firstt, lastt = first(target), last(target)
  for isect in intersect(refs, (firstt, lastt))
    push!(res, max(first(isect), firstt):min(last(isect), lastt))
  end
  return res
end
isequal(intersect!(result, target, intvltree01), savedresult)

true

Time for a shootout

Vector{UnitRange}

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = refrngvec01)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  32.036 μs …  58.521 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.893 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.070 μs ± 914.027 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  ▁▂▁▁   ▂▄▃▄▄▂    ▆██▅    ▆▇▆▂          ▁       ▂             ▂
  ████▁▁▁██████▁▃██████▆▄▃▇████▆▅▆▇▇██████▇▆▃▄▃▃▁█▁▆▇▆▅▁▄▆▅▃▃▄ █
  32 μs         Histogram: log(frequency) by time      37.8 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark intersect(tar, ref) setup = (tar = target; ref = refrngvec01)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  32.079 μs … 61.796 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.389 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.767 μs ±  1.885 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
        ▆     █       ▁                                        
  ▁▂▁▁▁▁█▅▃▃▂▂█▆█▅▂▂▄▅█▆▂▂▂▃▅▃▂▂▂▂▂▁▁▁▇▄▁▁▁▁▂█▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁ ▂
  32.1 μs         Histogram: frequency by time        39.5 μs <
 Memory estimate: 1.92 KiB, allocs estimate: 4.

RangeTree

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = rangetree01)

BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  386.926 ns … 769.124 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     409.109 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   411.822 ns ±  14.841 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                     ▇█▃       ▂                                 
  ▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▅████▆▅▃▂▃▆██▅▄▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  387 ns           Histogram: frequency by time          446 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark intersect(tar, ref) setup = (tar = target; ref = rangetree01)

BenchmarkTools.Trial: 10000 samples with 197 evaluations.
 Range (min … max):  459.046 ns … 49.638 μs  ┊ GC (min … max):  0.00% … 98.68%
 Time  (median):     840.467 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   847.085 ns ±  2.332 μs  ┊ GC (mean ± σ):  13.48% ±  4.82%
    ▁▃                                               ██▅▁       
  ▅▇██▆▃▂▂▂▂▂▂▂▂▁▂▁▂▁▂▂▂▁▂▁▁▁▂▁▁▂▂▁▁▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▃▇████▆▄▃▂▂ ▃
  459 ns          Histogram: frequency by time          901 ns <
 Memory estimate: 1.92 KiB, allocs estimate: 4.

IntervalTree

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = intvltree01)

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.095 μs …  2.577 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.171 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.173 μs ± 41.586 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
                ▂▄▇█▇▆▁                                       
  ▁▁▂▂▂▃▃▄▃▄▄▄▆▇███████▆▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.1 μs         Histogram: frequency by time        1.35 μs <
 Memory estimate: 80 bytes, allocs estimate: 3.

In what follows we will use the RangeTree method for intersect! to obtain the intersections.

Determining coverage

The coverage of a target by a set of reference ranges is the proportion of the base pairs in the target that intersect with one or more of the reference ranges.
We need to somehow count the number of elements in the union of the ranges returned from intersect!.
This could be done using the union! method for a BitSet but that approach has two problems: it is comparatively slow and, in Julia versions up to 1.8.0-rc1, it can be wrong.

Bug in union! for BitSet

While creating these notes we discovered a bug in the union! method for a BitSet and a UnitRange.
There was a PR to fix it the next morning.
Versions of Julia prior to 1.8.0-rc2 can (and probably will) return incorrect values of coverage.
Replacing union!(bs, isect) by union!(bs, BitSet(isect)) avoids this “infelicity” at the expense of more memory usage and compute time.

There is a better method that takes advantage of the intersecting ranges being sorted by first
The idea is to “keep moving the goalposts”. When evaluating the coverage count, add the length of the current reference range’s intersection with only the part to the right of what has already been covered. The thing we know about the intersecting ranges is that each successive ranges’s first position is greater than or equal to the first position of all the ranges preceding it. That is, the ranges can’t “move left” as we iterate through them.

function coveragecount(
  target::AbstractUnitRange,
  isects::Vector{UnitRange{T}},
) where {T}
  leftpost, rightpost = T(first(target)), T(last(target))
  coverage = 0
  for isect in isects
    coverage += length(intersect(isect, leftpost:rightpost))
    leftpost = max(leftpost, last(isect) + one(T))
  end
  return coverage
end
coveragecount(target, savedresult)

Iterating over a collection of targets

function overlaps(targets::Vector{UnitRange{T}}, refs::RangeNode{T}) where {T}
  nover = similar(targets, Int)
  covcount = similar(targets, T)
  result = empty!(similar(targets, ceil(Int, sqrt(treesize(refs)))))
  @inbounds for (i, tar) in enumerate(targets)
    nover[i] = length(intersect!(result, tar, refs))
    covcount[i] = coveragecount(tar, result)
  end
  DataFrame(targets = targets, nover = nover, covcount = covcount)
end
overlaps(tarrngvec01, rangetree01)

453,114 rows × 3 columns

	targets	nover	covcount
	UnitRang…	Int64	Int32
1	165655371:165655620	3	250
2	84498368:84506172	6	676
3	44777749:44778638	11	837
4	23692611:23696418	9	1398
5	23692608:23696418	9	1401
6	23962516:23963450	1	935
7	23692686:23696419	9	1324
8	114717978:114720613	15	1127
9	116392931:116404775	19	2762
10	25567642:25568695	2	1054
11	23961575:23962496	2	922
12	8861023:8870500	14	1777
13	96678807:96679446	1	573
14	37536543:37554276	12	4139
15	155308882:155320663	56	5081
16	65230872:65232127	2	1256
17	23692609:23696399	9	1381
18	181089237:181090826	1	1590
19	185118995:185120707	4	1602
20	203861621:203870081	15	987
21	153990779:153992127	10	903
22	42658437:42676757	19	1431
23	2309508:2310113	1	606
24	173863895:173866795	42	2631
25	209768622:209782307	33	6936
26	45511237:45521873	15	2936
27	8861009:8861287	2	279
28	43974941:43978232	32	3051
29	89052317:89052623	1	305
30	109593271:109593512	1	242
⋮	⋮	⋮	⋮

@benchmark overlaps(targets, refs) setup =
  (targets = tarrngvec01; refs = rangetree01)

BenchmarkTools.Trial: 40 samples with 1 evaluation.
 Range (min … max):  123.388 ms … 134.102 ms  ┊ GC (min … max): 0.00% … 5.07%
 Time  (median):     125.436 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   126.632 ms ±   2.752 ms  ┊ GC (mean ± σ):  0.80% ± 1.85%
          ▃▁▁█           ▃                                       
  ▄▁▁▄▁▁▇▇████▇▁▇▁▄▁▄▁▁▁▄█▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▄▄▄▁▁▁▁▁▁▄ ▁
  123 ms           Histogram: frequency by time          134 ms <
 Memory estimate: 26.23 MiB, allocs estimate: 158533.

The whole shootin’ match

The computations on different chromosome are independent of each other and can be assigned to different threads when Julia is started with multiple threads.

function overlaps(
  tardict::Dict{Symbol,Vector{UnitRange{T}}},
  refdict::Dict{Symbol,RangeNode{T,R}},
) where {T,R}
  matchedkeys = intersect(keys(tardict), keys(refdict))
  result = Dict(k => DataFrame() for k in matchedkeys) # pre-assign key/value pairs
  @sync for k in matchedkeys
    Threads.@spawn result[k] = overlaps(tardict[k], refdict[k])
  end
  return result
end
bigresult = overlaps(tarrngvecs, refrngtrees);
bigresult[:chrX]

150,330 rows × 3 columns

	targets	nover	covcount
	UnitRang…	Int64	Int32
1	72272603:72277210	15	1955
2	119786512:119791605	7	2123
3	77826515:77829079	3	2565
4	79241981:79363480	12	2619
5	12975122:12977227	8	1691
6	119786513:119791613	7	2130
7	12975119:12977227	8	1694
8	24465328:24534418	16	2115
9	12976909:12977217	3	309
10	154400549:154400875	12	327
11	12976881:12977227	3	347
12	65738963:65741915	2	2412
13	154398500:154400899	27	1950
14	119786512:119791640	7	2158
15	24077771:24077957	1	187
16	12975119:12977224	8	1691
17	12976265:12977227	5	548
18	12975128:12977212	8	1670
19	12976915:12977223	3	309
20	12975126:12977227	8	1687
21	78049074:78049922	2	849
22	77899484:77905383	7	988
23	1386273:1386685	1	413
24	75053280:75076550	24	2163
25	119584161:119584421	2	261
26	119786525:119791615	7	2120
27	72272612:72275658	11	1784
28	153907379:153913831	38	2727
29	12976976:12977225	3	250
30	15384819:15491231	15	2184
⋮	⋮	⋮	⋮

Pre-assigning dictionary entries to avoid thread conflict

The result is initialized with all the keys that will be in the result and with values that are empty DataFrames.
Individual tasks that are @spawned allocate their DataFrame value and merely update the pointer in result to that value, eliminating the possibility of collisions with other tasks.

@benchmark overlaps(targets, refs) setup =
  (targets = tarrngvecs; refs = refrngtrees)

BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.304 s …  1.319 s  ┊ GC (min … max): 0.63% … 0.69%
 Time  (median):     1.314 s             ┊ GC (median):    0.66%
 Time  (mean ± σ):   1.313 s ± 6.888 ms  ┊ GC (mean ± σ):  0.67% ± 0.08%
  █                        █                  █          █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  1.3 s         Histogram: frequency by time        1.32 s <
 Memory estimate: 278.21 MiB, allocs estimate: 1724535.

Conclusions

The ability to work in the REPL (or VS Code or Jupyter notebooks) encourages iterative refinement of algorithms.
“Trust but verify” - when making a change or introducing new methods it helps to have results from previous methods for comparison. In general, continuous integration (CI) testing is straightforward for Julia packages and is strongly encouraged.
There are many tools for benchmarking function execution or storage allocation, allowing a developer to concentrate on where the “real problem” is.
In certain cases, enhancements like multi-threading can be achieved with very little effort.

Version information

versioninfo()

Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores