2b: Determining Interval Overlap

Author

Douglas Bates and Claudia Solis-Lemus

Published

2022-07-30

Load packages to be used

Code
using Arrow          # Arrow storage and file format
using BenchmarkTools # tools for benchmarking code
using DataFrames     # versatile tabular data format
using IntervalTrees  # interval trees from BioJulia
using RangeTrees     # a bespoke implementation of interval trees
using Tables         # row- or column-oriented tabular data

using Base: intersect! # not exported from Base

datadir = joinpath(@__DIR__, "biofast-data-v1");

Strategy for computing overlaps

  • Split the data in the Arrow data tables into a dictionary where the keys are the chromosome tag and values are the start-stop pairs expressed as a Vector{UnitRange}. For the reference ranges, sort the vectors by increasing first value.

  • For the reference ranges create two other dictionaries with values from RangeTrees.jl and from IntervalTrees.jl, respectively.

  • Benchmark the intersection with a target UnitRange from each of the representations of the reference ranges. Do this in two ways: intersect, which allocates the storage for the result, and intersect!, which modifies one of its arguments with the result.

  • Compute the coverage from the vector of intersections.

  • Apply the methods to the complete set of targets.

Creating dictionaries of Vector{UnitRange}

  • A UnitRange, such as 2:10, includes the end points, accessed by first and last methods.
typeof(2:10), length(2:10), first(2:10), last(2:10)
(UnitRange{Int64}, 9, 2, 10)
  • However, the positions in the start and stop columns in a .bed file are not both included in the range they represent. The positions correspond to base pairs in the range start:(stop - 1) as 0-based indices on the chromosome or (start + 1):stop as 1-based indices.

  • It doesn’t matter which one we use as long as we are consistent in creating ranges from the reference intervals and the target intervals.

  • We choose to start counting from 1, just as the world’s foremost expert on counting does.

  • We wrap this conversion in a utility function to help ensure consistency.

asrange(start, stop) = (start+one(start)):stop
asrange (generic function with 1 method)

This method definition uses the compact “one-liner” form, like the math notation f(x) = x + 1.

one(x) is used instead of the literal 1 in asrange to preserve the integer type (see also ?oneunit, which is slighly more general).

st = Int32(2314)
typeof(st + 1)       # type gets promoted to Int64
Int64
typeof(st + one(st)) # type not promoted
Int32
  • Create a utility, chromodict to read an Arrow.Table and convert it to a Dict{Symbol, Vector{UnitRange{T}}}, assuming that the table contains columns named :chromo, :start, and :stop.

  • We define two methods, one that takes a “row-table”, which is a vector of named tuples, and one that takes the file name of the arrow file, reads the (column oriented)table, converts it to a row table, and then calls the first method.

function chromodict(rtbl::Vector{<:NamedTuple})
  r1 = first(rtbl)
  itype = promote_type(typeof(r1.start), typeof(r1.stop))
  vtype = Vector{UnitRange{itype}}
  dict = Dict{Symbol,vtype}()
  for (; chromo, start, stop) in rtbl
    push!(get!(dict, Symbol(chromo), vtype()), asrange(start, stop))
  end
  return dict
end
function chromodict(fnm::AbstractString)
  return chromodict(rowtable(Arrow.Table(joinpath(datadir, fnm))))
end
tarrngvecs = chromodict("ex-rna.arrow")
refrngvecs = chromodict("ex-anno.arrow")
Dict{Symbol, Vector{UnitRange{Int32}}} with 24 entries:
  :chr21 => [5011799:5011874, 5012548:5012687, 5014386:5014471, 5016935:5017145…
  :chr15 => [19878555:19878668, 19878831:19879004, 19881201:19881307, 19882277:…
  :chr10 => [14497:14604, 14061:14299, 16502:16544, 14138:14299, 44712:44901, 4…
  :chr17 => [76723:76866, 75814:75878, 71366:71556, 65830:65887, 64099:65736, 1…
  :chr07 => [12704:12822, 26965:27199, 24314:24365, 26965:27234, 31060:31194, 2…
  :chr14 => [16057472:16057622, 18333726:18333900, 18337973:18338078, 18338243:…
  :chr08 => [64269:64320, 64091:64175, 72601:72673, 78905:79775, 72617:72701, 7…
  :chr12 => [12310:12358, 12740:12824, 13102:13201, 13370:13501, 31878:32015, 2…
  :chr18 => [11103:11595, 15617:15822, 11191:11595, 13152:13354, 15617:15928, 1…
  :chrX  => [253743:253846, 254937:255091, 276322:276394, 281482:281684, 284167…
  :chr13 => [18177555:18178465, 18176018:18176170, 18174442:18174512, 18174010:…
  :chr11 => [75780:76143, 86649:87586, 125578:125927, 121258:121426, 113116:113…
  :chr22 => [10736171:10736283, 10961283:10961338, 10959067:10959136, 10950049:…
  :chr03 => [23757:23812, 23968:24501, 54293:54346, 53348:53692, 196607:196859,…
  :chr19 => [70928:70976, 66346:66499, 60951:61894, 62113:66524, 70928:70951, 6…
  :chr05 => [58198:58915, 92151:92276, 113251:113448, 139483:140716, 143047:143…
  :chr06 => [95124:95454, 105919:106856, 144536:144885, 140211:140379, 131910:1…
  :chr20 => [87250:87359, 96005:97094, 87710:87767, 96005:96533, 142369:142686,…
  :chrY  => [2784749:2784853, 2786855:2787699, 2789827:2790328, 2827982:2828218…
  :chr04 => [49096:49956, 49554:50124, 53285:53491, 59430:59556, 60058:60153, 8…
  :chr02 => [45440:46385, 38814:41627, 42809:42952, 41220:41627, 46807:46870, 4…
  :chr01 => [11869:12227, 12613:12721, 13221:14409, 12010:12057, 12179:12227, 1…
  :chr09 => [12134:12190, 12291:12340, 12726:12834, 13088:13157, 13338:13487, 1…
  :chr16 => [11555:11908, 12294:12402, 12902:14090, 11861:11908, 12294:12378, 1…

The call get!(dict, Symbol(chromo), vtype()) in chromodict returns dict[Symbol(chromo)] or the default value, which is an empty Vector{UnitRange{T}}. For the case of the default, it also installs that key/value pair in dict.

We use Symbols for the keys in these Dicts because they are easier to type and because symbol table lookup is very fast, although that doesn’t really matter when we only have 24 distinct keys.

The expression for (; chromo, start, stop) in rtbl is equivalent to three local assignments

for r in rtbl
  chromo = r.chromo
  start = r.start
  stop = r.stop
  ...
end

In other words, it is equivalent to taking the named fields of a NamedTuple or a struct and making them available in the local namespace under the same names.

Tables.jl provides interfaces for row- and column-oriented tables, allowing for one orientation to be viewed as the other. In this case an Arrow.Table is column-oriented but rowtable of this table returns a Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}} to iterate over the rows. When a “method instance” of chromodict is compiled for such a table, the schema of the rows will be known.

@code_warntype chromodict(
  rowtable(Arrow.Table("biofast-data-v1/ex-anno.arrow")),
)
MethodInstance for chromodict(::
Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}})
  from chromodict(rtbl::Vector{<:NamedTuple}) in Main at In[7]:1
Arguments
  #self#::Core.Const(chromodict)
  rtbl::Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}}
Locals
  @_3::Union{Nothing, Tuple{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}, Int64}}
  dict::Dict{Symbol, Vector{UnitRange{Int32}}}
  vtype::Type{Vector{UnitRange{Int32}}}
  itype::Type{Int32}
  r1::NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}
  stop::Int32
  start::Int32
  chromo::String
Body::Dict{Symbol, Vector{UnitRange{Int32}}}
1 ─       
(r1 = Main.first(rtbl))

│   %2  = Base.getproperty(r1, :start)::Int32
│   %3  = Main.typeof(%2)::Core.Const(Int32)
│   %4  = Base.getproperty(r1, :stop)::Int32
│   %5  = Main.typeof(%4)::Core.Const(Int32)
│         (itype = Main.promote_type(%3, %5))
│   %7  = Core.apply_type(Main.UnitRange, itype
::Core.Const(Int32))::Core.Const(UnitRange{Int32})
│         (vtype = Core.apply_type(Main.Vector, %7))
│   %9  = Core.apply_type(Main.Dict, Main.Symbol, vtype::Core.Const(Vector{UnitRange{Int32}}))::Core.Const(Dict{Symbol, Vector{UnitRange{Int32}}})
│         (dict = (%9)())
│   %11 = rtbl::Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}}
│         (@_3 = Base.iterate(%11))
│   %13 = (@_3 === nothing)::Bool
│   %14 = Base.not_int(%13)::Bool
└──       goto #4 if not %14
2 ┄ %16 = @_3::Tuple{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}, Int64}
│   %17 = Core.getfield(%16, 1)::NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}
│         (chromo = Base.getproperty(%17, :chromo))
│         (start = Base.getproperty(%17, :start))
│         (stop = Base.getproperty(%17, :stop))
│   %21 = Core.getfield(%16, 2)::Int64
│   %22 = dict::Dict{Symbol, Vector{UnitRange{Int32}}}
│   %23 = Main.Symbol(chromo)::Symbol
│   %24 = (vtype::Core.Const(Vector{UnitRange{Int32}}))()::Vector{UnitRange{Int32}}
│   %25 = Main.get!(%22, %23, %24)::Vector{UnitRange{Int32}}
│   %26 = Main.asrange(start, stop)::UnitRange{Int32}
│         Main.push!(%25, %26)
│         (@_3 = Base.iterate(%11, %21))
│   %29 = (@_3 === nothing)::Bool
│   %30 = Base.not_int(%29)::Bool
└──       goto #4 if not %30
3 ─       goto #2
4 ┄       return dict
  • In refrngvecs, each of the values, a Vector{UnitRange}, should be sorted by the first element of the UnitRange.

  • Creating an IntervalTree requires the ranges to be sorted by first element and by last element when the first elements are equal.

  • Define a custom lt comparison for this.

let
  function lt(x::UnitRange{T}, y::UnitRange{T}) where {T}
    fx, fy = first(x), first(y)
    return fx == fy ? last(x) < last(y) : fx < fy
  end
  for v in values(refrngvecs)
    sort!(v; lt)
  end
end
refrngvecs  # note changes in refrngvecs[:chr01]
Dict{Symbol, Vector{UnitRange{Int32}}} with 24 entries:
  :chr21 => [5011799:5011874, 5012548:5012687, 5014386:5014471, 5016935:5017145…
  :chr15 => [19878555:19878668, 19878831:19879004, 19881201:19881307, 19882277:…
  :chr10 => [14061:14299, 14138:14299, 14497:14604, 16502:16544, 44712:44901, 4…
  :chr17 => [64099:65736, 65830:65887, 71366:71556, 75814:75878, 76723:76866, 8…
  :chr07 => [12704:12822, 19018:19172, 19619:19895, 20834:21029, 24314:24365, 2…
  :chr14 => [16057472:16057622, 18333726:18333900, 18333826:18333896, 18337973:…
  :chr08 => [64091:64175, 64269:64320, 72601:72673, 72617:72701, 78905:79244, 7…
  :chr12 => [12310:12358, 12740:12824, 13102:13201, 13370:13501, 14522:14944, 1…
  :chr18 => [11103:11595, 11191:11595, 13152:13354, 14195:14653, 14490:14653, 1…
  :chrX  => [253743:253846, 254937:255091, 276322:276394, 276324:276394, 276353…
  :chr13 => [18174010:18174103, 18174442:18174512, 18176018:18176170, 18177555:…
  :chr11 => [75780:76143, 86649:87586, 112967:113111, 113116:113174, 121258:121…
  :chr22 => [10736171:10736283, 10939388:10939423, 10940597:10940707, 10941691:…
  :chr03 => [23757:23812, 23968:24501, 53348:53692, 54293:54346, 195758:195914,…
  :chr19 => [60951:61894, 62113:66524, 63821:64213, 65051:65226, 65822:66047, 6…
  :chr05 => [58198:58915, 92151:92276, 113251:113448, 139483:140716, 140258:140…
  :chr06 => [95124:95454, 105919:106856, 131910:132117, 140211:140379, 142272:1…
  :chr20 => [87250:87359, 87710:87767, 96005:96533, 96005:97094, 142369:142686,…
  :chrY  => [2784749:2784853, 2786855:2787699, 2789827:2790328, 2827982:2828218…
  :chr04 => [49096:49956, 49554:50124, 53285:53491, 53286:53491, 53295:53491, 5…
  :chr02 => [38814:41627, 41220:41627, 41221:41627, 42809:42952, 45440:46385, 4…
  :chr01 => [11869:12227, 12010:12057, 12179:12227, 12613:12697, 12613:12721, 1…
  :chr09 => [12134:12190, 12291:12340, 12726:12834, 13088:13157, 13338:13487, 1…
  :chr16 => [11555:11908, 11861:11908, 12294:12378, 12294:12402, 12663:12733, 1…

A let/end block provides a local namespace. The custom lt comparison method will not be visible outside the scope of the block.

  • We will use the ranges on chromsome 1 for our timing benchmarks. The target for tests of intersection with a single target range will be the last range on chromosome 1 in “ex-rna.arrow”.
refrngvec01 = refrngvecs[:chr01]
tarrngvec01 = tarrngvecs[:chr01]
target = last(tarrngvec01)
182839365:182887745

Intersecting reference ranges with a target

  • Create a method to intersect a target range with a Vector{UnitRange} returning the intersections as another Vector{UnitRange}.

  • A common Julia programming idiom for such cases is to create two methods: one for intersect!, which modifies an argument that will hold the result, and one for intersect, which allocates the result then calls the intersect! method.

  • For the intersect! method we first empty! the result vector, which zeros the vector length but does not shrink the memory chunk allocated to the vector, then push! intersections onto it. If there are many calls to an intersect! method like this only a few will cause memory allocations.

  • The elements of our vectors of reference ranges, like refrngvec01, are sorted by their first element but globally the last elements do not have any special relationships. The best we can do to determine all intersections is a sequential scan of the sorted vector of ranges with an early exit when last(target) < first(vectorelement).

function Base.intersect!(
  result::Vector{UnitRange{T}},
  target::AbstractUnitRange{<:Integer},
  refs::Vector{UnitRange{T}},
) where {T}
  empty!(result)
  target = UnitRange{T}(target)  # coerce the type, if necessary
  lastt = last(target)
  for rng in refs
    lastt < first(rng) && break
    isect = intersect(target, rng)
    !isempty(isect) && push!(result, isect)
  end
  return result
end
function Base.intersect(
  target::AbstractUnitRange{<:Integer},
  refs::Vector{UnitRange{T}},
) where {T}
  return intersect!(similar(refs, 0), target, refs)
end
savedresult = intersect(target, refrngvec01)  # will use for comparisons
44-element Vector{UnitRange{Int32}}:
 182839369:182839456
 182839734:182839935
 182842545:182842677
 182843294:182843434
 182852233:182852344
 182853306:182853418
 182854030:182854142
 182854030:182854178
 182854041:182854178
 182855577:182855713
 182856532:182856572
 182856532:182856578
 182858104:182858240
 ⋮
 182880497:182880608
 182881264:182881425
 182881285:182881425
 182881520:182881645
 182881520:182881647
 182883139:182883216
 182883139:182883368
 182883320:182883368
 182883520:182883635
 182884613:182884813
 182887083:182887745
 182887083:182887745
result = similar(refrngvec01, 0)  # 0-length vector of same type
isequal(intersect!(result, target, refrngvec01), savedresult)
true

Dictionaries of RangeTrees and IntervalTrees

RangeTrees

refrngtrees = Dict(k => RangeNode(v) for (k, v) in refrngvecs)
rangetree01 = refrngtrees[:chr01]
treesize(rangetree01), treeheight(rangetree01), treebreadth(rangetree01)
(52789, 15, 20022)
print_tree(rangetree01; maxdepth = 3)
(110172080:110172489, 248937043)
├─ (37858488:37858539, 110172217)
│  ├─ (19100078:19100126, 37857709)
│  │  ├─ (7745845:7746091, 19099677)
│  │  │  ⋮
│  │  │  
│  │  └─ (27615689:27615844, 37857709)
│  │     ⋮
│  │     
│  └─ (62834038:62834116, 110172217)
│     ├─ (46194582:46194651, 62829176)
│     │  ⋮
│     │  
│     └─ (89633140:89633322, 110172217)
│        ⋮
│        
└─ (169855796:169855957, 248937043)
   ├─ (153545753:153545802, 169854964)
   │  ├─ (145901516:145901664, 153545518)
   │  │  ⋮
   │  │  
   │  └─ (157519743:157519770, 169854964)
   │     ⋮
   │     
   └─ (208028830:208030303, 248937043)
      ├─ (196793317:196793633, 208029042)
      │  ⋮
      │  
      └─ (228147615:228147699, 248937043)
         ⋮
         

A smaller example may help to understand how this type of interval tree. Consider the first 7 UnitRanges in refrngvecs[:chr01]

smallrngvec = refrngvecs[:chr01][1:15]
15-element Vector{UnitRange{Int32}}:
 11869:12227
 12010:12057
 12179:12227
 12613:12697
 12613:12721
 12975:13052
 13221:13374
 13221:14409
 13453:13670
 14404:14501
 15005:15038
 15796:15947
 16607:16765
 16858:17055
 17233:17368
rn = RangeNode(smallrngvec)
print_tree(rn)
(13221:14409, 17368)
├─ (12613:12697, 13374)
│  ├─ (12010:12057, 12227)
│  │  ├─ (11869:12227, 12227)
│  │  └─ (12179:12227, 12227)
│  └─ (12975:13052, 13374)
│     ├─ (12613:12721, 12721)
│     └─ (13221:13374, 13374)
└─ (15796:15947, 17368)
   ├─ (14404:14501, 15038)
   │  ├─ (13453:13670, 13670)
   │  └─ (15005:15038, 15038)
   └─ (16858:17055, 17368)
      ├─ (16607:16765, 16765)
      └─ (17233:17368, 17368)
  • The UnitRange in the root node is the 7th out of the 15 sorted ranges from which the tree was constructed.
  • Each of the nodes in the tree can have 0, 1, or 2 child nodes. Those with 0 children are called the “leaves” of the tree.
collect(Leaves(rn))
8-element Vector{RangeNode{Int32, Int32}}:
 (11869:12227, 12227)
 (12179:12227, 12227)
 (12613:12721, 12721)
 (13221:13374, 13374)
 (13453:13670, 13670)
 (15005:15038, 15038)
 (16607:16765, 16765)
 (17233:17368, 17368)
  • In addition to the UnitRange it represents, each RangeNode stores maxlast, the maximum value of last(rng) for any UnitRange in the tree rooted at this node. For a leaf maxlast is simply last of its UnitRange.
  • For other nodes, maxlast can be larger than last of its UnitRange.
  • In particular, for the root node maxlast is the maximum of all the last values of the UnitRanges that generated the tree.
maximum(last.(smallrngvec))
17368

RangeTrees.jl defines methods for generics like children, getroot, and nodevalue from AbstractTrees.jl and these allow for many other generics to be applied to a RangeNode.

children(rangetree01)
2-element Vector{RangeNode{Int32, Int32}}:
 (37858488:37858539, 110172217)
 (169855796:169855957, 248937043)

The root of the tree can be obtained from any node using getroot. The combination of getroot and children allows traversal of the tree.

getroot(first(children(first(children(rangetree01))))) # get the root from its grandchild
(110172080:110172489, 248937043)

Several methods for generic functions are exported from RangeTrees.jl

varinfo(RangeTrees)
name size summary
Leaves 40 bytes UnionAll
PostOrderDFS 40 bytes UnionAll
PreOrderDFS 80 bytes UnionAll
RangeNode 80 bytes UnionAll
RangeTrees 347.087 KiB Module
children 0 bytes children (generic function with 11 methods)
getroot 0 bytes getroot (generic function with 3 methods)
intersect! 0 bytes intersect! (generic function with 12 methods)
isroot 0 bytes isroot (generic function with 3 methods)
midrange 0 bytes midrange (generic function with 2 methods)
nodetype 0 bytes nodetype (generic function with 3 methods)
nodevalue 0 bytes nodevalue (generic function with 6 methods)
print_tree 0 bytes print_tree (generic function with 3 methods)
splitrange 0 bytes splitrange (generic function with 2 methods)
treebreadth 0 bytes treebreadth (generic function with 1 method)
treeheight 0 bytes treeheight (generic function with 1 method)
treesize 0 bytes treesize (generic function with 1 method)

but most of these are defined in AbstractTrees.jl.

  • Evaluating and storing maxlast in the nodes allows for overlap searchs to be truncated at nodes for which maxlast < first(target). The sorting by first allows for skipping the right subtree whenever last(target) < first(thisnode) (as in the intersect! method for Vector{UnitRange}).

  • intersect and intersect! methods are already defined in RangeTrees.jl.

  • Check that their results agree with the saved result.

isequal(intersect(target, rangetree01), savedresult)
true
isequal(intersect!(result, target, rangetree01), savedresult)
true

IntervalTrees

  • Create a dictionary of IntervalTrees. It is somewhat tedious to get the type of the result correct and we create a function to hide the details.
function toitrees(rngdict::Dict{S,Vector{UnitRange{T}}}) where {S,T}
  return Dict(
    k => IntervalTree{T,Interval{T}}(Interval.(v)) for (k, v) in rngdict
  )
end
refintvltrees = toitrees(refrngvecs)
intvltree01 = refintvltrees[:chr01]
show(intvltree01)
IntervalTree{Int32, Interval{Int32}}
(11869,12227)
(12010,12057)
(12179,12227)
⋮
(248917279,248917401)
(248917279,248919946)
(248936581,248937043)
  • Creating an intersect! method is also tedious because the package has its own Interval data type and defines intersect(itr::IntervalTree, (frst, lst)) to return an iterator of Intervals in the tree, not the intersection
Interval(target)
Interval{Int32}
(182839365,182887745)
function Base.intersect!(
  res::Vector{UnitRange{T}},
  target::AbstractUnitRange,
  refs::IntervalTree{T},
) where {T}
  empty!(res)
  firstt, lastt = first(target), last(target)
  for isect in intersect(refs, (firstt, lastt))
    push!(res, max(first(isect), firstt):min(last(isect), lastt))
  end
  return res
end
isequal(intersect!(result, target, intvltree01), savedresult)
true

Time for a shootout

Vector{UnitRange}

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = refrngvec01)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):  32.036 μs 58.521 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     33.893 μs                GC (median):    0.00%
 Time  (mean ± σ):   34.070 μs ± 914.027 ns   GC (mean ± σ):  0.00% ± 0.00%
  ▁▂▁▁   ▂▄▃▄▄▂    ▆█    ▆▇▆▂          ▁       ▂             ▂
  ████▁▁▁██████▁▃████▄▃▇████▆▅▆▇▇██████▇▆▃▄▃▃▁█▁▆▇▆▅▁▄▆▅▃▃▄ █
  32 μs         Histogram: log(frequency) by time      37.8 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark intersect(tar, ref) setup = (tar = target; ref = refrngvec01)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):  32.079 μs61.796 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     34.389 μs               GC (median):    0.00%
 Time  (mean ± σ):   34.767 μs ±  1.885 μs   GC (mean ± σ):  0.00% ± 0.00%
        ▆     █                                              
  ▁▂▁▁▁▁█▅▃▃▂▂█▆█▅▂▂▅█▂▂▂▃▅▃▂▂▂▂▂▁▁▁▇▄▁▁▁▁▂█▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁ ▂
  32.1 μs         Histogram: frequency by time        39.5 μs <
 Memory estimate: 1.92 KiB, allocs estimate: 4.

RangeTree

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = rangetree01)
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (minmax):  386.926 ns769.124 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     409.109 ns                GC (median):    0.00%
 Time  (mean ± σ):   411.822 ns ±  14.841 ns   GC (mean ± σ):  0.00% ± 0.00%
                     ▇█▃       ▂                                 
  ▁▁▂▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▅████▅▃▃▆██▅▄▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  387 ns           Histogram: frequency by time          446 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark intersect(tar, ref) setup = (tar = target; ref = rangetree01)
BenchmarkTools.Trial: 10000 samples with 197 evaluations.
 Range (minmax):  459.046 ns49.638 μs   GC (min … max):  0.00% … 98.68%
 Time  (median):     840.467 ns               GC (median):     0.00%
 Time  (mean ± σ):   847.085 ns ±  2.332 μs   GC (mean ± σ):  13.48% ±  4.82%
    ▁▃                                               █▁       
  ▅▇██▆▃▂▂▂▂▂▂▂▂▁▂▁▂▁▂▂▂▁▂▁▁▁▂▁▁▂▂▁▁▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▃▇██▆▄▃▂▂ ▃
  459 ns          Histogram: frequency by time          901 ns <
 Memory estimate: 1.92 KiB, allocs estimate: 4.

IntervalTree

@benchmark intersect!(res, tar, ref) setup =
  (res = result; tar = target; ref = intvltree01)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (minmax):  1.095 μs 2.577 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.171 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.173 μs ± 41.586 ns   GC (mean ± σ):  0.00% ± 0.00%
                ▂▄▇▆▁                                       
  ▁▁▂▂▂▃▃▄▃▄▄▄▆▇█████▆▄▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.1 μs         Histogram: frequency by time        1.35 μs <
 Memory estimate: 80 bytes, allocs estimate: 3.
  • In what follows we will use the RangeTree method for intersect! to obtain the intersections.

Determining coverage

  • The coverage of a target by a set of reference ranges is the proportion of the base pairs in the target that intersect with one or more of the reference ranges.

  • We need to somehow count the number of elements in the union of the ranges returned from intersect!.

  • This could be done using the union! method for a BitSet but that approach has two problems: it is comparatively slow and, in Julia versions up to 1.8.0-rc1, it can be wrong.

  • While creating these notes we discovered a bug in the union! method for a BitSet and a UnitRange.

  • There was a PR to fix it the next morning.

  • Versions of Julia prior to 1.8.0-rc2 can (and probably will) return incorrect values of coverage.

  • Replacing union!(bs, isect) by union!(bs, BitSet(isect)) avoids this “infelicity” at the expense of more memory usage and compute time.

  • There is a better method that takes advantage of the intersecting ranges being sorted by first

  • The idea is to “keep moving the goalposts”. When evaluating the coverage count, add the length of the current reference range’s intersection with only the part to the right of what has already been covered. The thing we know about the intersecting ranges is that each successive ranges’s first position is greater than or equal to the first position of all the ranges preceding it. That is, the ranges can’t “move left” as we iterate through them.

function coveragecount(
  target::AbstractUnitRange,
  isects::Vector{UnitRange{T}},
) where {T}
  leftpost, rightpost = T(first(target)), T(last(target))
  coverage = 0
  for isect in isects
    coverage += length(intersect(isect, leftpost:rightpost))
    leftpost = max(leftpost, last(isect) + one(T))
  end
  return coverage
end
coveragecount(target, savedresult)
5639

Iterating over a collection of targets

function overlaps(targets::Vector{UnitRange{T}}, refs::RangeNode{T}) where {T}
  nover = similar(targets, Int)
  covcount = similar(targets, T)
  result = empty!(similar(targets, ceil(Int, sqrt(treesize(refs)))))
  @inbounds for (i, tar) in enumerate(targets)
    nover[i] = length(intersect!(result, tar, refs))
    covcount[i] = coveragecount(tar, result)
  end
  DataFrame(targets = targets, nover = nover, covcount = covcount)
end
overlaps(tarrngvec01, rangetree01)

453,114 rows × 3 columns

targetsnovercovcount
UnitRang…Int64Int32
1165655371:1656556203250
284498368:845061726676
344777749:4477863811837
423692611:2369641891398
523692608:2369641891401
623962516:239634501935
723692686:2369641991324
8114717978:114720613151127
9116392931:116404775192762
1025567642:2556869521054
1123961575:239624962922
128861023:8870500141777
1396678807:966794461573
1437536543:37554276124139
15155308882:155320663565081
1665230872:6523212721256
1723692609:2369639991381
18181089237:18109082611590
19185118995:18512070741602
20203861621:20387008115987
21153990779:15399212710903
2242658437:42676757191431
232309508:23101131606
24173863895:173866795422631
25209768622:209782307336936
2645511237:45521873152936
278861009:88612872279
2843974941:43978232323051
2989052317:890526231305
30109593271:1095935121242
@benchmark overlaps(targets, refs) setup =
  (targets = tarrngvec01; refs = rangetree01)
BenchmarkTools.Trial: 40 samples with 1 evaluation.
 Range (minmax):  123.388 ms134.102 ms   GC (min … max): 0.00% … 5.07%
 Time  (median):     125.436 ms                GC (median):    0.00%
 Time  (mean ± σ):   126.632 ms ±   2.752 ms   GC (mean ± σ):  0.80% ± 1.85%
          ▃▁▁█           ▃                                       
  ▄▁▁▄▁▁▇▇████▁▇▁▄▁▁▁▁▄█▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▄▄▄▁▁▁▁▁▁▄ ▁
  123 ms           Histogram: frequency by time          134 ms <
 Memory estimate: 26.23 MiB, allocs estimate: 158533.

The whole shootin’ match

  • The computations on different chromosome are independent of each other and can be assigned to different threads when Julia is started with multiple threads.
function overlaps(
  tardict::Dict{Symbol,Vector{UnitRange{T}}},
  refdict::Dict{Symbol,RangeNode{T,R}},
) where {T,R}
  matchedkeys = intersect(keys(tardict), keys(refdict))
  result = Dict(k => DataFrame() for k in matchedkeys) # pre-assign key/value pairs
  @sync for k in matchedkeys
    Threads.@spawn result[k] = overlaps(tardict[k], refdict[k])
  end
  return result
end
bigresult = overlaps(tarrngvecs, refrngtrees);
bigresult[:chrX]

150,330 rows × 3 columns

targetsnovercovcount
UnitRang…Int64Int32
172272603:72277210151955
2119786512:11979160572123
377826515:7782907932565
479241981:79363480122619
512975122:1297722781691
6119786513:11979161372130
712975119:1297722781694
824465328:24534418162115
912976909:129772173309
10154400549:15440087512327
1112976881:129772273347
1265738963:6574191522412
13154398500:154400899271950
14119786512:11979164072158
1524077771:240779571187
1612975119:1297722481691
1712976265:129772275548
1812975128:1297721281670
1912976915:129772233309
2012975126:1297722781687
2178049074:780499222849
2277899484:779053837988
231386273:13866851413
2475053280:75076550242163
25119584161:1195844212261
26119786525:11979161572120
2772272612:72275658111784
28153907379:153913831382727
2912976976:129772253250
3015384819:15491231152184
  • The result is initialized with all the keys that will be in the result and with values that are empty DataFrames.
  • Individual tasks that are @spawned allocate their DataFrame value and merely update the pointer in result to that value, eliminating the possibility of collisions with other tasks.
@benchmark overlaps(targets, refs) setup =
  (targets = tarrngvecs; refs = refrngtrees)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (minmax):  1.304 s 1.319 s   GC (min … max): 0.63% … 0.69%
 Time  (median):     1.314 s              GC (median):    0.66%
 Time  (mean ± σ):   1.313 s ± 6.888 ms   GC (mean ± σ):  0.67% ± 0.08%
  █                        █                  █          █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  1.3 s         Histogram: frequency by time        1.32 s <
 Memory estimate: 278.21 MiB, allocs estimate: 1724535.

Conclusions

  • The ability to work in the REPL (or VS Code or Jupyter notebooks) encourages iterative refinement of algorithms.
  • “Trust but verify” - when making a change or introducing new methods it helps to have results from previous methods for comparison. In general, continuous integration (CI) testing is straightforward for Julia packages and is strongly encouraged.
  • There are many tools for benchmarking function execution or storage allocation, allowing a developer to concentrate on where the “real problem” is.
  • In certain cases, enhancements like multi-threading can be achieved with very little effort.

Version information

versioninfo()
Julia Version 1.8.0-rc3
Commit 33f19bcbd25 (2022-07-13 19:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores