class: middle, centre # Open source tools for working with data --- # What kind of data? The focus will be on tabular data: a set of named columns where each column has a consistent type (string, integer, boolean, date, floating point, complex number, custom type). In this format, a row correspond to an observation and the various columns are observed quantities. -- * Widely used format in data science languages R and Python (data.frame, data.table, pandas). -- * Input format of choice for many analysis packages (e.g. generalized linear model toolkits). -- * Preferred (only?) supported data format in online databases such as SQL. --- # Introduction Open source software development for research: -- * Efficient format for tabular data -- * User-friendly tools for tabular data manipulations -- * Plotting facilities for tabular data (esp. grouped data) -- * Custom array type to incorporate photometry or recordings in tables -- * Toolkit to build web apps (either locally or on a server) following "data flow" --- # Will it work on my data? Even though in the presentation I will mainly use publicly available example datasets, this toolkit has been used in the lab for: * freely-moving and head-fixed rodent behavior * bulk imaging * electro-physiology * human behavior * questionnaire data * fly behavior --- # The Julia programming language -- * Modern, open-source and free programming language -- * Easy to use (interactive console, little "boilerplate") but good performance -- * Rich type system and multiple dispatch allow for fast custom data structures -- * Metaprogramming: Julia can modify its own code before running it --- # StructArrays: flexibly switching from row-based to column-based ```julia using StructArrays s = StructArray(a=1:3, b=["x", "y", "z"]) s[1] # Behaves like an array of structures ``` ``` (a = 1, b = "x") ``` -- ```julia map(row -> exp(row.a), s) # Behaves like an array of structures ``` ``` 3-element Array{Float64,1}: 2.718281828459045 7.38905609893065 20.085536923187668 ``` -- ```julia fieldarrays(s) # Data is stored as columns ``` ``` (a = 1:3, b = ["x", "y", "z"]) ``` --- # StructArrays: technical highlights -- * For immutable structs (`namedtuple` in Python, non-existent in Matlab) of "plain data types" (i.e. no pointers), row iteration does not allocate -- ```julia a = rand(Float32, 26) b = rand(Bool, 26) c = 'a':'z' s = StructArray(a = a, b = b, c = c) @btime $s[3] ``` ``` 3.936 ns (0 allocations: 0 bytes) (a = 0.73948455f0, b = true, c = 'c') ``` -- * Arbitrary column array types are supported: * distributed arrays for parallel computing on a cluster * cuda arrays to run operations on cuda kernels -- ```julia using CuArrays a = CuArray(rand(Float32, 10)) b = CuArray(rand(Bool, 10)) StructArray(a = a, b = b) ``` --- # Technical highlights: why does it matter? * When the data is too big, in memory array types will no longer work. * Some operations on data are "embarrassingly parallel": for example applying a function to each row in the table. Important to be able to compute in parallel (GPU or cluster). Julia allows to pass from single threaded computing on a single core to parallel computing on a cluster (like Pandas plus Dask) or on the GPU (exciting new direction, not fully worked out for tabular datasets). --- # Working with tabular data ```julia using JuliaDBMeta iris = loadtable("/home/pietro/Data/examples/iris.csv") ``` ``` Table with 150 rows, 5 columns: SepalLength SepalWidth PetalLength PetalWidth Species ───────────────────────────────────────────────────────────── 5.1 3.5 1.4 0.2 "setosa" 4.9 3.0 1.4 0.2 "setosa" 4.7 3.2 1.3 0.2 "setosa" 4.6 3.1 1.5 0.2 "setosa" 5.0 3.6 1.4 0.2 "setosa" 5.4 3.9 1.7 0.4 "setosa" 4.6 3.4 1.4 0.3 "setosa" 5.0 3.4 1.5 0.2 "setosa" 4.4 2.9 1.4 0.2 "setosa" ⋮ 5.8 2.7 5.1 1.9 "virginica" 6.8 3.2 5.9 2.3 "virginica" 6.7 3.3 5.7 2.5 "virginica" 6.7 3.0 5.2 2.3 "virginica" 6.3 2.5 5.0 1.9 "virginica" 6.5 3.0 5.2 2.0 "virginica" 6.2 3.4 5.4 2.3 "virginica" 5.9 3.0 5.1 1.8 "virginica" ``` --- # Working with columns External packages implement normal tabular data operations on `StructArrays` (map, filter, join, groupby, etc...) as well as macros to use symbols as if they were columns: ```julia @with iris mean(:SepalLength) / mean(:SepalWidth) ``` ``` 1.9112516354121238 ``` -- ```julia @groupby iris :Species (Mean = mean(:SepalLength), STD = std(:SepalWidth)) ``` ``` Table with 3 rows, 3 columns: Species Mean STD ───────────────────────────── "setosa" 5.006 0.379064 "versicolor" 5.936 0.313798 "virginica" 6.588 0.322497 ``` -- Pop quiz: this simple operation is very common with our data, how many lines of code would it take in the format you are using? What if you were grouping by more than one column? --- # Working with rows ```julia @apply iris begin @transform (Ratio = :SepalLength / :SepalWidth, Large = :PetalWidth > 2) @filter :Ratio > 2 && :Species != "versicolor" end ``` ``` Table with 41 rows, 7 columns: SepalLength SepalWidth PetalLength PetalWidth Species Ratio Large ───────────────────────────────────────────────────────────────────────────── 5.8 2.7 5.1 1.9 "virginica" 2.14815 false 7.1 3.0 5.9 2.1 "virginica" 2.36667 true 6.3 2.9 5.6 1.8 "virginica" 2.17241 false 6.5 3.0 5.8 2.2 "virginica" 2.16667 true 7.6 3.0 6.6 2.1 "virginica" 2.53333 true 7.3 2.9 6.3 1.8 "virginica" 2.51724 false 6.7 2.5 5.8 1.8 "virginica" 2.68 false 6.5 3.2 5.1 2.0 "virginica" 2.03125 false 6.4 2.7 5.3 1.9 "virginica" 2.37037 false ⋮ 6.7 3.1 5.6 2.4 "virginica" 2.16129 true 6.9 3.1 5.1 2.3 "virginica" 2.22581 true 5.8 2.7 5.1 1.9 "virginica" 2.14815 false 6.8 3.2 5.9 2.3 "virginica" 2.125 true 6.7 3.3 5.7 2.5 "virginica" 2.0303 true 6.7 3.0 5.2 2.3 "virginica" 2.23333 true 6.3 2.5 5.0 1.9 "virginica" 2.52 false 6.5 3.0 5.2 2.0 "virginica" 2.16667 false ``` --- # Plotting can be part of the pipeline ```julia using StatsPlots @apply iris begin @transform (Ratio = :SepalLength/:SepalWidth, Sum = :SepalLength+:SepalWidth) @df corrplot([:Ratio :Sum]) end ``` ![](../corrplot.svg) --- # Plotting grouped data ```julia school = loadtable("/home/pietro/Data/examples/school.csv") ``` ``` Table with 7185 rows, 8 columns: School Minrty Sx SSS MAch MeanSES Sector CSES ────────────────────────────────────────────────────────────────────────── 1224 "No" "Female" -1.528 5.876 -0.434383 "Public" -1.09362 1224 "No" "Female" -0.588 19.708 -0.434383 "Public" -0.153617 1224 "No" "Male" -0.528 20.349 -0.434383 "Public" -0.093617 1224 "No" "Male" -0.668 8.781 -0.434383 "Public" -0.233617 1224 "No" "Male" -0.158 17.898 -0.434383 "Public" 0.276383 1224 "No" "Male" 0.022 4.583 -0.434383 "Public" 0.456383 1224 "No" "Female" -0.618 -2.832 -0.434383 "Public" -0.183617 1224 "No" "Male" -0.998 0.523 -0.434383 "Public" -0.563617 1224 "No" "Female" -0.888 1.527 -0.434383 "Public" -0.453617 ⋮ 9586 "No" "Female" 1.212 15.26 0.621153 "Catholic" 0.590847 9586 "No" "Female" 1.022 22.78 0.621153 "Catholic" 0.400847 9586 "Yes" "Female" 1.612 20.967 0.621153 "Catholic" 0.990847 9586 "No" "Female" 1.512 20.402 0.621153 "Catholic" 0.890847 9586 "No" "Female" -0.038 14.794 0.621153 "Catholic" -0.659153 9586 "No" "Female" 1.332 19.641 0.621153 "Catholic" 0.710847 9586 "No" "Female" -0.008 16.241 0.621153 "Catholic" -0.629153 9586 "No" "Female" 0.792 22.733 0.621153 "Catholic" 0.170847 ``` --- # Plotting grouped data ```julia using GroupedErrors @> school begin @splitby _.Sx @across _.School @x _.MAch @y :cumulative @plot plot(xlabel = "MAch", ylabel = "CDF") end ``` ![](../cumulative.svg) --- # Adding neural data to a table: ShiftedArrays In a typical dataset, recordings and behavior are mismatched: * Behavioral data => hundreds of rows (trials) * Neural data => hundreds of thousands of frames (photometry) -- The package ShiftedArrays addresses this issue by creating a custom array type which is a normal array with a shift: -- ```julia using ShiftedArrays v = rand(10) lead(v, 3) ``` ``` 10-element ShiftedArrays.ShiftedArray{Float64,Missing,1,Array{Float64,1}}: 0.24827770861604015 0.9677212954468728 0.5291375982107958 0.24384039893632092 0.36103265049669164 0.037775409555429906 0.9030809622694929 missing missing missing ``` --- # Adding neural data to a table: ShiftedArrays The underlying data is shared, so creating a `ShiftedArray` is very cheap: ```julia v = rand(1_000_000) @benchmark lead($v, 100) ``` ``` BenchmarkTools.Trial: memory estimate: 32 bytes allocs estimate: 1 -------------- minimum time: 4.456 ns (0.00% GC) median time: 5.659 ns (0.00% GC) mean time: 11.089 ns (39.80% GC) maximum time: 3.829 μs (99.45% GC) -------------- samples: 10000 evals/sample: 1000 ``` --- # Adding neural data to a table: ShiftedArrays A new column can be added by simply putting shifted copies of the recordings: ```julia photometry = rand(100) timestamps = [12, 23, 48, 97] shiftedvecs = [lead(photometry, time) for time in timestamps] ShiftedArrays.to_array(shiftedvecs, -5:5) ``` ``` 11×4 Array{Union{Missing, Float64},2}: 0.0399125 0.0772857 0.414822 0.811259 0.662218 0.543966 0.246542 0.473125 0.557359 0.0309693 0.130107 0.903247 0.542658 0.567708 0.76579 0.185107 0.413107 0.360279 0.371334 0.807062 0.928122 0.0397899 0.0513499 0.513011 0.878856 0.954279 0.392092 0.619644 0.342331 0.480265 0.459927 0.110585 0.0987968 0.189936 0.676036 0.991137 0.828208 0.375395 0.819465 missing 0.99039 0.217584 0.533125 missing ``` --- # Computing aligned summary statistics ShiftedArrays also provides utility function to reduce the data: ```julia reduce_vec(mean, shiftedvecs, -5:5) ``` ``` 11-element Array{Float64,1}: 0.33581988018640335 0.4814627406914195 0.4054206499806484 0.5153156747118339 0.4879453823859449 0.383068069702968 0.7112178393162004 0.34827712364683655 0.48897635705722026 0.6743559888178683 0.5803664326670401 ``` --- # Computing aligned summary statistics ```julia reduce_vec(mean, shiftedvecs, -5:5) ``` ![](../figures/vectors.svg) --- # Computing aligned summary statistics A similar logic can be applied to matrices (or higher dimensional tensors). For example, one dimension (shifted) is time and another dimension (non shifted) is cell identity: ![](../figures/matrices.svg) --- # Plotting support provided by GroupedErrors ```julia @> df begin @splitby _.treatment @across _.subject @x -100:100 :discrete @y _.signal @plot plot() :ribbon end ``` ![](../photometry.svg) --- # Easy web based interfaces: Interact * Create web-based interactive interfaces with little code * Works (in theory) locally with electron, from a server, on the jupyter notebook and in atom
--- ### Widgets and logic: ```julia using StatsPlots, Interact color = colorpicker() npoints = slider(10:100, label = "npoints") markersize = slider(3:10, label = "markersize") label = textbox("insert legend entry") plt = Interact.@map scatter( rand(&npoints), rand(&npoints), color = &color, markersize = &markersize, label = &label ) ``` -- ### Layout: ```julia ui = vbox( color, npoints, markersize, label, plt ) ``` --- # Deployment Locally: ```julia using Blink w = Window() body!(w, ui) ``` -- In the browser (for data sharing / interactive presentations either in the lab or in big projects like IBL): ```julia using Mux WebIO.webio_serve(page("/", req -> ui)) ``` Caveat: better to put code in a function to serve independent widgets to different users connecting to the same site. --- # Simple app for data analysis The same logic can be applied to create an app to do basic analysis on a table. In a few lines of code create: * a filepicker to select data * a set of filters to subselect (it adjusts automatically to the table provided) * an editor for custom data operations * a visual way to select what plot to do All organized in separate tabs. --- ### Widgets and logic: ```julia filtered_data = selectors(input_data) edited_data = dataeditor(filtered_data) viewer = dataviewer(edited_data) ``` -- ### Layout: ```julia tabs = OrderedDict( :filtered_data => filtered_data, :edited_data => edited_data, :viewer => viewer ) ui = tabulator(tabs) ``` --- # More interactive plotting A newer plotting framework ([Makie](http://juliaplots.org/MakieGallery.jl/stable/index.html) by Simon Danisch: Julia + OpenGL) provides enhanced interactivity in two ways: * Excellent rendering performance (interactive speed with large datasets) * The plot and the UI controls can share signals. -- **Disclaimer**: I've ported the StatsPlots package to StatsMakie but there are still some quirks to iron out before I can switch to using it exclusively. --- # Eye catching demos (in house): Makie for whole brain neural activity (cFOS)
Data and classification from Baylor Brangers and Diogo Matias --- # Eye catching demos (in the wild): combining Makie and Interact
Video Credits: George Datseris and JuliaDynamics organization --- # References [StructArrays](https://github.com/piever/StructArrays.jl) [JuliaDBMeta](https://piever.github.io/JuliaDBMeta.jl/latest/) [StatsPlots](https://github.com/JuliaPlots/StatsPlots.jl) [GroupedErrors](https://github.com/piever/GroupedErrors.jl) [ShiftedArrays](https://github.com/piever/ShiftedArrays.jl) [Interact](https://github.com/JuliaGizmos/Interact.jl) [TableWidgets](https://github.com/piever/TableWidgets.jl) [Makie](http://juliaplots.org/MakieGallery.jl/stable/index.html) [StatsMakie](https://github.com/JuliaPlots/StatsMakie.jl)