POST
Manipulating data: JuliaDB
JuliaDB tutorial
Introduction
From now on I'll focus on more advanced Julia tutorials. Here I'll talk about working with data using the JuliaDB package to reproduce a well known tutorial. This tutorial is available as a Jupyter notebook here.
Getting the data
The data is some example flight dataset that you can find here.
Simply open the link and choose Save as
from the File
menu in your browser to save the data to a folder on your computer.
Loading the data
Loading a csv file is straightforward with JuliaDB:
using JuliaDB, IndexedTables
flights = loadtable("/home/pietro/Documents/testdata/hflights.csv");
Of course, replace the path with the location of the dataset you have just downloaded.
Filtering the data
In order to select only rows matching certain criteria, use the filter
function:
filter(i -> (i.Month == 1) && (i.DayofMonth == 1), flights);
To test if one of two conditions is verified:
filter(i -> (i.UniqueCarrier == "AA") || (i.UniqueCarrier == "UA"), flights)
# in this case, you can simply test whether the `UniqueCarrier` is in a given list:
filter(i -> i.UniqueCarrier in ["AA", "UA"], flights);
Select: pick columns by name
You can use the select
function to select a subset of columns:
select(flights, (:DepTime, :ArrTime, :FlightNum))
Table with 227496 rows, 3 columns:
DepTime ArrTime FlightNum
───────────────────────────
1400 1500 428
1401 1501 428
1352 1502 428
1403 1513 428
1405 1507 428
1359 1503 428
1359 1509 428
1355 1454 428
1443 1554 428
1443 1553 428
1429 1539 428
1419 1515 428
⋮
1939 2119 124
556 745 280
1026 1208 782
1611 1746 1050
758 1051 201
1307 1600 471
1818 2111 1191
2047 2334 1674
912 1031 127
656 812 621
1600 1713 1597
Let's select all columns between :Year
and :Month
as well as all columns containing "Taxi" or "Delay" in their name. Between
selects columns between two specified extremes, passing a function filters column names by that function and All
takes the union of all selectors (or all columns, if no selector is specified).
select(flights, All(Between(:Year, :DayofMonth), i -> contains(string(i), "Taxi"), i -> contains(string(i), "Delay")))
Table with 227496 rows, 7 columns:
Year Month DayofMonth TaxiIn TaxiOut ArrDelay DepDelay
────────────────────────────────────────────────────────────
2011 1 1 7 13 -10 0
2011 1 2 6 9 -9 1
2011 1 3 5 17 -8 -8
2011 1 4 9 22 3 3
2011 1 5 9 9 -3 5
2011 1 6 6 13 -7 -1
2011 1 7 12 15 -1 -1
2011 1 8 7 12 -16 -5
2011 1 9 8 22 44 43
2011 1 10 6 19 43 43
2011 1 11 8 20 29 29
2011 1 12 4 11 5 19
⋮
2011 12 6 4 15 14 39
2011 12 6 13 9 -10 -4
2011 12 6 4 12 -12 1
2011 12 6 3 9 -9 16
2011 12 6 3 10 -4 -2
2011 12 6 5 10 0 7
2011 12 6 5 11 -9 8
2011 12 6 4 9 4 7
2011 12 6 4 14 -4 -3
2011 12 6 3 9 -13 -4
2011 12 6 3 11 -12 0
Applying several operations
If one wants to apply several operations one after the other, there are two main approaches:
- nesting
- piping
Let's assume we want to select UniqueCarrier
and DepDelay
columns and filter for delays over 60 minutes. The nesting approach would be:
filter(i -> i.DepDelay > 60, select(flights, (:UniqueCarrier, :DepDelay)))
Table with 10242 rows, 2 columns:
UniqueCarrier DepDelay
───────────────────────
"AA" 90
"AA" 67
"AA" 74
"AA" 125
"AA" 82
"AA" 99
"AA" 70
"AA" 61
"AA" 74
"AS" 73
"B6" 136
"B6" 68
⋮
"WN" 129
"WN" 61
"WN" 70
"WN" 76
"WN" 63
"WN" 144
"WN" 117
"WN" 124
"WN" 72
"WN" 70
"WN" 78
For piping, we'll use the excellent Lazy package.
import Lazy
Lazy.@as x flights begin
select(x, (:UniqueCarrier, :DepDelay))
filter(i -> i.DepDelay > 60, x)
end
Table with 10242 rows, 2 columns:
UniqueCarrier DepDelay
───────────────────────
"AA" 90
"AA" 67
"AA" 74
"AA" 125
"AA" 82
"AA" 99
"AA" 70
"AA" 61
"AA" 74
"AS" 73
"B6" 136
"B6" 68
⋮
"WN" 129
"WN" 61
"WN" 70
"WN" 76
"WN" 63
"WN" 144
"WN" 117
"WN" 124
"WN" 72
"WN" 70
"WN" 78
where the variable x
denotes our data at each stage. At the beginning it is flights
, then it only has the two relevant columns and, at the last step, it is filtered.
Reorder rows
Select UniqueCarrier
and DepDelay
columns and sort by DepDelay
:
sort(flights, :DepDelay, select = (:UniqueCarrier, :DepDelay))
Table with 227496 rows, 2 columns:
UniqueCarrier DepDelay
───────────────────────
"OO" -33
"MQ" -23
"XE" -19
"XE" -19
"CO" -18
"EV" -18
"XE" -17
"CO" -17
"XE" -17
"MQ" -17
"XE" -17
"DL" -17
⋮
"US" #NA
"US" #NA
"US" #NA
"WN" #NA
"WN" #NA
"WN" #NA
"WN" #NA
"WN" #NA
"WN" #NA
"WN" #NA
"WN" #NA
or, in reverse order:
sort(flights, :DepDelay, select = (:UniqueCarrier, :DepDelay), rev = true)
Apply a function row by row
To apply a function row by row, use map
: the first argument is the anonymous function, the second is the dataset.
speed = map(i -> i.Distance / i.AirTime * 60, flights)
227496-element DataValues.DataValueArray{Float64,1}:
336.0
298.667
280.0
344.615
305.455
298.667
312.558
336.0
327.805
298.667
320.0
327.805
305.455
⋮
261.818
508.889
473.793
479.302
496.627
468.6
478.163
483.093
498.511
445.574
424.688
460.678
Add new variables
Use the pushcol
function to add a column to an existing dataset:
pushcol(flights, :Speed, speed);
If you need to add the new column to the existing dataset:
flights = pushcol(flights, :Speed, speed);
Reduce variables to values
To get the average delay, we first filter away datapoints where ArrDelay
is missing, then group by :Dest
, select :ArrDelay
and compute the mean:
groupby(@NT(avg_delay = mean∘dropna), flights, :Dest, select = :ArrDelay)
Table with 116 rows, 2 columns:
Dest avg_delay
────────────────
"ABQ" 7.22626
"AEX" 5.83944
"AGS" 4.0
"AMA" 6.8401
"ANC" 26.0806
"ASE" 6.79464
"ATL" 8.23325
"AUS" 7.44872
"AVL" 9.97399
"BFL" -13.1988
"BHM" 8.69583
"BKG" -16.2336
⋮
"SJU" 11.5464
"SLC" 1.10485
"SMF" 4.66271
"SNA" 0.35801
"STL" 7.45488
"TPA" 4.88038
"TUL" 6.35171
"TUS" 7.80168
"TYS" 11.3659
"VPS" 12.4572
"XNA" 6.89628
Performance tip
If you'll group often by the same variable, you can sort your data by that variable at once to optimize future computations.
sortedflights = reindex(flights, :Dest)
Table with 227496 rows, 22 columns:
Columns:
# colname type
────────────────────────────────────────────────────
1 Dest String
2 Year Int64
3 Month Int64
4 DayofMonth Int64
5 DayOfWeek Int64
6 DepTime DataValues.DataValue{Int64}
7 ArrTime DataValues.DataValue{Int64}
8 UniqueCarrier String
9 FlightNum Int64
10 TailNum String
11 ActualElapsedTime DataValues.DataValue{Int64}
12 AirTime DataValues.DataValue{Int64}
13 ArrDelay DataValues.DataValue{Int64}
14 DepDelay DataValues.DataValue{Int64}
15 Origin String
16 Distance Int64
17 TaxiIn DataValues.DataValue{Int64}
18 TaxiOut DataValues.DataValue{Int64}
19 Cancelled Int64
20 CancellationCode String
21 Diverted Int64
22 Speed DataValues.DataValue{Float64}
using BenchmarkTools
println("Presorted timing:")
@benchmark groupby(@NT(avg_delay = mean∘dropna), sortedflights, select = :ArrDelay)
Presorted timing:
BenchmarkTools.Trial:
memory estimate: 3.96 MiB
allocs estimate: 2189
--------------
minimum time: 7.407 ms (0.00% GC)
median time: 7.892 ms (0.00% GC)
mean time: 8.167 ms (2.90% GC)
maximum time: 10.980 ms (13.07% GC)
--------------
samples: 612
evals/sample: 1
println("Non presorted timing:")
@benchmark groupby(@NT(avg_delay = mean∘dropna), flights, :Dest, select = :ArrDelay)
Non presorted timing:
BenchmarkTools.Trial:
memory estimate: 7.44 MiB
allocs estimate: 2353
--------------
minimum time: 112.555 ms (0.00% GC)
median time: 114.339 ms (0.00% GC)
mean time: 115.784 ms (0.33% GC)
maximum time: 130.845 ms (0.00% GC)
--------------
samples: 44
evals/sample: 1
Using summarize
, we can summarize several columns at the same time:
summarize(mean∘dropna, flights, :Dest, select = (:Cancelled, :Diverted))
# For each carrier, calculate the minimum and maximum arrival and departure delays:
cols = Tuple(find(i -> contains(string(i), "Delay"), colnames(flights)))
summarize(@NT(min = minimum∘dropna, max = maximum∘dropna), flights, :UniqueCarrier, select = cols)
Table with 15 rows, 5 columns:
UniqueCarrier ArrDelay_min DepDelay_min ArrDelay_max DepDelay_max
─────────────────────────────────────────────────────────────────────
"AA" -39 -15 978 970
"AS" -43 -15 183 172
"B6" -44 -14 335 310
"CO" -55 -18 957 981
"DL" -32 -17 701 730
"EV" -40 -18 469 479
"F9" -24 -15 277 275
"FL" -30 -14 500 507
"MQ" -38 -23 918 931
"OO" -57 -33 380 360
"UA" -47 -11 861 869
"US" -42 -17 433 425
"WN" -44 -10 499 548
"XE" -70 -19 634 628
"YV" -32 -11 72 54
For each day of the year, count the total number of flights and sort in descending order:
Lazy.@as x flights begin
groupby(length, x, :DayofMonth)
sort(x, :length, rev = true)
end
Table with 31 rows, 2 columns:
DayofMonth length
──────────────────
28 7777
27 7717
21 7698
14 7694
7 7621
18 7613
6 7606
20 7599
11 7578
13 7546
10 7541
17 7537
⋮
25 7406
16 7389
8 7366
12 7301
4 7297
19 7295
24 7234
5 7223
30 6728
29 6697
31 4339
For each destination, count the total number of flights and the number of distinct planes that flew there
groupby(@NT(flight_count = length, plane_count = length∘union), flights, :Dest, select = :TailNum)
Table with 116 rows, 3 columns:
Dest flight_count plane_count
────────────────────────────────
"ABQ" 2812 716
"AEX" 724 215
"AGS" 1 1
"AMA" 1297 158
"ANC" 125 38
"ASE" 125 60
"ATL" 7886 983
"AUS" 5022 1015
"AVL" 350 142
"BFL" 504 70
"BHM" 2736 616
"BKG" 110 63
⋮
"SJU" 391 115
"SLC" 2033 368
"SMF" 1014 184
"SNA" 1661 67
"STL" 2509 788
"TPA" 3085 697
"TUL" 2924 771
"TUS" 1565 226
"TYS" 1210 227
"VPS" 880 224
"XNA" 1172 177
Window functions
In the previous section, we always applied functions that reduced a table or vector to a single value.
Window functions instead take a vector and return a vector of the same length, and can also be used to
manipulate data. For example we can rank, within each UniqueCarrier
, how much
delay a given flight had and figure out the day and month with the two greatest delays:
using StatsBase
fc = filter(t->!isnull(t.DepDelay), flights)
gfc = groupby(fc, :UniqueCarrier, select = (:Month, :DayofMonth, :DepDelay), flatten = true) do dd
rks = ordinalrank(column(dd, :DepDelay), rev = true)
sort(dd[rks .<= 2], by = i -> i.DepDelay, rev = true)
end
Table with 30 rows, 4 columns:
UniqueCarrier Month DayofMonth DepDelay
──────────────────────────────────────────
"AA" 12 12 970
"AA" 11 19 677
"AS" 2 28 172
"AS" 7 6 138
"B6" 10 29 310
"B6" 8 19 283
"CO" 8 1 981
"CO" 1 20 780
"DL" 10 25 730
"DL" 4 5 497
"EV" 6 25 479
"EV" 1 5 465
⋮
"OO" 4 4 343
"UA" 6 21 869
"UA" 9 18 588
"US" 4 19 425
"US" 8 26 277
"WN" 4 8 548
"WN" 9 29 503
"XE" 12 29 628
"XE" 12 29 511
"YV" 4 22 54
"YV" 4 30 46
Though in this case, it would have been simpler to use Julia partial sorting:
groupby(fc, :UniqueCarrier, select = (:Month, :DayofMonth, :DepDelay), flatten = true) do dd
select(dd, 1:2, by = i -> i.DepDelay, rev = true)
end
Table with 30 rows, 4 columns:
UniqueCarrier Month DayofMonth DepDelay
──────────────────────────────────────────
"AA" 12 12 970
"AA" 11 19 677
"AS" 2 28 172
"AS" 7 6 138
"B6" 10 29 310
"B6" 8 19 283
"CO" 8 1 981
"CO" 1 20 780
"DL" 10 25 730
"DL" 4 5 497
"EV" 6 25 479
"EV" 1 5 465
⋮
"OO" 4 4 343
"UA" 6 21 869
"UA" 9 18 588
"US" 4 19 425
"US" 8 26 277
"WN" 4 8 548
"WN" 9 29 503
"XE" 12 29 628
"XE" 12 29 511
"YV" 4 22 54
"YV" 4 30 46
For each month, calculate the number of flights and the change from the previous month
using ShiftedArrays
y = groupby(length, flights, :Month)
lengths = columns(y, :length)
pushcol(y, :change, lengths .- lag(lengths))
Table with 12 rows, 3 columns:
Month length change
─────────────────────
1 18910 NaN
2 17128 -1782
3 19470 2342
4 18593 -877
5 19172 579
6 19600 428
7 20548 948
8 20176 -372
9 18065 -2111
10 18696 631
11 18021 -675
12 19117 1096
Warning
missing
(the official Julia way of representing missing data) has not yet been adopted by JuliaDB, so using ShiftedArrays in combination with JuliaDB may be slightly troublesome in Julia 0.6. The situation should be solved in Julia 0.7, where the adoption of missing
should become more widespread. You can use a different default value with ShiftedArrays (for example, with an Array
of Float64
you could do:
v = [1.2, 2.3, 3.4]
lag(v, default = NaN)
3-element ShiftedArrays.ShiftedArray{Float64,Float64,1,Array{Float64,1}}:
NaN
1.2
2.3
Visualizing your data
The StatPlots and GroupedErrors package as well as native plotting recipes from JuliaDB using OnlineStats make a rich set of visualizations possible with an intuitive syntax.
Use the @df
macro to be able to refer to columns simply by their name. You can work with these symobls as if they are regular vectors. Here for example, we split data according to whether the distance is smaller or bigger than 1000
.
using StatPlots
gr(fmt = :png) # choose the fast GR backend and set format to png: svg would probably crash with so many points
@df flights scatter(:DepDelay, :ArrDelay, group = :Distance .> 1000, fmt = :png, layout = 2, legend = :topleft)
Unsurprisingly, departure and arrival delay are strongly correlated, regardless of distance traveled.
To split and combine the data in various more complicated forms, in presence of a "population" (or "grouping") variable, you can also use GroupedErrors.
In the following example we will plot the density of arrival delays spliting on distance being greater or smaller than a given constant value. If we are unsure how to smooth, we can set the smoothing variable with a slider using the Interact package.
The shaded area correspond to the standard error of our estimated density across UniqueCarrier
. It shows how much variability to expect in our estimate due to having a small population of UniqueCarrier
(i.e. airline companies).
using GroupedErrors
import Interact
Interact.@manipulate for bw in 1:20
@> flights begin
@splitby _.Distance > 1000
@across _.UniqueCarrier
@x _.ArrDelay
@y :density bandwidth = bw
@plot plot(xlims = (-50, 200))
end
end
For large datasets, summary statistics can be computed using efficient online algorithms implemnted in OnlineStats. Here we will use an online algorithm to compute the mean
traveled distance split across month of the year.
using OnlineStats
groupreduce(Mean(), flights, :Month; select = :Distance)
A utility function is provided to compute statistics on a partition of the data. Here for example we compute the extrema
of the travelled distance for each section of the dataset. Using the by
keyword we can run the analysis separately according to a splitting variable, here we'll be splitting by month:
partitionplot(flights, :Distance, stat = Extrema(), by = :Month, layout = 12, legend = false, xticks = [])
One can also plot one variable against the other, here for example we'll be plotting DepDelay
versus ArrDelay
. Be careful that, if your dataset has missing data and you want JuliaDB to filter it out for you before plotting, you need to use the keyword argument dropmissing
. Note how much faster this plot is, compared to the scatter plot above.
partitionplot(flights, :DepDelay, :ArrDelay, stat = Hist(10), dropmissing = true, legend = :topleft)
Interfacing with online datasets
JuliaDB can also smoothly interface online datasets using packages from the JuliaDatabases organization. Here's how it would work with a MySQL dataset:
using MySQL, JuliaDB
conn = MySQL.connect(host::String, user::String, passwd::String; db::String = "") # edit as needed for your dataset
MySQL.query(conn, "SELECT Name, Salary FROM Employee;") |> table # execute the query and collect as a table
MySQL.disconnect(conn)