Convenient plotting of distribution shapes in R

I needed to compare the shapes of a few distributions, and therefore wanted an easy way to plot them in R. For example, the standard normal can be plotted quickly enough with curve(dnorm(x, mean=0, sd=1), from=-3, to=3). But when comparing multiple ones, there could be something a bit more handy in terms of keeping track of parameter values for each. So, I ended up writing a little convenience function for that purpose.

As an example, a few beta distributions can be compared with:

plot_dist(dbeta, c(shape1=2, shape2=8))
plot_dist(dbeta, c(shape1=2, shape2=5), col="red", add=TRUE)
plot_dist(dbeta, c(shape1=2, shape2=3), col="blue", add=TRUE)
plot_dist(dbeta, c(shape1=2, shape2=2), col="darkgreen", add=TRUE)


The code for plot_dist() is in this gist.

Posted in R | Comments Off on Convenient plotting of distribution shapes in R

How good is Patrik Laine’s shot?

Patrik Laine is playing his first season in the NHL and currently leads the league in scoring with 12 goals. With 51 shots on goal, his shooting percentage is 23.5 %. How does this number compare to great goal scores over the years? I downloaded NHL player statistics for each season from 1967–1968 onwards, which was the first year the number of shots was recorded. I then calculated career summaries for each player. But if we simply look for players with the highest shooting percentages, the first 12 all scored one goal with just one shot. Obviously these are not the best shooters, just some random flukes. In its official leaderboard for all-time career shooting percentage (S%), the NHL only counts players with at least 800 shots. This is what the top 10 looks like:

Name Pos GP G A PTS S S%
1 Craig Simpson LW 634 247 250 497 1,044 23.7
2 Charlie Simmer LW 712 342 369 711 1,531 22.3
3 Paul MacLean RW 719 324 349 673 1,513 21.4
4 Mike Bossy RW 752 573 553 1,126 2,705 21.2
5 Yvon Lambert LW 683 206 273 479 1,038 19.8
6 Rick Middleton RW 1,005 448 540 988 2,275 19.7
7 Blaine Stoughton RW 526 258 191 449 1,322 19.5
8 Darryl Sutter LW 406 161 118 279 829 19.4
9 Rob Brown RW 543 190 248 438 979 19.4
10 Mike Ridley C 866 292 466 758 1,513 19.3

Requiring a minimum number of shots (or goals) does get rid of the flukes, but how can you compare a rookie player? What kind of method could be used to take the scarcity of evidence into account, until the player catches up with the threshold? David Robinson has written a terrific series of articles for situations like this, using baseball statistics as an example. I’ll follow one of his tutorials and use empirical Bayes estimation to obtain a more reliable picture. In short, we’ll first use all players’ data to obtain an estimate for a beta prior, and then use each player’s own data to update the prior based on individual evidence. Put another way, we start by assuming everyone is average, and if and only if they show more and more evidence to the contrary, we start to gradually consider them as special. For a more much better description, please see the original blog post. All R code is also adapted from that post.

Before we get to estimation of the beta prior, let’s first check if we should use all of the available data or only a subset. Since in this case we are estimating only one prior, we would like all players to come from a single distribution. As the gameplay has surely changed a bit over the years, let’s look at the overall shooting percentages over the 49 seasons. Also, since defensemen normally play futher away from the opponent’s net than forwards, player position is likely to have an effect as well. Let’s look at shooting percentages separately for each position (excluding players with less than ten goals).



As we can see, shooting percentages used to be much higher around the 1980s. For this simple analysis, I’ll only include data from season 1996–1997 onwards. I’ll also leave out the defensemen, as they tend to have lower shooting percentages. (I hope to write follow-ups posts later with all of the data included and handled properly, either using some of the other approaches David has described for empirical Bayes, with a standard Bayesian analysis, or maybe even both.)

Overall, the average shooting percentage for all forwards over the last 20 seasons is 11.0 %. Next, let’s estimate a beta prior from the data and see how it fits:


Shooting percentages can now be adjusted using this prior. This will shrink individual players’ estimates towards the horizontal dashed line. The more evidence there is for an individual (the brighter the blue dot), the more we trust it. The darker dots show a lot of shrinkage, whereas the light ones are much closer to the diagonal red line, which marks the case of no shrinkage at all.


Finally, let’s look at the ranking (from season 1996–1997 onwards) for shooting percentage estimated with empirical Bayes (EB). Patrik Laine currently sits at number 40, and only time will tell where he moves on that list. But what we do know today, is that he is one of only four 18-year-olds to score two hat tricks in the NHL (others being Jack Hamilton, Dale Hawerchuk, and Trevor Linden), and he still has the rest of the regular season to hunt for a third one before his 19th birthday on April 19th, 2017.

Name Pos GP G A PTS S S% EB
1 Alex Tanguay LW 1,088 283 580 863 1,525 18.6 17.9
2 Andrew Brunette LW 1,099 265 462 727 1,500 17.7 17.1
3 Steven Stamkos C 586 321 261 582 1,876 17.1 16.7
4 Mark Parrish RW 722 216 171 387 1,247 17.3 16.7
5 Dmitri Khristich C 420 111 171 282 633 17.5 16.3
6 Mike Ridley C 75 20 32 52 79 25.3 16.1
7 Tomas Holmstrom LW 1,026 243 287 530 1,489 16.3 15.9
8 Gary Roberts LW 639 181 224 405 1,120 16.2 15.6
9 Brenden Morrow LW 991 265 310 575 1,670 15.9 15.5
10 Jan Hrdina C 513 101 196 297 619 16.3 15.3
11 Jason Allison C 519 152 326 478 962 15.8 15.2
12 Ziggy Palffy RW 565 276 333 609 1,799 15.3 15.0
13 John LeClair LW 624 281 274 555 1,833 15.3 15.0
14 Alexander Mogilny RW 530 207 274 481 1,341 15.4 15.0
15 Pierre Turgeon C 622 197 351 548 1,285 15.3 14.9
16 Tyler Bozak C 451 112 169 281 717 15.6 14.8
17 Sergei Kostitsyn LW 353 67 109 176 414 16.2 14.8
18 Joe Nieuwendyk C 628 236 242 478 1,555 15.2 14.8
19 Anson Carter RW 674 202 219 421 1,331 15.2 14.8
20 Tony Hrkac C 425 70 105 175 438 16.0 14.7
21 Mark Messier C 555 155 264 419 1,015 15.3 14.7
22 Yanic Perreault C 742 217 237 454 1,436 15.1 14.7
23 Adam Deadmarsh RW 441 154 154 308 1,010 15.2 14.7
24 Adam Henrique C 364 101 109 210 650 15.5 14.7
25 Teemu Selanne RW 1,192 521 594 1,115 3,528 14.8 14.6
26 Jonathan Toews C 662 255 321 576 1,710 14.9 14.6
27 Jiri Hudler C 680 161 256 417 1,068 15.1 14.6
28 Paul Byron C 217 34 43 77 198 17.2 14.5
29 Brad Marchand C 470 158 147 305 1,057 14.9 14.5
30 Sidney Crosby C 716 348 603 951 2,376 14.6 14.4
31 Mike Sillinger C 831 210 234 444 1,420 14.8 14.4
32 Keith Tkachuk LW 893 394 382 776 2,713 14.5 14.3
33 Peter Forsberg C 579 204 515 719 1,390 14.7 14.3
34 Milan Lucic LW 664 164 242 406 1,111 14.8 14.3
35 Dany Heatley RW 869 372 419 791 2,565 14.5 14.3
36 David Desharnais C 420 78 168 246 511 15.3 14.3
37 Martin Straka LW 714 206 370 576 1,408 14.6 14.3
38 Thomas Vanek LW 824 320 337 657 2,213 14.5 14.2
39 Stephane Matteau LW 471 75 88 163 493 15.2 14.2
40 Patrik Laine RW 18 12 5 17 51 23.5 14.2
Posted in R, sports | Comments Off on How good is Patrik Laine’s shot?

Boat trips and weather observations

Another boating season is over, so I updated the shiny app of my boat trips with another year’s worth of Moves app data.

I also added weather and wave observations from the API of the Finnish Meteorological Institute. Clicking on a track brings up these details, provided the closest weather stations and wave buoys are within 30 nautical miles from the track. They are therefore only approximations, and can vary from the conditions actually experienced on the boat, depending on its location relative to the stations/buoys and nearby islands.

It should also be noted that the tracking accuracy of Moves is lower than that of proper chart plotters, which is only natural considering battery consumption. There are also clear errors when close to shore, as Moves has a tendency to place the location on known roads. Tracks and all values should therefore be taken with a grain of salt.


The source code can be found on GitHub, and the app itself is here.

Posted in R, sailing, transportation | Comments Off on Boat trips and weather observations

Finland’s dependency ratio and pension contributions

In March, I made simple visualizations of Finland’s population structure and statutory pension contributions. I also made a third one, combining the two topics, but never posted it. Here it is.


Shown in white is the aged dependency ratio, which is the proportion of people aged 65 and over, compared to those between 15 and 64. A solid line shows historical data, and future predictions are shown with a dashed one. The blue line shows statutory pension contributions as percent of GDP.

I agree with the former MP Kimmo Kiljunen in that the capital amassed in pension funds should be used. But I disagree with him when he says that they should be used to increase pensions. Instead, they should be used to keep pension contributions from skyrocketing along with the dependency ratio.

That’s the reason why these funds were accumulated in the first place.

The code to generate the figure above is in this gist.

Posted in R | Comments Off on Finland’s dependency ratio and pension contributions

Finland’s mandatory pension contributions

A couple of weeks ago, I made an animated visualization of the population structure of Finland. Here’s another plot exploring demographic changes, this time coupled with the economy.

Finland has a defined-benefit and earnings-related statutory pension system. Employers are required by law to pay pension contributions of 24.0 % on top of an employee’s gross salary. In addition, the employee is required to pay a contribution of 5.7 %, or if they are 53 years or older, 7.2 %. These contributions are used to pay out the current pension liabilities. (The system is partly funded, but currently more is paid out than collected.)

Here’s a plot of the total contributions as a percentage of the gross domestic product (GDP).


The R code to produce the plot is in this gist.

Posted in R | Comments Off on Finland’s mandatory pension contributions

Population Structure of Finland

Inspired by the blot post Japan’s aging population, animated with R by David Smith and the population pyramid plot by Kyle Walker, I figured I’d try the same for Finland.

I used the pxweb package (by Måns Magnusson, Love Hansson, and Leo Lahti) to pull the corresponding data from Statistics Finland, and plotted it by making some changes to Kyle’s code.


The R code to generate the animation is in this gist.

Posted in R | Comments Off on Population Structure of Finland

Snowplows of Helsinki

As part of the Helsinki Region Infoshare initiative, the city of Helsinki provides an API that shows the locations, routes, and activities of snowplows that are operated by its service provider Stara.

Using that API, Sampsa Kuronen created Aurat kartalla, which is a beautiful visualization of the real-time data. It allows you to specify a time interval, and shows different activities (snow removal, spreading sand, de-icing with salt, etc) with different colors.

I decided to try my own version with shiny, for a couple of reasons:

  1. In addition to identifying different activities, the API also includes a flag specifying “bicycle and pedestrian lanes”. Aurat kartalla always shows them with the same color, therefore not distinguishing between e.g. spreading sand and de-icing with salt. Although I personally don’t really mind, for some cyclists this is important information however. Many have suffered flat tires because of the sand, and many feel that the salt rusts their bikes.
  2. Outside bicycle and pedestrian lanes, Aurat kartalla does show the different activities with different colors. But when there are multiple activities performed on the same route, it can be difficult to tell them apart.
  3. I had never created a shiny app that polls an external API and automatically updates its data, so it was simply an interesting experiment.

Here are links to the resulting shiny app and its source code on GitHub.

Screen Shot 2016-02-03 at 14.50.53

As my goal was to provide granular control to really check what activities had been performed along a specific route, at first I included a separate setting to distinguish between streets and bicycle/pedestrian lanes. However, after looking at the results on a couple of snowy days, I noticed that this flag wasn’t really that reliable. Exact same routes were plowed both with and without it.

I can think of two possible explanations. The first one is that the flag really just specifies the equipment used, and some plows are marked for bicycle/pedestrian lanes, while others are not. And that in reality, both can also operate outside these target routes. The second one is that the presence of the flag relies on the plow driver explicitly specifying when they are plowing a bicycle/pedestrian lane, and that this is simply often forgotten (as, to be honest, I would expect to happen in reality).

Therefore, I removed the separation between streets and bicycle/pedestrian lanes, and instead show both at the same time. But the main point is still to be able to unambiguously distinguish between the different activities that have been performed. However, this goal suffers a bit from the fact that the API doesn’t actually contain all of the plows in use, so there is no way to tell for sure whether something has not been performed.

Nevertheless, it was a fun experiment. And in any case, I think Aurat kartalla provides a more beautiful overall visualization of the same data, and with better performance.

Posted in R, transportation, urban | Comments Off on Snowplows of Helsinki

Mapping my boat trips

Now, in the middle of winter and when I’m a feeling bit under the weather, it’s a perfect moment to reminisce about summer and time spent on the Finnish Archipelago Sea. So, I combined tracking data from the Moves app with some shiny and leaflet code to make an interactive map that shows my boat trips from the last two years.

The source code can be found on GitHub, and the app itself is here.

Screen Shot 2016-01-31 at 18.54.20

Zooming in on the tracks brings back memories from all those legs and marinas, of great sailing and even better company. It lets me relive moments like these.

Posted in R, sailing | Comments Off on Mapping my boat trips

ggplot 2.0 and the missing order aesthetic

Version 2.0 of the popular R package ggplot2 was released three weeks ago. When I was reading the release notes, I largely just skipped over this entry under Deprecated features:

  • The order aesthetic is officially deprecated. It never really worked, and
    was poorly documented.

After all, something that “never really worked” didn’t seem that important. But last night, I realized I had indeed been using it, and now needed to find a workaround.

Now, it seems to me like this was not a very widely used feature, and most people were therefore already using a better solution to achieve the same goal. So, to demonstrate what I mean, let’s create some dummy data, and count how many occurrences of each weekday there were in each month last year:


year_2015 <- data_frame(date=seq(from=as.Date("2015-01-01"), to=as.Date("2015-12-31"), by="day")) %>%
  mutate(month=floor_date(date, unit="month"), weekday=weekdays(date)) %>%
  count(month, weekday)
Source: local data frame [84 x 3]
Groups: month [?]

        month   weekday     n
       (date)     (chr) (int)
1  2015-01-01    Friday     5
2  2015-01-01    Monday     4
3  2015-01-01  Saturday     5
4  2015-01-01    Sunday     4
5  2015-01-01  Thursday     5
6  2015-01-01   Tuesday     4
7  2015-01-01 Wednesday     4
8  2015-02-01    Friday     4
9  2015-02-01    Monday     4
10 2015-02-01  Saturday     4
..        ...       ...   ...
year_2015 %>%
  ggplot(aes(x=month, y=n, fill=weekday)) +


With weekday being character data, by default it is ordered alphabetically, from Friday to Wednesday. But since weekdays of course have a natural order, we can honor that with an ordered factor:

year_2015_factor <- year_2015 %>%
  mutate(weekday=factor(weekday, levels=c("Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday","Saturday", "Sunday"), ordered=TRUE))

year_2015_factor %>%
  ggplot(aes(x=month, y=n, fill=weekday)) +


That takes care of the order in the legend, but not in the plot itself. Prior to version 2.0, it was possible to define the plotting order with the order aesthetic:

year_2015_factor %>%
  ggplot(aes(x=month, y=n, fill=weekday, order=-as.integer(weekday))) +

However, that does not work anymore in version 2.0. As I said above, it seems to me that few people were really using the order aesthetic, and most were probably just taking advantage of the fact that the plotting order is the same order in which the data is stored in the data.frame. In this case, it was the alphabetical order as a consequence of using count(). So, let’s re-order the data.frame and plot again:

year_2015_factor %>%
  ungroup() %>%
  arrange(-as.integer(weekday)) %>%
  ggplot(aes(x=month, y=n, fill=weekday)) +


There we go. Now both the plot and the legend are in the same, natural order.

That’s one way to solve the case where I had been using the order aesthetic. I’m not sure if it applies to all other scenarios and geoms as well.

Posted in R | Comments Off on ggplot 2.0 and the missing order aesthetic

What I Did at the Recurse Center

Last spring, I spent three months at the Recurse Center, which is like a writers’ retreat for programmers. People from all over the world and with very different backgrounds go there to become better programmers. It’s a great environment for self-learning and collaborating with others. Here I’ll briefly outline what I worked on during that time.

For the past ten years, my number one tool at work as been R. As I’ve been focused on cancer research and chromosomal aberrations, also the packages I’ve used have been from this area (especially Bioconductor). For the three months at RC, I decided to work on broadening my skillset to more general-purpose data science tools.

I still worked mostly in R, and learned to use data manipulation packages like data.table (for more efficiency and modifying data in place) and dplyr (more expressive/logical/readable code, at least for someone like me with background in SQL). For visualizations, I learned how to use ggplot2, how to make interactive apps with shiny, and how to draw maps with ggmap and leaflet. And regarding machine learning, I learned how to use the caret package’s unified interface to train and tune various statistical models. I also took Stanford’s online course on Statistical Learning.

To get to know these packages, I worked on three little projects: how different neighborhoods in New York City vary in their Citi Bike usage patterns, what data from the Moves app tells about how I move and where I’ve been, and also made a prediction on who would win the Stanley Cup.

In addition to working with R, I brushed up my Python skills completing the exercises available at Dataquest. I got to know the very basics of libraries like NumPy, pandas, and matplotlib, as my previous Python experience was only from writing basic utility scripts and not really any type of data analysis.

I also listened to many excellent talks on topics such as public speaking, network protocols, UNIX process model and shell programming, Docker/containerization (and updated the server this website is hosted at to use systemd containers), immutability, hashes, and whether artificial intelligence is a threat or not. Books I read included An Introduction to Statistical Learning, ggplot2 – elegant graphics for data analysis, and The Second Machine Age.

In general, my experience at RC was very positive. I learned a lot and was surrounded with very smart people who I could always ask for advice and guidance. To anyone contemplating a batch I would say go.

Posted in Uncategorized | Comments Off on What I Did at the Recurse Center