Data Replicability

Posted on Tue 13 March 2018 in Analysis

I thought that Ben Falk had a great post on his excellent website Cleaning the Glass. In it, he talks about how the papers at Sloan [1] are becoming less replicable because researchers are not releasing their data sources. This lack of open data is stifling the open source community: any new analysis is limited by the data its author has available. To reiterate: although analytics grew out of public mailing lists, innovation in analytics is now being driven by companies and analysts inside sports teams, and these organizations keep their data private because they are afraid of giving up a strategic or monetary advantage.

This checks out from an economic perspective: open source has been studied in innovation economics and has several benefits that don’t appear on a balance sheet. Economists call these benefits positive externalities. For example, imagine that a team released a new type of data to the public. At first glance this is not a smart business move, but it could lead to a no-name analyst posting groundbreaking analysis on Reddit, and that analyst could then be hired. I know that teams are always looking for great analysts, and this could be a cheap way to find them. Or perhaps the data could be remixed by the open source community and jumpstart entirely new trends in the field of analytics. These benefits are not obvious at first glance; however, they are valuable and should be considered in the cost/benefit analysis.


  1. The Sloan Sports Analytics Conference