The History of RegData

Patrick McLaughlin initially conceived RegData as a way of improving the measurement of regulations and the regulatory process. He quickly partnered with George Mason University economics professor Omar Al-Ubaydli to create the first two versions of RegData, and later he brought Oliver Sherouse into the RegData Project to develop the more recent iterations of the database.

In years past, regulation was a phenomenon that largely went unmeasured, rendering discussions and research related to the regulatory process qualitative and abstruse. On those rare occasions when regulation was quantified, researchers might measure it by counting pages published in the Federal Register. The problems with measuring regulation in this way are well documented: in addition to being rather noisy because many pages have nothing to do with regulation, this measurement method also runs the risk of counting deregulation as an increase in regulation because deregulation requires the publication of pages in the Federal Register. Even more important, not all regulations are created equal in their effects on different sectors of the economy or on the economy as a whole. One page of regulatory text can be quite different from another in content and consequence. So measuring regulation by counting pages misses a lot of detail that could be useful in understanding the causes and effects of regulation.

RegData introduced an objective, replicable, and transparent methodology for measuring regulation. RegData improved on existing measures of regulation in two principal ways:

RegData quantifies regulations based on the actual content of regulatory text. In other words, our custom-made program examines the regulatory text itself, counting the number of binding constraints or “restrictions,” words that indicate an obligation to comply such as “shall” or “must.” This creates a more precise metric because some regulatory programs can be hundreds of pages long with relatively few restrictions, while others have only a few paragraphs with a relatively high number of restrictions.
RegData quantifies regulations by industry. We assess the probability that a given regulatory restriction is targeting a specific industry, and this allows us to create industry-specific measures of regulation over time. RegData uses the same industry classes as the North American Industry Classification System (NAICS), which categorizes and describes each industry in the US economy. Using industry-specific quantifications of regulation, users can examine the growth of regulation relevant to a particular industry over time or compare growth rates across industries.
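
As a rough illustration of the two ideas above, the following Python sketch counts restriction terms in a passage and then weights that count by industry probabilities. It is not the RegData program itself. Only "shall" and "must" are named in the text above; the other terms in the list, the function names, and the choice to multiply the count by each probability are assumptions made for the example.

```python
import re

# "shall" and "must" come from the text above; "may not", "required", and
# "prohibited" are assumed here to round out the illustration.
RESTRICTION_PATTERN = re.compile(
    r"\b(?:shall|must|may\s+not|required|prohibited)\b",
    re.IGNORECASE,
)


def count_restrictions(text: str) -> int:
    """Count occurrences of restriction terms in a piece of regulatory text."""
    return len(RESTRICTION_PATTERN.findall(text))


def industry_restrictions(restrictions: int, probabilities: dict) -> dict:
    """Weight a restriction count by the probability that the text targets
    each industry (keys are NAICS codes). Multiplying count by probability
    is an assumed weighting, not a documented formula."""
    return {naics: restrictions * p for naics, p in probabilities.items()}


sample = (
    "The operator shall file a report. The operator must not discharge "
    "waste. Records may not be destroyed."
)
total = count_restrictions(sample)
# Hypothetical probabilities for two 3-digit NAICS codes.
weighted = industry_restrictions(total, {"324": 0.5, "111": 0.0})
print(total, weighted)
```

Counting terms rather than pages means a long document with few obligations scores lower than a short one dense with them, which is the point of the restrictions metric.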

Industry-specific measurements of regulation can be used in several ways. Both the causes and the effects of regulation can differ from one industry to the next, and with quantified regulations for all industries users can test whether industry characteristics—such as industry growth, dynamism, employment, or a penchant for lobbying—are connected to industry-specific regulation levels.

We have been transparent about our methodology even as it has evolved over time, and to date we have released eight iterations of the RegData database.

Version 4.0 (1970–2020) — Released May 2021 (Current Version)

RegData 4.0 introduced a new method of counting regulatory restrictions to include obligations or prohibitions previously hidden in lists or bullet points. In addition, like other iterations, RegData 4.0 expanded the dataset an additional year to cover the years 1970 to 2020. RegData 4.0 also added some accuracy improvements by cleaning older text and better parsing agency data. Please see the RegData U.S. 4.0 User’s Guide for details on these improvements.

Version 1.0 (1997–2010)

Released in 2012, RegData 1.0 introduced the restrictions metric—the method of measuring regulation by counting words such as “shall” and “must” within regulatory text. It also introduced the idea of creating industry-specific measures of regulation. These industry metrics of regulation were based on a human-assisted search algorithm, which involved creating a set of search terms or keywords based on the descriptions of specific industries in the North American Industry Classification System. The data in version 1.0 covered the years 1997–2010 and included 2- and 3-digit NAICS-coded industries.

Version 2.0 (1997–2012)

RegData 2.0 provided the ability to quantify the regulations that specific federal regulators (including agencies, offices, bureaus, commissions, or administrations) have produced. For example, with version 2.0, a user could see how many restrictions a specific administration of the Department of Transportation (e.g., the National Highway Traffic Safety Administration) has produced in each year. It also added the years 2011 and 2012 to the database. Version 2.0 was bundled with a new dataset that calculated the probabilities of specific industry search terms occurring in written English, compiled using the Google NGram Viewer’s underlying database. It also added 4-digit NAICS-coded industries.

Version 2.1 (1975–2013)

RegData 2.1 introduced machine-learning algorithms to the project. While versions 1.0 and 2.0 had relied upon search terms, devised using a scheme initially conceptualized by McLaughlin and Al-Ubaydli that created permutations of individual industry descriptions, the algorithms used in RegData 2.1 did not require humans to tell the program what specific words or phrases to search for. Instead, we found thousands of documents that we knew related to specific industries and used those documents to train our programs. Our programs parsed the training documents and identified which words and phrases were used in reference to specific industries. This enhancement permitted industry-specific classification of regulation to be much more accurate, primarily by avoiding more false positives.
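
The training idea described above can be sketched with a small text classifier. The text does not say which algorithm RegData 2.1 used, so the example below substitutes a plain multinomial naive Bayes model written in pure Python. The class name, the toy training documents, and the NAICS labels attached to them are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class IndustryClassifier:
    """Toy multinomial naive Bayes; a stand-in, not the RegData model."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, documents):
        """documents: iterable of (industry_label, text) pairs."""
        for label, text in documents:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def probabilities(self, text):
        """Return {label: probability} for a new piece of text."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out a label.
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Convert log scores to probabilities that sum to 1.
        peak = max(log_scores.values())
        exp_scores = {l: math.exp(s - peak) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: v / norm for l, v in exp_scores.items()}


clf = IndustryClassifier()
clf.train([
    ("324", "crude oil refinery pipeline petroleum"),  # illustrative label
    ("111", "wheat corn harvest farm crop"),           # illustrative label
])
probs = clf.probabilities("Refinery operators shall inspect each pipeline.")
print(probs)
```

No search terms are supplied by hand here: the model learns which words signal each industry from the labeled documents, and it returns a probability per industry rather than a yes-or-no match.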

RegData 2.1 added several more years’ data, covering 1975 to 2013, and included 3-digit NAICS-coded industries. It also introduced the public law database (PLDB), which mapped specific regulations to their authorizing statutes from 1980 to 2013.

Version 2.2 (1975–2014)

RegData 2.2 included significant refinements in the machine-learning algorithm used to classify regulations by industry. We also expanded the machine-learning-based dataset to include 2- and 4-digit NAICS-coded industries and added the year 2014 to both the regulations data and the PLDB so that RegData 2.2 covers 1975–2014 and the PLDB covers 1980–2014.

Version 3.0 (1970–2016)

RegData 3.0 for the first time covers all levels of the NAICS standard, from 2 to 6 digits. This version also broadens the scope of the dataset backward to 1970 and forward to 2016. Additionally, the machine-learning model has been further improved to better identify related industries that are regulated as a group. The PLDB has been separated out from the main release to allow for different updating schedules.

Version 3.1 (1970–2017)

RegData 3.1 expanded the dataset an additional year to 2017, using the XML representation of the Electronic CFR (eCFR). The eCFR has been annualized by choosing the last-modified version of each title as of that title’s publication date in the annual CFR publication cycle. In addition, RegData 3.1 changed the way agency names are parsed during the creation of the CFR corpus, improving both the accuracy and the speed of the creation of the corpus.
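
One reading of the annualization step can be sketched as follows. The quarterly cutoffs follow the standard annual CFR revision schedule (titles 1 to 16 as of January 1, 17 to 27 as of April 1, 28 to 41 as of July 1, and 42 to 50 as of October 1). Whether RegData applies exactly these cutoffs is an assumption, and the function names and the input format are hypothetical.

```python
from datetime import date


def revision_date(title: int, year: int) -> date:
    """Annual revision date for a CFR title under the standard quarterly cycle."""
    if not 1 <= title <= 50:
        raise ValueError("CFR titles are numbered 1 through 50")
    if title <= 16:
        return date(year, 1, 1)
    if title <= 27:
        return date(year, 4, 1)
    if title <= 41:
        return date(year, 7, 1)
    return date(year, 10, 1)


def annualize(versions, year):
    """Pick, for each title, the latest version modified on or before that
    title's revision date. `versions` is an iterable of
    (title_number, last_modified_date) pairs (an assumed input format)."""
    chosen = {}
    for title, modified in versions:
        if modified > revision_date(title, year):
            continue  # modified after the cutoff, so it belongs to a later year
        if title not in chosen or modified > chosen[title]:
            chosen[title] = modified
    return chosen


versions = [
    (40, date(2017, 3, 1)),
    (40, date(2017, 6, 15)),
    (40, date(2017, 8, 1)),    # after the July 1 cutoff for title 40
    (12, date(2016, 12, 20)),
    (12, date(2017, 2, 1)),    # after the January 1 cutoff for title 12
]
snapshot = annualize(versions, 2017)
print(snapshot)
```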

Version 3.2 (1970–2019)

RegData 3.2 expanded the main RegData dataset two more years to 2019. In addition, RegData 3.2 included metrics for determining the complexity of regulatory text, including the number of conditional terms or phrases, average sentence length, reading grade level, and Shannon entropy.
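
Three of these complexity metrics are simple enough to sketch in a few lines of Python. The conditional-term list, the tokenization rules, and the choice to compute entropy over the word distribution are assumptions for illustration and may differ from what RegData 3.2 does. Reading grade level is left out because it depends on a specific readability formula that the text does not name.

```python
import math
import re
from collections import Counter

# Assumed list of conditional terms; the actual list is not given in the text.
CONDITIONAL_PATTERN = re.compile(
    r"\b(?:if|unless|except|provided\s+that|when|where)\b",
    re.IGNORECASE,
)


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_conditionals(text):
    """Count conditional terms or phrases in the text."""
    return len(CONDITIONAL_PATTERN.findall(text))


def average_sentence_length(text):
    """Mean number of words per sentence. Splitting on ., !, and ? is naive
    and will miscount abbreviations such as "e.g."."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(words(s)) for s in sentences) / len(sentences)


def shannon_entropy(text):
    """Shannon entropy, in bits, of the word-frequency distribution."""
    counts = Counter(words(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "If the permit lapses, the holder must reapply unless exempt."
print(count_conditionals(sample))
print(average_sentence_length("One two three. Four five."))
print(shannon_entropy("a b a b"))
```

Higher entropy means the text draws on a wider and more evenly used vocabulary, while a passage that repeats a single word has entropy of zero.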