What is QuantGov?

QuantGov is a free collection of tools and resources for researchers and policymakers alike.

Federal, state, and local governments produce enormous amounts of policy text, including the text of laws, regulations, trade agreements, treaties, court decisions, and even public speeches. It is almost impossible for any one human, or team of humans, to gain insights from such large bodies of text. For example, it would take the average person over three years to read the entire US Code of Federal Regulations (CFR), which contains more than 100 million words and counting. And that is just the reading component. Not comprehension, not quantification, and not analysis.

QuantGov solves the problem of quantifying large amounts of policy text for research and comprehension by using machine learning and natural language processing.

At its core, QuantGov is an open-source policy analytics platform designed to create greater understanding and analysis of the breadth of government actions by quantifying policy text. Using the platform, researchers can quickly and effectively retrieve unique data embedded in large bodies of text - data on text complexity, part-of-speech metrics, topic modeling, and more. The platform opens up new lines of research in areas rarely explored before, helping everyone from the policymaker to the professor gain a broad yet detailed picture of government policies and allowing for scientific testing of assumptions about those policies' causes and effects.

What Can QuantGov Actually Do?

QuantGov is a tool designed to make policy text more accessible. Think of it as a hyper-powerful Google search that not only (1) finds specified content within massive quantities of text, but also (2) finds patterns and groupings and can even make predictions about what is in a document. Some recent use cases include the following:

  • Analyzing state regulatory codes and predicting which parts of those codes are related to occupational licensing … and predicting which occupation the regulation is talking about … and determining the cost to receive the license.

  • Analyzing Canadian provincial regulatory codes while grouping individual regulations by industry topic … and determining which Ministers are responsible for those regulations … and determining the complexity of the text of those regulations.

  • Quantifying the number of tariff exclusions that exist due to the Trade Expansion Act of 1962 and recent tariff policies … and determining which products those exclusions target.

  • Comparing the regulatory codes and content of 46 US states, 11 Canadian provinces and territories, and 7 Australian states and territories … while using consistent metrics that can lead to insights and inform legitimate policy improvements.

What Can You Do With QuantGov Products?

Below is a video that describes how an individual can use RegData, a flagship QuantGov product, to answer questions about regulations.

FAQs

+ Where do I start?

First, determine what your end goal is. Do you want to download data already produced by QuantGov? Do you want to compare across different jurisdictions? Are you interested in using QuantGov to build your own dataset? Click here or contact us directly. Are you a member of the media interested in learning more or speaking with an analyst? Please email media@mercatus.org.

+ Why is it so hard to comply with government rules and regulations?

From our research, it would take the average person more than 3 years to read the entire U.S. Code of Federal Regulations (CFR). And that’s just reading: not comprehending, quantifying, or analyzing. QuantGov wants to simplify this process for policymakers, the academy, and businesses.

+ Can I use QuantGov on my own documents?

Yes. You can use our custom QuantGov Python library to analyze your own documents.

+ What is RegData?

RegData is a sub-project of QuantGov. It is the specific process of using the QuantGov platform to collect, quantify, and analyze regulatory code.

+ Can I trust machine learning, natural language processing, and artificial intelligence?

The reality of using these tools on text is much closer to what you experience in a college statistics course than to what was portrayed in The Matrix (although it was a fantastic movie). ML, AI, and NLP represent the cutting edge of computer science and have proven to be reliable tools for analyzing bodies of text far too large for humans to process on their own. That said, nothing is perfect. We recommend that all users read the documentation associated with our data and have a well-researched understanding of the specific topics that they are using our data to analyze. You may also want to use our QuantGov Python library to analyze your documents.

+ Can the QuantGov team analyze my documents?

Possibly. You may also be able to analyze your documents yourself using our QuantGov Python library. If not, please contact us directly about this question.

+ Have the data produced by QuantGov been used by other academics or researchers?

Hundreds of researchers and academics have used and cited RegData and QuantGov in academic papers, policy studies, and across many media platforms. In addition, QuantGov products have been used by dozens of governments and their associated agencies/departments. You can see some recent citations here.

+ What does open-source mean?

Open-source means that all of the code that backs our QuantGov platform is public and free to use and download.

+ What is the history of RegData?

Patrick McLaughlin initially conceived RegData as a way of improving the measurement of regulations and the regulatory process. He quickly partnered with George Mason University economics professor Omar Al-Ubaydli to create the first two versions of RegData, and later he brought Oliver Sherouse into the RegData Project to develop the more recent iterations of the database.

In years past, regulation was a phenomenon that largely went unmeasured, rendering discussions and research related to the regulatory process qualitative and abstruse. On those rare occasions when regulation was quantified, researchers might measure it by counting pages published in the Federal Register. The problems with measuring regulation in this way are well documented: in addition to being rather noisy because many pages have nothing to do with regulation, this measurement method also runs the risk of counting deregulation as an increase in regulation because deregulation requires the publication of pages in the Federal Register. Even more important, not all regulations are created equal in their effects on different sectors of the economy or on the economy as a whole. One page of regulatory text can be quite different from another in content and consequence. So measuring regulation by counting pages misses a lot of detail that could be useful in understanding the causes and effects of regulation.

RegData introduced an objective, replicable, and transparent methodology for measuring regulation. RegData improved on existing measures of regulation in two principal ways:

RegData quantifies regulations based on the actual content of regulatory text. In other words, our custom-made program examines the regulatory text itself, counting the number of binding constraints or “restrictions,” words that indicate an obligation to comply such as “shall” or “must.” This creates a more precise metric because some regulatory programs can be hundreds of pages long with relatively few restrictions, while others have only a few paragraphs with a relatively high number of restrictions.
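To make the restrictions metric concrete, here is a minimal sketch of how such a count could be computed in Python. It is an illustration only, not RegData's actual program, and the term list is an assumption built from the examples named above.

```python
import re

# Illustrative restriction terms ("shall" and "must" are named above);
# the full term list and matching rules used by RegData may differ.
RESTRICTION_TERMS = ["shall", "must", "may not", "required", "prohibited"]

def count_restrictions(text: str) -> int:
    """Count occurrences of restriction terms in a piece of regulatory text."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in RESTRICTION_TERMS) + r")\b"
    return len(re.findall(pattern, text.lower()))

sample = ("Each permittee shall file an annual report and must retain "
          "supporting records for five years.")
print(count_restrictions(sample))  # -> 2
```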

RegData quantifies regulations by industry. We assess the probability that a given regulatory restriction is targeting a specific industry, and this allows us to create industry-specific measures of regulation over time. RegData uses the same industry classes as the North American Industry Classification System (NAICS), which categorizes and describes each industry in the US economy. Using industry-specific quantifications of regulation, users can examine the growth of regulation relevant to a particular industry over time or compare growth rates across industries.
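As a rough illustration of this kind of industry classification (not QuantGov's actual code), the sketch below trains a simple probabilistic text classifier with scikit-learn on a handful of placeholder documents and labels. Real training sets, as described in the version history below, contain thousands of labeled documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training documents already known to relate to specific
# industries; the labels stand in for NAICS industry classes.
train_docs = [
    "rules governing crop insurance and livestock grazing permits",
    "capital requirements for depository institutions and bank holding companies",
    "standards for commercial motor vehicle braking systems",
]
train_labels = ["agriculture", "finance", "transportation"]

# TF-IDF features feeding a probabilistic classifier, roughly analogous
# to the supervised approach described above.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

# Estimate the probability that a new restriction targets each industry.
new_regulation = "Each carrier must inspect vehicle brakes before operation."
for label, prob in zip(model.classes_, model.predict_proba([new_regulation])[0]):
    print(f"{label}: {prob:.2f}")
```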

Industry-specific measurements of regulation can be used in several ways. Both the causes and the effects of regulation can differ from one industry to the next, and with quantified regulations for all industries users can test whether industry characteristics—such as industry growth, dynamism, employment, or a penchant for lobbying—are connected to industry-specific regulation levels.

We have been transparent about our methodology even as it has evolved over time, and to date we have released several iterations of the RegData database, summarized below.

Version 4.0 (1970–2020) — Released May 2021 (Current Version)

RegData 4.0 introduced a new method of counting regulatory restrictions that captures obligations or prohibitions previously hidden in lists or bullet points. In addition, like other iterations, RegData 4.0 extended the dataset by an additional year, covering 1970 to 2020. RegData 4.0 also added accuracy improvements by cleaning older text and better parsing agency data. Please see the RegData U.S. 4.0 User's Guide for details on these improvements.

Version 1.0 (1997–2010)

Released in 2012, RegData 1.0 introduced the restrictions metric—the method of measuring regulation by counting words such as “shall” and “must” within regulatory text. It also introduced the idea of creating industry-specific measures of regulation. These industry metrics of regulation were based on a human-assisted search algorithm, which involved creating a set of search terms or keywords based on the descriptions of specific industries in the North American Industry Classification System. The data in version 1.0 covered the years 1997–2010 and included 2- and 3-digit NAICS-coded industries.

Version 2.0 (1997–2012)

RegData 2.0 provided the ability to quantify the regulations that specific federal regulators (including agencies, offices, bureaus, commissions, or administrations) have produced. For example, with version 2.0, a user could see how many restrictions a specific administration of the Department of Transportation (e.g., the National Highway Traffic Safety Administration) has produced in each year. It also added the years 2011 and 2012 to the database. Version 2.0 was bundled with a new dataset that calculated the probabilities of specific industry search terms occurring in written English, compiled using the Google NGram Viewer’s underlying database. It also added 4-digit NAICS-coded industries.

Version 2.1 (1975–2013)

RegData 2.1 introduced machine-learning algorithms to the project. While versions 1.0 and 2.0 had relied upon search terms, devised using a scheme initially conceptualized by McLaughlin and Al-Ubaydli that created permutations of individual industry descriptions, the algorithms used in RegData 2.1 did not require humans to tell the program what specific words or phrases to search for. Instead, we found thousands of documents that we knew related to specific industries and used those documents to train our programs. Our programs parsed the training documents and identified which words and phrases were used in reference to specific industries. This enhancement made industry-specific classification of regulation much more accurate, primarily by producing fewer false positives.

RegData 2.1 added several more years’ data, covering 1975 to 2013, and included 3-digit NAICS-coded industries. It also introduced the public law database (PLDB), which mapped specific regulations to their authorizing statutes from 1980 to 2013.

Version 2.2 (1975–2014)

RegData 2.2 included significant refinements in the machine-learning algorithm used to classify regulations by industry. We also expanded the machine-learning-based dataset to include 2- and 4-digit NAICS-coded industries and added the year 2014 to both the regulations data and the PLDB so that RegData 2.2 covers 1975–2014 and the PLDB covers 1980–2014.

Version 3.0 (1970–2016)

RegData 3.0 covered, for the first time, all levels of the NAICS standard, from 2 to 6 digits. This version also broadened the scope of the dataset backward to 1970 and forward to 2016. Additionally, the machine-learning model was further improved to better identify related industries that are regulated as a group. The PLDB was separated out from the main release to allow for different updating schedules.

Version 3.1 (1970–2017)

RegData 3.1 expanded the dataset an additional year, to 2017, using the XML representation of the Electronic CFR (eCFR). The eCFR was annualized by choosing, for each title, the last-modified version as of that title's publication date in the annual CFR publication cycle. In addition, RegData 3.1 changed the way agency names are parsed during the creation of the CFR corpus, improving both the accuracy and the speed of corpus creation.

Version 3.2 (1970–2019)

RegData 3.2 expanded the main RegData dataset two more years, to 2019. In addition, RegData 3.2 included metrics for determining the complexity of regulatory text, including the number of conditional terms or phrases, average sentence length, reading grade level, and Shannon entropy.
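To illustrate what such complexity metrics involve, here is a minimal sketch in Python of three of them: conditional-term count, average sentence length, and Shannon entropy of the word distribution. The conditional-term list is an assumption for illustration, the reading-grade-level calculation is omitted, and RegData 3.2's actual definitions may differ.

```python
import math
import re
from collections import Counter

# Illustrative single-word conditional terms; RegData's actual list may differ.
CONDITIONAL_TERMS = ["if", "unless", "except", "provided", "when"]

def complexity_metrics(text: str) -> dict:
    """Compute simple text-complexity measures for a regulatory text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total = len(words)
    # Shannon entropy (bits per word) of the word-frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "conditional_terms": sum(counts[t] for t in CONDITIONAL_TERMS),
        "avg_sentence_length": total / len(sentences),
        "shannon_entropy": round(entropy, 2),
    }

sample = ("No permit shall issue unless the applicant files a complete "
          "application. If the application is incomplete, the agency must "
          "reject it within 30 days.")
print(complexity_metrics(sample))
```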