Were the largest of the PPP loans granted fairly?
The Paycheck Protection Program (PPP) was initiated in April 2020. Administered by the Small Business Administration (SBA), these forgivable loans were meant to relieve small businesses in America of some of the potentially devastating strain imposed by the onset of the COVID-19 pandemic. Congress appropriated $649 billion for this program, and over 5 million loans were granted, at an average amount of $111,000.
As you can imagine, this program was enormously popular and there was an immediate feeding frenzy to snap up these loans as soon as the program opened. After all, businesses were suffering and desperate. And who wouldn’t want what will ostensibly become free money?
My physical therapy practice received one of these loans, so I am familiar with the application process and what are meant to be the rules for forgiveness of the loan. The amount of loan money that a business could apply for could be calculated in a couple of different ways, but generally it was based upon the average monthly payroll costs over the previous year. In order to qualify to have the loan forgiven, the business was supposed to spend at least 60% of the funds on payroll costs and the rest on rent and utilities, during a coverage period of either 8 or 24 weeks after obtaining the loan. The rules for loan forgiveness are still being hashed out in Congress, but roughly speaking, if after that coverage period the business can show documentation that the loan money was spent appropriately, that is, that it actually went toward protecting paychecks, then the loan will be forgiven.
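To make the arithmetic concrete, here is a minimal sketch of the sizing and forgiveness rules described above. The 2.5x multiplier on average monthly payroll and the $10 million cap are assumptions taken from the commonly cited program terms, not figures stated in this post, and both function names are made up for illustration.

```python
def estimate_loan_amount(annual_payroll, multiplier=2.5, cap=10_000_000):
    """Estimate a loan from last year's payroll.

    The multiplier and cap are assumptions, not figures from this post.
    """
    average_monthly_payroll = annual_payroll / 12
    return min(average_monthly_payroll * multiplier, cap)


def meets_payroll_share(loan_amount, spent_on_payroll, minimum_share=0.60):
    """Check the condition that at least 60% of the loan went to payroll."""
    return spent_on_payroll >= minimum_share * loan_amount


# A hypothetical practice with $480,000 in payroll over the previous year.
loan = estimate_loan_amount(annual_payroll=480_000)
print(f"Estimated loan: ${loan:,.0f}")
print("Spent $65,000 on payroll:", meets_payroll_share(loan, 65_000))
print("Spent $50,000 on payroll:", meets_payroll_share(loan, 50_000))
```

This ignores the finer points of the real rules (eligible cost definitions, coverage period choice, partial forgiveness), so treat it as a back-of-the-envelope check only.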
In the rush to push out the money to businesses that needed it, a lot of concern arose around whether these loans were applied for legitimately and handed out fairly. Clearly they were not in every case, as described in stories like this. So when I found that the SBA and the U.S. Treasury released their data from the PPP loans that had been granted, I felt it was my duty to dig in and see whether any unsavory patterns emerged.
My analysis focused on the set of all 662,515 loans of $150,000 or more that were granted between April 3 and August 8, 2020. For each of these loans, the amount granted was lumped into one of 5 categories. I decided to examine whether we could reliably guess which companies received loans of $1 million or more, based upon the information collected here.
The features in the data that I decided to work with to make these predictions were the state where the business was located, the type of business, the number of jobs reported as being supported by the loan, the date the loan was approved, the lender (grouped into three categories according to how many loans each lender administered), and the NAICS code category. NAICS codes are 6-digit codes indicating the type of industry a business is in. Since there were thousands of these, I simplified each code to its first digit. The digits 1 through 9 now represented broader categories that included many different subtypes of industry, kind of like the Dewey Decimal System. Okay, I may be dating myself with that mention.
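The two simplifications above can be sketched in a few lines of plain Python. The function names and the cutoffs between lender groups are hypothetical, chosen only for illustration; the post does not state the thresholds used in the actual analysis.

```python
from collections import Counter


def naics_category(code):
    """Return the first digit of a NAICS code as a string, or None if missing."""
    text = str(code).strip() if code is not None else ""
    return text[0] if text and text[0].isdigit() else None


def lender_size_groups(lenders, small_max, medium_max):
    """Label each lender 'small', 'medium' or 'large' by how many loans it made.

    small_max and medium_max are hypothetical cutoffs supplied by the caller.
    """
    groups = {}
    for lender, count in Counter(lenders).items():
        if count <= small_max:
            groups[lender] = "small"
        elif count <= medium_max:
            groups[lender] = "medium"
        else:
            groups[lender] = "large"
    return groups


print(naics_category(722511))    # a restaurant code falls in category "7"
print(naics_category("236220"))  # a construction code falls in category "2"

# One entry per loan, naming the (made-up) lender that administered it.
loans_by_lender = ["A"] * 5 + ["B"] * 2 + ["C"]
print(lender_size_groups(loans_by_lender, small_max=1, medium_max=3))
```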
Going in, I was a bit worried that including the feature “Jobs Reported” would constitute leakage of label information into the model. After all, one of the main pieces of information that went onto the PPP loan application was the number of jobs, along with the total monthly payroll calculated for these jobs. However, after examining the relationships between the features and the data labels, I decided that the relationship between Jobs Reported and the range of the loan funded was not as straightforward as one might suspect. See Figure 1.
Most companies receiving loans in the smallest category reported trying to support fewer than 50 jobs. However, even in this category, there were plenty of outliers reporting all the way up to 500 jobs (the maximum allowed to be considered a small business for the purpose of this type of loan). On the other hand, while most of the companies that received loans larger than $5 million reported over 300 jobs, there were some in this category (and all the others) that actually reported 0 jobs! I have not been able to discover what anomaly in the loan process allowed this to happen, or what it meant.
After the data wrangling was done, I split the large data frame into training, validation, and test sets. 87.6% of the companies in the data frame did not receive loans greater than $1 million, so a model that always guessed “no” would be right 87.6% of the time. This was established as the baseline accuracy level for our model to beat.
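As a rough sketch of this step, the snippet below splits a list of rows three ways and computes the majority-class baseline. The 60/20/20 proportions are an assumption (the post does not say what split was used), and the labels are a toy list built to mirror the 87.6% figure rather than the real data.

```python
import random
from collections import Counter


def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels: 1 means a loan over $1 million, 0 means a smaller loan.
labels = [0] * 876 + [1] * 124
train, val, test = three_way_split(labels)
baseline = majority_baseline_accuracy(labels)
print(len(train), len(val), len(test))
print(f"Baseline accuracy: {baseline:.1%}")
```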
I then fit a logistic regression model to the data, using one-hot encoding for the categorical data, simple imputing for missing values, and standard scaling for comparability of model coefficients. The model that emerged produced an accuracy score of 92.5% on both the training and validation data. Its precision (the share of predicted extra large loans that really were extra large) was 83%, but its sensitivity (or recall, the share of actual extra large loans that it caught) was only 49%.
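A pipeline along these lines can be sketched with scikit-learn. This is not the code from the analysis: the column names (`jobs_reported`, `naics_digit`, `state`) are hypothetical, the data is synthetic, and scaling only the numeric column (to avoid centering a sparse one-hot matrix) is a simplification of mine.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the loan table; every value is made up.
rng = np.random.default_rng(0)
n = 400
jobs = rng.integers(0, 501, size=n).astype(float)
# Synthetic label: more jobs make a "large loan" more likely.
y = (jobs + rng.normal(0, 60, size=n) > 300).astype(int)
jobs[rng.random(n) < 0.05] = np.nan  # simulate some missing values
toy = pd.DataFrame({
    "jobs_reported": jobs,
    "naics_digit": rng.choice(list("234578"), size=n),
    "state": rng.choice(["NY", "CA", "TX"], size=n),
})

numeric = ["jobs_reported"]
categorical = ["naics_digit", "state"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
    # Categorical columns: fill gaps with the mode, then one-hot encode.
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(toy, y)

train_accuracy = model.score(toy, y)
print(f"Training accuracy on the toy data: {train_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training rows only, which matters once validation and test sets are scored.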
In order to interpret this model, we’d like to look at the size and sign of the coefficients. We should be careful, however, about the meaning of “prediction” in this case. The PPP program is a short-lived phenomenon, so we’re not really that interested in being able to predict how much a company will squeeze out of the federal government in the future at the taxpayer’s expense. While this would be nice to know, these data are more likely to help us appreciate the fairness of the program as it has unfolded thus far. As mentioned, these loans were applied for and doled out according to the number of FTEs (“full-time equivalents”) a company had on payroll. We know, therefore, that if the world is as it should be, jobs reported should be by far the most influential factor in how big a loan a company gets. We are looking for whether any other factors affected the outcome of these loans in a way that we would not want to see. Figure 2 shows the coefficients from the logistic regression model that were larger (in magnitude) than 0.1.
As we would expect and hope, the number of jobs reported had by far the greatest influence on how large a loan a company received, according to this logistic regression model.
Other than number of jobs reported, a business was also slightly more likely to get a loan of more than $1 million if it was in an industry that falls into the NAICS category starting with 2 (energy, mining, construction, contractors), 5 (media, insurance, finance and other services), 3 (manufacturing of all types), or 4 (wholesalers, retailers, transportation providers and warehousing/storage), or if it was a non-profit. Honorable mention went to any company hailing from the state of New York.
Conversely, the strongest feature other than jobs reported was one that seemed to hurt a company’s chances of getting a very large loan, and that was operating in the entertainment or food service industry. In addition, it seems that as dates marched on, very large loans were less likely to be handed out. Of course, this could have had to do with who was applying for loans at each point in time and the features of these companies.
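For readers who want to reproduce this kind of reading, the filtering and ranking behind a coefficient chart can be sketched as follows. The coefficient values and feature names here are invented for illustration and are not the fitted model's numbers. With standardized inputs, a coefficient is the change in log-odds for a one standard deviation increase in the feature (or, for a one-hot column, for belonging to that category), so exponentiating it gives an odds ratio.

```python
import math


def influential_coefficients(coefficients, threshold=0.1):
    """Keep coefficients with magnitude above the threshold, largest first."""
    kept = [(name, value) for name, value in coefficients.items()
            if abs(value) > threshold]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical coefficients, NOT the values from the model in this post.
example_coefficients = {
    "jobs_reported": 1.9,
    "naics_7": -0.6,
    "naics_2": 0.3,
    "state_WY": 0.02,
}

ranked = influential_coefficients(example_coefficients)
for name, value in ranked:
    print(f"{name:15s} coef={value:+.2f}  odds ratio={math.exp(value):.2f}")
```

A negative coefficient gives an odds ratio below 1, which is how a category like food service and entertainment shows up as lowering the odds of a very large loan.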
While this model fit fairly well and was intuitive to interpret, it is possible that there were some non-linear relationships between the features presented and the size of loans distributed. I also would have liked to see better operating characteristics. While I liked the interpretability of the logistic regression model, I didn’t like its low recall. We are interested in good recall, so that we can detect the cases where loans of > $1 million were granted, and understand the factors that seemed to be related to that outcome.
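To see how an accuracy above 92% can coexist with such low recall, it helps to work through a confusion matrix. The counts below are illustrative only: I chose them per 10,000 companies so that they match the 12.4% share of extra large loans and roughly reproduce the reported precision and recall. They are not the model's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # of predicted large loans, share correct
        "recall": tp / (tp + fn),     # of actual large loans, share detected
    }


# Illustrative counts per 10,000 companies (1,240 actual large loans).
metrics = classification_metrics(tp=608, fp=125, fn=632, tn=8635)
for name, value in metrics.items():
    print(f"{name:9s} {value:.1%}")
```

With these counts the accuracy comes out a little above 92%, close to the reported figure, even though more than half of the extra large loans are missed. Accuracy is dominated by the many easy negatives, which is why recall is the more telling number here.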
On a quest for better sensitivity to extra large loans, I went on to also fit a random forest model. This model produced an accuracy of 98.8% for the training data, and 92.6% for the validation data. It had slightly worse precision than the logistic regression model, at 73%, but slightly better sensitivity, at 63%.
Again, what I was most interested in when interpreting the model was making sure that Jobs Reported was the most influential feature on the outcome of loan size. Of course we are also curious as to what other features seemed to influence the loan outcome, and in what way. Figure 3 shows the permutation importances of the features in this model, i.e. how much the model’s score drops when the values of a given variable are randomly shuffled and the already fitted model is scored again, one variable at a time.
The good news is that again, Jobs Reported had by far the most influence in the outcome of these loans. At the same time, the NAICS category, i.e. the type of industry a business was in, did also seem to have a non-negligible effect, as did the date that the loan was approved.
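The idea behind permutation importance can be sketched in plain Python. As the technique is usually defined, the fitted model is left untouched and simply rescored on data in which one column has been shuffled. The "model" below is a made-up fixed rule standing in for a trained classifier, and the data is random.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=10, seed=0):
    """Mean drop in accuracy when one column is shuffled (model is not refit)."""
    shuffler = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_values = [row[column] for row in rows]
        shuffler.shuffle(shuffled_values)
        permuted = [dict(row, **{column: value})
                    for row, value in zip(rows, shuffled_values)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical stand-in for a fitted model: it only looks at jobs."""
    return int(row["jobs"] >= 250)


data_rng = random.Random(1)
rows = [{"jobs": data_rng.randint(0, 500), "state": data_rng.choice(["NY", "CA"])}
        for _ in range(500)]
labels = [toy_model(row) for row in rows]

jobs_importance = permutation_importance(toy_model, rows, labels, "jobs")
state_importance = permutation_importance(toy_model, rows, labels, "state")
print(f"jobs:  {jobs_importance:.3f}")
print(f"state: {state_importance:.3f}")
```

Shuffling a column the model depends on breaks its link to the label and the score falls; shuffling a column the model ignores changes nothing, so its importance is zero.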
The first thing to note is that across all NAICS categories, the probability of obtaining an extra large loan is not monotonically increasing with Jobs Reported. Rather, each category again has a certain number of companies that reported 0 jobs. This may be the most problematic finding of all, and I am apparently not the first person to notice, as you can see by reading this and this.
In addition, companies in certain industry categories obtained smaller loans for the number of jobs they reported than others did. This interaction effect is most dramatic in the “7” category (sports teams, cultural entities and restaurants), in which a company was not likely to have received a loan of greater than $1 million unless it reported the maximum number of jobs, 500. Contrast that with the “2” category (energy, mining and contractors), in which businesses received these extra large loans at a rate of over 70% once they reported 97 jobs.
We know that we are living in economic times that are unprecedented for our generation and at least one or two before us. The role of the federal government is to step in quickly and dramatically to bolster the economy when disaster hits, and the PPP program was meant to do just that. At the same time, pumping so much money into small businesses so quickly is bound to come at a cost when it comes to the ability to oversee proper distribution of that money.
The code for this analysis can be found here.