This was a class project for MS&E 226 (Small Data) in Fall 2017. The assignment was to analyze a data set using the techniques learned in class. I worked with a classmate to explore and analyze Kickstarter data using R. Our data set contains the statistics of every project (n = 132,609) posted on Kickstarter, the Internet’s largest crowdfunding platform, between the site’s inception in April 2009 and February 2014.
Part 1: Data Exploration
The first step was exploring and describing the data set.
Structure and Covariates:
Our dataset has 22 covariates, including (but not limited to) those below. For comparison with an actual project, the values for Cards Against Humanity’s infamous Kickstarter project are listed in parentheses.
- Numeric variables: goal amount ($4,000), pledged amount ($15,570), number of backers (758), duration of project (30 days), Facebook likes (null — no FB page connected)
- Categorical variables: state (successful), top-level category (Games), subcategory (Tabletop Games), Location (Chicago, IL), Video on Page? (Yes)
Response Variables and Prediction
A very natural binary response variable is the project’s State (generally successful or failed), which is equivalent to whether the ratio Pledged/Goal is at least 1. This data set lends itself naturally to identifying which covariates are most associated with a project’s success or failure, and with the magnitude of the funding received given a success.
In addition to the project’s State, natural continuous response variables would be Pledged (the total amount pledged to the project) and the ratio Pledged/Goal. These are both continuous counterparts of the binary State; they lend themselves to predicting the actual degree of success and a project’s ability to raise more than its goal amount.
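As a small worked example of how the binary and continuous responses relate, the sketch below derives the funding ratio and a two-level State from the Cards Against Humanity figures quoted above. The function names are hypothetical, and the two-level rule is a simplification: the write-up notes that State is only generally successful or failed.

```python
def funding_ratio(pledged, goal):
    """Continuous response: how many times over the goal was funded."""
    return pledged / goal


def state_from_ratio(pledged, goal):
    """Binary response under the simplifying assumption of two states only."""
    return "successful" if funding_ratio(pledged, goal) >= 1 else "failed"


# Cards Against Humanity: goal of $4,000 and $15,570 pledged.
cah_ratio = funding_ratio(15570, 4000)
cah_state = state_from_ratio(15570, 4000)
print(cah_ratio, cah_state)
```

The ratio keeps information that State discards: a project funded at 101% and one funded at 389% are both successful, but they differ a great deal on the continuous scale.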
Analyzing Covariates and Correlations
We suspected that the numeric covariates Updates, Comments, Facebook.Shares, and Facebook.Likes are all strongly positively correlated with the continuous response variable Pledged. The first two indicate strong communication between the entrepreneurs and their audience, and the last two are evidence of a passionate social media audience for the project. Here are scatterplots testing this hypothesis, along with correlations between some of the relevant variables and each other:
Facebook.Shares unsurprisingly is quite correlated with Pledged, as is Backers. Interestingly, the covariates with higher correlations for our continuous response variable Pledged don’t always have a similarly high correlation with the binary response variable State. Part of our research should focus on the relationship between State and Pledged/Goal, and how sensitive they both are to changes in other covariates.
Interestingly, Goal, Rewards, and Duration do not appear to be strongly correlated with either State or Pledged. Intuition might suggest that a smaller Goal amount, more Rewards, or a longer Duration would be associated with greater success; however, the correlations do not support this.
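The correlations discussed above can be sketched with a plain Pearson coefficient. This is an illustration in Python rather than the R code used for the project, and the six data points are made up. Coding State as 0/1, which is an assumption about how it was handled, lets the same function measure its correlation with a numeric covariate.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for six projects (not taken from the real data set).
backers = [5, 20, 50, 120, 300, 758]
pledged = [100, 600, 1400, 3000, 8000, 15570]
state = [0, 0, 1, 0, 1, 1]  # 1 = successful, 0 = failed

r_pledged = pearson(backers, pledged)
r_state = pearson(backers, state)
print(r_pledged, r_state)
```

A covariate can track Pledged closely while tracking State much less closely, because State only records which side of the goal a project landed on.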
Part 2: Analysis
The second step was to analyze the data for the best prediction methods. There are two natural response variables to predict based on the Kickstarter data: Pledged (Continuous Regression) and State (Classification). This step analyzed different models for use in predicting these responses.
Based on our data there is one clear continuous response variable to predict: Pledged. The pledged amount does not directly measure the success of a project; however, predicting it well would both indirectly measure success and help a project founder anticipate how successful the project would be. Because a prediction model of this sort could help guide a project towards success, interpretability is valuable. To make the model more interpretable, the covariates can be normalized before running the regression.
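One common reading of "normalized" is z-scoring, where each covariate is centered and scaled to unit standard deviation so that coefficients are comparable across covariates. The sketch below assumes that reading; the write-up does not say which normalization was used, and the Updates values are hypothetical.

```python
import math


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical Updates counts for five projects.
updates = [0, 2, 5, 9, 14]
z_updates = standardize(updates)
print(z_updates)
```

After this step, a coefficient describes the change in the response per one standard deviation of the covariate rather than per raw unit.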
To predict Pledged, we chose OLS as the regression type for its simplicity, its ease of implementation, and the correlations found in Part 1. OLS tends to predict poorly when the response and the covariates are heavily skewed, since a handful of extreme values can dominate a linear fit, and the continuous variables in the Kickstarter data, including Pledged, were skewed. A natural remedy was to logarithmically transform the response variable Pledged and possibly the covariates. Transforming the response, however, changed the units of the error calculated by the model. While the RMSE in original units could be calculated to compare against models that did not transform the response, for our purposes the RMSE on the transformed scale gave a better estimate of success. The reported RMSE is the root mean square of log(Predicted/Actual), which gives a better sense of how well a specific model performs in percentage terms. More effective models will have an RMSE closer to 1, and models that underpredict will have values less than 1. In addition, the RMSE on training data can be affected by the number of covariates used, so cross-validation was used on all models to obtain a better estimate of each model’s RMSPE. This also allows the RMSPE to serve as an estimate of the expected test error.
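The procedure described above (log-transform, fit OLS, score by the root mean square of log(Predicted/Actual), estimate test error by cross-validation) can be sketched as follows. This is a minimal Python illustration with a single covariate and synthetic data, not the project's R code. The function names, the fold count, and the data-generating numbers are all assumptions. The score is computed literally as described, so a perfect fit scores 0; any further rescaling behind the reported figures is not shown in the write-up.

```python
import math
import random


def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single covariate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def rms_log_ratio(pred_logs, actual_logs):
    """RMS of log(Predicted/Actual), using log(p/a) = log(p) - log(a)."""
    n = len(pred_logs)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred_logs, actual_logs)) / n)


def k_fold_cv(xs, ys, k=5, seed=0):
    """Average held-out error across k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_simple_ols([xs[i] for i in train],
                                          [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in held_out]
        errors.append(rms_log_ratio(preds, [ys[i] for i in held_out]))
    return sum(errors) / k


# Synthetic, skewed data: pledged is roughly proportional to backers.
rng = random.Random(1)
backers = [rng.randint(1, 2000) for _ in range(200)]
pledged = [25 * b * math.exp(rng.gauss(0, 0.3)) for b in backers]

log_backers = [math.log(b) for b in backers]
log_pledged = [math.log(p) for p in pledged]
cv_error = k_fold_cv(log_backers, log_pledged)
print(cv_error)
```

Because each fold is scored on projects the model never saw during fitting, the averaged error is not flattered by the number of covariates in the way training error is.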
As expected given the non-linearity of the data, leaving the continuous covariates untransformed, or only normalizing them, produced significantly larger percentage differences between the predicted and actual values of Pledged.
Across all of the models and covariate transformations we tried, the covariate subset selected by LASSO, with logarithmically transformed covariates, gave the estimated test error closest to 1, at 0.944705.
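To illustrate how LASSO selects a subset of covariates, here is a minimal coordinate-descent sketch in Python. It is not the implementation used in the project, which was done in R. The penalty values and the synthetic data are assumptions chosen so that one covariate carries signal and the other is pure noise.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes y and every column of X have already been centered.
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            scale = sum(c * c for c in col) / n
            # Inner product (scaled by 1/n) of column j with the residual
            # that excludes column j's current contribution.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam) / scale
            delta = new_beta - beta[j]
            resid = [r - c * delta for r, c in zip(resid, col)]
            beta[j] = new_beta
    return beta


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Synthetic data: the first covariate drives y, the second is unrelated.
rng = random.Random(0)
n_obs = 200
signal = center([rng.gauss(0, 1) for _ in range(n_obs)])
noise_col = center([rng.gauss(0, 1) for _ in range(n_obs)])
y = center([2.0 * s + 0.1 * rng.gauss(0, 1) for s in signal])
X = [[s, z] for s, z in zip(signal, noise_col)]

beta_moderate = lasso_coordinate_descent(X, y, lam=0.5)
beta_heavy = lasso_coordinate_descent(X, y, lam=100.0)
print(beta_moderate, beta_heavy)
```

With a moderate penalty the unrelated covariate is set exactly to zero while the informative one is kept (shrunk toward zero), which is what makes LASSO usable for subset selection. With a very large penalty every coefficient is zeroed.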
Part 3: Prediction
The third step was evaluating our prediction methods on held-out test data.
Prediction on Test Set
The best-performing regression model from Mini-Project Part 2 was the OLS model that included these covariates: Top.Category, Updates, Comments, Facebook.Shares, and Backers. Our linear regression test error is based on log(Ŷ/Y), where Ŷ is the predicted and Y the actual value of Pledged. This is a percentage that should be as close to 1.00 (100%) as possible. The table below shows the test error estimated using cross-validation on the training data alongside the actual test error. Our actual test error is 0.3% better (closer to 100%) than our estimated error.
Similarly, we decided to use our best-performing logistic regression model as our classification model to test. This model used a small subset of covariates (Updates, Backers, Goal, and Has.Video) as its only predictors of State. Our benchmarks for model success were (1) a minimal 0-1 loss and (2) a low false positive (Type I error) rate. We estimated that performance on our test set would approach the 0-1 loss of ~10% (accuracy of ~90%) that we saw in validation, and speculated that the false positive rate might be slightly higher on the test set than the 7.57% we had seen before.
Impressively, our speculation about the prediction was right on the money as well. Here are confusion matrices for the original data the model was trained on, and the test set:
The confusion matrix of our model on the test set is remarkably similar to that of the training set. We ultimately saw a 0-1 loss of 9.12% (accuracy of 90.88%), which actually represented a small improvement on the training error. Our Type I error rate increased slightly to 7.74%, which is within the realm of our expectation and not a dramatic increase.
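The two benchmarks can be computed directly from the cells of a confusion matrix, as in the sketch below. The labels are made up for illustration. The false positive rate here is FP / (FP + TN), which is one common definition; the write-up does not state which denominator was used for its Type I error figures.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for 0/1 labels, where 1 = successful."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def zero_one_loss(tp, fp, tn, fn):
    """Fraction of projects classified incorrectly."""
    return (fp + fn) / (tp + fp + tn + fn)


def false_positive_rate(tp, fp, tn, fn):
    """Share of truly failed projects that were predicted successful."""
    return fp / (fp + tn)


# Hypothetical labels for ten projects.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
loss = zero_one_loss(*counts)
fpr = false_positive_rate(*counts)
print(counts, loss, fpr)
```

Accuracy is simply 1 minus the 0-1 loss, which is why the two are quoted as a pair above.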
Part 4: Inference
The fourth and final part of our project was applying inference to our chosen models.
Selecting a Candidate Model to Analyze
For inference, we chose to analyze our top-performing linear regression model for a few reasons, namely:
- That model included a binarized qualitative covariate, Top.Category, where our logistic regression model didn’t. This meant that we would be able to analyze the coefficients for each of the 12 mutually exclusive Top.Category values, like Music and Photography — potentially offering some insight on a project category level we wouldn’t otherwise see.
- We felt that since our linear regression model predicted a specific value (the log of the ratio of pledged dollars to goal dollars), its coefficients could be interpreted intuitively as a percent increase in those units, unlike the less intuitive coefficients of our logistic regression model.
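The percent-change reading of coefficients mentioned in the list above can be made concrete with a short sketch. It assumes the response is on a log scale and that numeric covariates are log-transformed as well. The coefficient values are hypothetical and are not the fitted values from our model.

```python
import math


def pct_change_log_log(coef, pct_increase=1.0):
    """Percent change in the response when a log-transformed covariate
    rises by pct_increase percent, other covariates held fixed."""
    return ((1 + pct_increase / 100) ** coef - 1) * 100


def pct_change_dummy(coef):
    """Percent difference in the response for a 0/1 covariate (such as one
    Top.Category level) relative to the baseline level."""
    return (math.exp(coef) - 1) * 100


# Hypothetical coefficients, for illustration only.
backers_effect = pct_change_log_log(0.8)   # a 1% rise in Backers
category_effect = pct_change_dummy(0.25)   # one category vs. the baseline
print(backers_effect, category_effect)
```

For small coefficients and a 1% change, the effect is approximately the coefficient itself expressed in percent, which is why log-log coefficients are often read directly as percentages.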
Analysis of Chosen Model’s Coefficients
In the chosen linear regression model, R’s default output for lm() flags Updates, Comments, Facebook.Shares, Backers, and some of the categories within Top.Category as significant. Significance results for covariates that we’ve mentioned specifically are highlighted in blue. When we compared these with the significance results on the test data, the four main covariates (Updates, Comments, Facebook.Shares, and Backers) stayed significant, all at the p = 0.001 level. The p-value for Updates grew from 2e-16 to 7.48e-14; since this is still far below 0.001, the change is likely just due to randomness in the data.
Coefficient Confidence Intervals
Given the number of covariates included in the chosen model, running a CI estimation using the bootstrap wasn’t feasible with the full sample size of our test set (n = 23,000), but we retrieved the following results using n = 1,000:
Our one interesting takeaway here was that the computed SE-hats on some Top.Category values were larger than others, notably Technology and Dance. This might indicate that the bootstrap detected more variability in the random data points it selected in this run. The non-binarized covariates saw very small CIs, indicating that the bootstrap produced a prediction consistent with the linear model.
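For readers unfamiliar with the bootstrap, the sketch below shows the general idea for a single regression slope: resample the rows with replacement, refit, and read the standard error and a percentile interval off the refitted estimates. It is a Python illustration on synthetic data, not the R code or the interval method used for the results above. The sample size, replicate count, and seed are arbitrary assumptions.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope for a single covariate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI and bootstrap standard error for the slope."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        # Resample rows (x, y pairs) with replacement and refit.
        rows = [rng.randrange(n) for _ in range(n)]
        estimates.append(ols_slope([xs[i] for i in rows],
                                   [ys[i] for i in rows]))
    estimates.sort()
    lower_idx = int(n_boot * alpha / 2)
    upper_idx = n_boot - lower_idx - 1
    mean_est = sum(estimates) / n_boot
    se_hat = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (n_boot - 1))
    return estimates[lower_idx], estimates[upper_idx], se_hat


# Synthetic data with a true slope of 2.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

slope_hat = ols_slope(xs, ys)
ci_low, ci_high, se_hat = bootstrap_slope_ci(xs, ys)
print(slope_hat, ci_low, ci_high, se_hat)
```

A 0/1 category covariate with few projects in it appears in only a handful of rows of each resample, so its refitted coefficient varies more from replicate to replicate. That is one plausible reason for the larger standard errors on some Top.Category levels.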
Potential Problems in our Analysis
In retrospect, we were able to identify some red flags in our assumptions and in the models we analyzed, along with directions for further work. Specifically:
- Post-selection inference. As mentioned previously, by using LASSO to select covariates in our linear regression model, we by construction selected covariates that would appear more significant. This selection biased our significance results on the training data. Testing the significance of the covariates on the test data is one way to counteract the problem: because that data played no part in the selection, its significance levels are not inflated by it.
- Potential collinearity of certain covariates. We noticed in our linear and logistic regression models that the Backers covariate had a high degree of correlation with Facebook.Shares, Comments, and Updates. This makes intuitive sense: Comments and Facebook.Shares are actions performed by backers, so the number of backers will generally be proportional to the number of actions they take. It’s difficult to separate association from causation here. (Do comments contribute to success, with shares being a side effect? Or do shares contribute to success, with comments being the side effect?) We found that these associations were more interesting than damaging to the model, but couldn’t effectively rule out some degree of collinearity within that set of covariates. Similarly, the binary Facebook.Connected covariate wasn’t directly collinear with Facebook.Shares, but Facebook.Shares did “rely” on it (as Facebook.Shares would have to be 0 whenever Facebook.Connected was 0). This relationship isn’t much different from the mutual exclusivity of the Top.Category covariates, so we judged that including both in the model with all covariates would have minimal negative effect.
- Fit models by category. An interesting use of this data could be to look at the data for a specific Top.Category. We theorize that the set of projects in a category can be thought of as a population of its own. Thinking this way would allow better use of the more detailed Sub.Category covariate, and would be micro-targeted to the use case of a project founder with a specific type of project.
- Fit models by target goal amount, as opposed to pledged amount. This could potentially better fit the data as it would minimize outliers, which exist less frequently for Goal than for Pledged (since founders control it).
- Test for interactions between covariates. As mentioned above, the model does not incorporate how covariates might be associated. For example, Backers and Facebook.Shares are correlated with each other as well as with Pledged. At the moment, the linear regression model treats each covariate’s effect as additive and separate from the others, so a coefficient is read as: a 1% increase in Backers is associated with an x% increase in Pledged, holding the other covariates fixed. Adding interaction terms would help account for how the effect of one covariate depends on the level of another. This might create a more accurate model, and it would also ensure that the interaction is considered when interpreting the model.
- Differentiate user types. Platforms often have “super-users” who have a large impact on the platform. In the case of Kickstarter, these could be project owners with a history of very successful projects. It could be very interesting to separate these users from users who are starting their first project, since super-users potentially have a network already built to support the success of a Kickstarter project. As a result, the actions that make their projects successful may be different from those of a first-time project creator. This is just one example; there are likely many ways in which an experienced user interacts differently. Creating separate models for first-time users and for super-users could potentially make both models more accurate, and might also give interesting insights when comparing the significant covariates of each model.
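Returning to the collinearity concern raised earlier in this list, one simple diagnostic is the variance inflation factor (VIF). For a model with exactly two covariates it reduces to 1 / (1 - r^2), where r is their correlation. The sketch below uses that two-covariate special case with made-up values for Backers and Facebook.Shares. It is an illustration of the diagnostic, not an analysis we ran.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def vif_two_covariates(x1, x2):
    """VIF when the model contains only these two covariates.

    Undefined (division by zero) if the covariates are perfectly correlated.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical values for six projects (not from the real data set).
backers = [5, 20, 50, 120, 300, 758]
fb_shares = [2, 15, 30, 90, 210, 400]

vif = vif_two_covariates(backers, fb_shares)
print(vif)
```

A VIF of 1 means the covariates are uncorrelated. Larger values mean each coefficient's variance is inflated by the overlap between the two, which makes the individual coefficients harder to interpret even when predictions remain good.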