## Purpose:

MS&E 226 Small Data class project in Fall 2017. The project was to use and analyze a data set to practice techniques learned in class. I worked with a classmate to explore and analyze Kickstarter data using R. Our data set contains the statistics of every project (n=132609) posted on Kickstarter, the Internet’s largest crowdfunding platform, between the site’s inception in April 2009 and February 2014.

## Process:

### Part 1: Data Exploration

The first step was exploring and describing the data set.

#### Structure and Covariates:

Our dataset has 22 covariates, including (but not limited to) those below. For comparison with a real project, the values for Cards Against Humanity’s infamous Kickstarter project are listed in parentheses.

- Numeric variables: goal amount ($4,000), pledged amount ($15,570), number of backers (758), duration of project (30 days), Facebook likes (null — no FB page connected)
- Categorical variables: state (successful), top-level category (Games), subcategory (Tabletop Games), Location (Chicago, IL), Video on Page? (Yes)

#### Response Variables and Prediction

A very natural binary response variable is the project’s **State** (generally successful or failed), which corresponds to whether the ratio Pledged/Goal is >= 1. This data set lends itself naturally to identifying which covariates are most associated with a project’s success or failure, and to predicting the magnitude of the funding received given a success.

In addition to the project’s state, natural continuous response variables would be **Pledged** (the total amount pledged to the project) and the ratio of **Pledged/Goal**. These are both continuous variations of the binary state; however, they lend themselves to predicting the degree of success and the ability of a project to raise more than its goal amount.
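As a small worked example of how the binary and continuous responses relate, the sketch below computes Pledged/Goal for the Cards Against Humanity values listed earlier. It is written in Python purely for illustration (the project itself used R), the function names are hypothetical, and it treats **State** as a pure function of the ratio, which the text describes as only generally true.

```python
def pledged_ratio(pledged: float, goal: float) -> float:
    """Return Pledged/Goal, the continuous measure of funding success."""
    if goal <= 0:
        raise ValueError("goal must be positive")
    return pledged / goal


def is_successful(pledged: float, goal: float) -> bool:
    """Assumption: a project counts as successful when Pledged/Goal >= 1."""
    return pledged_ratio(pledged, goal) >= 1.0


# Values for the Cards Against Humanity project quoted in the covariate list.
cah_ratio = pledged_ratio(15570, 4000)
print(f"Pledged/Goal = {cah_ratio:.4f}, successful = {is_successful(15570, 4000)}")
```

The project raised roughly 3.9 times its goal, so it is a success under the binary response and a large value under the continuous one.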

#### Analyzing Covariates and Correlations

We suspected that the numeric covariates **Updates**, **Comments**, **Facebook.Shares**, and **Facebook.Likes** are all strongly positively correlated with the continuous response variable **Pledged**. The first two indicate strong communication between the entrepreneurs and their audience, and the last two are evidence of a passionate social media audience for the project. Here are scatterplots testing this hypothesis, along with correlations between some of the relevant variables and each other:

**Facebook.Shares** unsurprisingly is quite correlated with **Pledged**, as is **Backers**. Interestingly, the covariates with higher correlations with our continuous response variable **Pledged** don’t always have a similarly high correlation with the binary response variable **State**. Part of our research should focus on the relationship between **State** and **Pledged**/**Goal**, and how sensitive they both are to changes in other covariates.

Interestingly, **Goal**, **Rewards**, and **Duration** do not appear to be strongly correlated with either **State** or **Pledged**. Intuition might suggest that a smaller **Goal** amount, more **Rewards**, or a longer **Duration** would be associated with greater success; however, this is unsupported by the correlations.
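The correlations discussed in this section can be sketched as plain Pearson correlations. The example below is illustrative Python rather than the R used in the project, and the numbers are made-up toy values, not rows from the Kickstarter data. When one variable is a 0/1 indicator such as **State**, the same formula gives the point-biserial correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy values for illustration only (not taken from the data set).
backers = [10, 25, 40, 80, 150, 300]
pledged = [500, 900, 2100, 3900, 8000, 15000]
state = [0, 0, 1, 1, 1, 1]  # 1 = successful, 0 = failed

print("cor(Backers, Pledged):", round(pearson(backers, pledged), 3))
print("cor(Backers, State):  ", round(pearson(backers, state), 3))
```

A covariate can track **Pledged** closely while being a weaker signal for **State**, which is the pattern the paragraph above describes.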

### Part 2: Analysis

The second step was to analyze the data for the best prediction methods. There are two natural response variables to predict based on the Kickstarter data: **Pledged** (Continuous Regression) and **State** (Classification). This step analyzed different models for use in predicting these responses.

#### Regression Model

Based on our data, there is one clear continuous response variable to predict: **Pledged**. The **Pledged** amount does not directly measure the success of a project; however, predicting it well would both indirectly measure success and help a project founder anticipate how successful the project would be. Because a prediction model of this sort could help guide a project towards success, an interpretable model is desirable. To make the coefficients comparable across covariates, the covariates can be normalized before running the regression.

To predict, OLS was chosen as the regression type due to its simplicity, its ease of implementation, and the correlations found in Mini-Project Part 1. OLS tends to predict poorly when the response and covariates are heavily skewed, and the continuous Kickstarter variables, including **Pledged**, were skewed. As a result, a natural transformation was to **Logarithmically Transform** the response variable **Pledged** and possibly the covariates. Logarithmically transforming the response variable, however, changed the units of error calculated by the model. While the RMSE in original units could be calculated to compare against models that did not transform the response, for our purposes the reported RMSE gave a better estimate of success.

The reported RMSE was calculated as the root mean square of log(Predicted/Actual). This measurement gives a better sense of how well a specific model will perform in percentage terms. More effective models have a reported value closer to 1, and models that under-predict have values less than 1. In addition, the training RMSE can be affected by the number of covariates used; as a result, **Cross-Validation** was used on all models to obtain a better estimate of each model’s RMSPE, which also serves as an estimate of the expected test error.

As expected given the non-linearity of the data, leaving the continuous covariates untransformed, or only normalizing them, produced significantly larger percentage differences between the predicted and actual values of **Pledged**.

Of all the models and covariate transformations, the subset selected by LASSO with logarithmically transformed covariates gave the estimated test error closest to 1, at 0.944705.
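The combination of a log transform and cross-validation can be sketched as follows. This is a minimal Python illustration, not the project’s R code: it fits a one-covariate OLS on synthetic data generated inside the block, uses `log1p` so that zero values would not break the transform, and reports the plain RMSE of the residuals on the log scale (where 0 is a perfect fit), which is not the ratio-style scale of the reported figure. All function names and the data-generating numbers are assumptions.

```python
import math
import random


def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cv_log_rmse(xs, ys, k=5, seed=0):
    """k-fold cross-validated RMSE; xs and ys are already log-transformed."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    squared_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        intercept, slope = fit_simple_ols(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        # Residuals on the log scale approximate log(Predicted/Actual).
        squared_errors.extend(
            (intercept + slope * xs[i] - ys[i]) ** 2 for i in fold
        )
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic, right-skewed toy data (NOT the Kickstarter data).
rng = random.Random(1)
backers = [max(1, int(rng.lognormvariate(3, 1.2))) for _ in range(200)]
pledged = [40 * b * rng.lognormvariate(0, 0.3) for b in backers]

log_backers = [math.log1p(b) for b in backers]
log_pledged = [math.log1p(p) for p in pledged]
rmse = cv_log_rmse(log_backers, log_pledged)
print("cross-validated log-scale RMSE:", round(rmse, 3))
```

Because every prediction is made on a fold the model did not see during fitting, the resulting error is an estimate of test error rather than training error.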

#### Classification

### Part 3: Prediction

The third step was analyzing our prediction methods.

#### Prediction on Test Set

The best-performing regression model from Mini-Project Part 2 was the OLS model that included these covariates: **Top.Category**, **Updates**, **Comments**, **Facebook.Shares**, and **Backers**. Our linear regression test error is based on log(Predicted/Actual). This is a percentage that should be as close to 1.00 (100%) as possible. Shown in the table below are the test error estimated using cross-validation on the training data and the actual test error. Our actual test error is 0.3% better (closer to 100%) than our predicted error.

Similarly, we decided to use our best-performing logistic regression model as our classification model to test. This model used a small subset of covariates (**Updates**, **Backers**, **Goal**, and **Has.Video**) as its only predictors of **State**. Our benchmarks for model success were (1) a minimal **0-1 loss** and (2) a low false positive (**Type I error**) rate. We estimated that performance on our test set would approach the 0-1 loss of ~10% (accuracy of ~90%) that we saw in validation, and speculated that the false positive rate on the test set might be slightly higher than the 7.57% we saw in validation.

Impressively, our speculation about the prediction was right on the money as well. Here are confusion matrices for the original data the model was trained on, and the test set:

The confusion matrix of our model on the test set is remarkably similar to that of the training set. We ultimately saw a **0-1 loss** of 9.12% (accuracy of **90.88%**), which actually represented a small improvement on the training error. Our Type I error rate increased slightly to **7.74%**, which was within our expectations and not a dramatic increase.
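The two benchmarks can be computed directly from a confusion matrix. The Python sketch below is illustrative only: the labels are toy values, the function names are hypothetical, and it assumes the false positive rate is defined as false positives divided by all actual failures, which the write-up does not state explicitly.

```python
def confusion_counts(actual, predicted):
    """Count TP/FP/FN/TN for binary labels (1 = successful, 0 = failed)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def zero_one_loss(counts):
    """Fraction of misclassified projects (1 - accuracy)."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total


def false_positive_rate(counts):
    """Assumed definition: FP / (FP + TN), i.e. failures predicted as successes."""
    return counts["fp"] / (counts["fp"] + counts["tn"])


# Toy labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)
print("0-1 loss:", zero_one_loss(counts))
print("false positive rate:", round(false_positive_rate(counts), 4))
```

Running the same functions on the training and test predictions gives the side-by-side comparison described above.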

### Part 4: Inference

The fourth and final part of our project was applying inference to our chosen models.

#### Selecting a Candidate Model to Analyze

For inference, we chose to analyze our top-performing linear regression model for a few reasons, namely:

- That model included a binarized qualitative covariate, **Top.Category**, where our logistic regression model didn’t. This meant that we would be able to analyze the coefficients for each of the 12 mutually exclusive **Top.Category** values, like **Music** and **Photography**, potentially offering some insight on a project category level we wouldn’t otherwise see.
- We felt that since our linear regression model was a prediction of a specific value (the log of the pledged dollar-goal dollar ratio), the coefficients could be interpreted intuitively as a percent increase of those units, unlike the less intuitive coefficients of our logistic regression model.

#### Analysis of Chosen Model’s Coefficients

In the chosen linear regression model, R’s default output for the `lm()` function considers **Updates**, **Comments**, **Facebook.Shares**, **Backers**, and some of the categories within **Top.Category** significant. Significance results for covariates that we’ve mentioned specifically are highlighted in **blue**. When comparing the significance results on the test data, the four main covariates **Updates**, **Comments**, **Facebook.Shares**, and **Backers** stayed significant. All maintained p-values significant at the p = 0.001 level. The p-value for **Updates** grew from 2e-16 to 7.48e-14; however, as this is still far below 0.001, the change is likely just due to the randomness of the data.

#### Coefficient Confidence Intervals

Given the number of covariates included in the chosen model, running a CI estimation using the bootstrap wasn’t feasible with the full sample size of our test set (*n* = 23,000), but we retrieved the following results using *n* = 1000:

Our one interesting takeaway here was that the computed SE-hats on some **Top.Category** values were larger than others, notably **Technology** and **Dance**. This might indicate that the bootstrap detected more variability in the random data points it selected in this run. The non-binarized covariates saw very small CIs, indicating that the bootstrap produced estimates consistent with the linear model.
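For readers unfamiliar with the technique, a bootstrap standard error and confidence interval for a regression coefficient can be sketched as below. This is an illustrative Python example on synthetic data with a single covariate, not the project’s R code or its multi-covariate model; the percentile interval is one common choice and is an assumption here, since the write-up does not say which bootstrap interval was used.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope for a single covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("covariate is constant in this sample")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Return (bootstrap SE, percentile CI) for the OLS slope."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        # Resample rows with replacement, keeping (x, y) pairs together.
        sample = [rng.randrange(n) for _ in range(n)]
        try:
            slopes.append(ols_slope([xs[i] for i in sample],
                                    [ys[i] for i in sample]))
        except ValueError:
            continue  # skip a degenerate resample and draw another
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.stdev(slopes), (lower, upper)


# Synthetic toy data (NOT the Kickstarter data): true slope of 2.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

se_hat, (ci_low, ci_high) = bootstrap_slope_ci(x, y)
print("slope estimate:", round(ols_slope(x, y), 3))
print("bootstrap SE:", round(se_hat, 4))
print("95% percentile CI:", (round(ci_low, 3), round(ci_high, 3)))
```

Each resample refits the model, so the cost grows with both the sample size and the number of covariates, which is why subsampling to a smaller *n* makes the procedure tractable.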

#### Potential Problems in our Analysis

In retrospect, we were able to identify some red flags in our assumptions and in the models we’ve analyzed. Specifically, these fall into a couple of main categories:

- Post-selection inference. As mentioned previously, by using LASSO to select covariates in our linear regression model, we *definitionally* selected covariates that would have larger significance. Testing the significance of covariates on the test data is a method for counteracting our post-selection inference problems. Our covariate selection biased our significance results on the training data; however, by testing on an additional data set, significance levels free of that selection bias can be obtained.
- Potential collinearity of certain covariates. We noticed in our linear and logistic regression models that the **Backers** covariate had a high degree of correlation with **Facebook.Shares**, **Comments**, and **Updates**. This is intuitively true, since **Comments** and **Facebook.Shares** are actions performed by **Backers**, so the number of backers will generally be proportional to the number of actions they make. It’s difficult to separate association from causation here (do comments contribute to success, with shares being a side effect? Or do shares contribute to success, with comments being the side effect?). We found that these associations were more interesting than damaging to the model, but couldn’t effectively rule out some degree of collinearity between that set of covariates. Similarly, the binary **Facebook.Connected** covariate wasn’t directly collinear with **Facebook.Shares**, but **Facebook.Shares** did “rely” on it (as the **Facebook.Shares** covariate would have to be 0 when **Facebook.Connected** was also 0). This relationship isn’t much different from the mutual exclusivity of the **Top.Category** covariates, so we considered that including both in the model with all covariates would have minimal negative effect.

### Lessons Learned

- Fit models by category. An interesting use of this data could be to look at the data for a specific **Top.Category**. We theorize that a set of projects in a category can be thought of as a population of its own. Thinking this way would allow better use of the more-detailed **Sub.Category** covariate, and would be micro-targeted to the use case of a project founder with a specific type of project.
- Fit models by target goal amount, as opposed to pledged amount. This could potentially better fit the data as it would minimize outliers, which exist less frequently for **Goal** than for **Pledged** (since founders *control* it).
- Test for interactions between covariates. As mentioned above, the model does not incorporate how covariates might be associated. For example, **Backers** and **Facebook.Shares** are correlated with each other as well as being correlated with **Pledged**. At the moment, the linear regression model assumes the covariates are independent, so a coefficient is interpreted as: a 1% increase in **Backers** is associated with an *x*% increase in **Pledged**. Adding interaction terms would help account for how movement in one covariate is associated with movement in another covariate. This might create a more accurate model, and it would also help ensure that the interaction is being considered when interpreting the model.
- Differentiate user types. Platforms often have “super-users”, who have a large impact on the platform. In the case of Kickstarter, these could be project owners who have a history of very successful projects. It could be very interesting to separate these users from users who are starting their first project, as “super-users” potentially have a network already built to support the success of a Kickstarter project. As a result, the actions that make their project successful may be different from those of a first-time project starter. This is just one example; there are likely many ways in which an experienced user will interact differently. Creating models for both a first-time user and a super-user could potentially make both models more accurate, and might also give interesting insights when comparing the significant covariates of each model.