
Whisking Through Data for Scrumptious Calorie Predictions

Authors: Souma Mitra (souma@umich.edu) and Abby VeCasey (avecasey@umich.edu)

Introduction

The “Recipes and Ratings” dataset contains thousands of recipes from food.com. Each row represents a review of a particular recipe, so a recipe with multiple reviews appears in multiple rows. Our cleaned dataset has 219393 rows and 16 columns. We use it to ask how accurately we can predict the number of calories in a recipe given other recipe details, and which factors best predict the number of calories. To answer this, we use the following columns: “name”, the recipe name; “id”, a numerical id that identifies each recipe; “minutes”, the number of minutes it takes to complete the recipe; “contributor_id”, the unique id of the person who submitted the recipe; “n_steps”, the number of steps in the recipe; “n_ingredients”, the number of ingredients in the recipe; “average_rating”, the average rating given to the recipe; and “calories”, “total fat”, “sugar”, “sodium”, “protein”, “saturated fat”, and “carbohydrates”, the amounts of each of these in the recipe. Answering this question can give recipe writers a better understanding of how many calories are in each of their recipes without having to calculate it.


Data Cleaning and Exploratory Data Analysis

Data Cleaning

To begin cleaning the data, we merged the recipes and ratings datasets, as the dataset suggested. We then changed all ratings of 0 to np.nan. Ratings usually run from 1 to 5, so a rating of 0 indicates that no rating was given. Next, we added an “average_rating” column, which is the average rating per recipe. We then split the “nutrition” column into separate columns for each nutrition statistic, for example “calories” and “total fat”, so that we could use these variables in our analysis and investigate relationships between them that were previously hidden inside a single column. We converted each of these new columns from type “object” to type “float”, because they hold numerical values and need to be represented that way for charts like scatter plots and histograms and for the rest of our analysis. Finally, we dropped columns unnecessary for our analysis: “recipe_id”, because it was a duplicate of “id”, along with “date submitted”, “steps”, “description”, “ingredients”, “tags”, and “review”. Below are the first 5 rows of our cleaned dataset.

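| name | id | minutes | contributor_id | n_steps | n_ingredients | user_id | rating | average_rating | calories | total fat | sugar | sodium | protein | saturated fat | carbohydrates |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 brownies in the world best ever | 333281 | 40 | 985201 | 10 | 9 | 386585 | 4 | 4 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 |
| 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 12 | 11 | 424680 | 5 | 5 | 595.1 | 46 | 211 | 22 | 13 | 51 | 26 |
| 412 broccoli casserole | 306168 | 40 | 50969 | 6 | 9 | 29782 | 5 | 5 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 |
| 412 broccoli casserole | 306168 | 40 | 50969 | 6 | 9 | 1.19628e+06 | 5 | 5 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 |
| 412 broccoli casserole | 306168 | 40 | 50969 | 6 | 9 | 768828 | 5 | 5 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 |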
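
The cleaning steps described in this section could look roughly like the following pandas sketch. It runs on made-up toy rows, not the real data, and it assumes the “nutrition” column is stored as a bracketed string of seven numbers in the order calories, total fat, sugar, sodium, protein, saturated fat, carbohydrates. The frame names `recipes` and `reviews` are illustrative, not taken from our notebook.

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-ins for the raw recipes and reviews tables (values are made up).
recipes = pd.DataFrame({
    "id": [1, 2],
    "name": ["toy brownies", "toy casserole"],
    # Assumption: nutrition is a string that looks like a Python list.
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
                  "[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]"],
})
reviews = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "user_id": [10, 11, 12],
    "rating": [4, 0, 5],
})

# One row per review; a left merge keeps recipes that have no reviews.
merged = recipes.merge(reviews, left_on="id", right_on="recipe_id", how="left")

# A rating of 0 means "no rating given", so treat it as missing.
merged["rating"] = merged["rating"].replace(0, np.nan)

# Average rating per recipe, repeated on every review row of that recipe.
merged["average_rating"] = merged.groupby("id")["rating"].transform("mean")

# Split the nutrition string into one float column per statistic.
nutrition_cols = ["calories", "total fat", "sugar", "sodium",
                  "protein", "saturated fat", "carbohydrates"]
parsed = merged["nutrition"].apply(ast.literal_eval)
nutrition = pd.DataFrame(parsed.tolist(), index=merged.index,
                         columns=nutrition_cols).astype(float)

# Drop the duplicate key and the original string column.
cleaned = pd.concat(
    [merged.drop(columns=["recipe_id", "nutrition"]), nutrition], axis=1
)
```

Using `transform("mean")` keeps the one-row-per-review shape, which matches the repeated “average_rating” values in the table above.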

Univariate Analysis

Next, we began exploratory data analysis, creating a histogram of the distribution of calories for recipes with fewer than 1000 calories. From the graph below, we can see that the distribution of calories is skewed to the right, and most recipes in this range have between 150 and 300 calories.

Bivariate Analysis

We then explored the relationship between calories and the number of ingredients in a recipe, splitting recipes into those with fewer than 9 ingredients and those with 9 or more. From the boxplot below, we can see that the median number of calories for recipes with fewer than 9 ingredients is below the median for recipes in the other group, which suggests that recipes with more ingredients tend to have more calories.

We also explored the relationship between calories and preparation time, treating time as a categorical variable by splitting recipes into a “less than 300 minutes” category and a “more than 300 minutes” category. From the boxplot below, we can see that recipes that take over 300 minutes (5 hours) to complete tend to have more calories than recipes that take less than 5 hours. This will be important later on for our prediction model.
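
As a rough illustration of the two splits, the sketch below builds both binary columns on made-up toy rows and compares median calories per group. The column names match the feature names used in our final model. Putting a recipe of exactly 300 minutes in the lower group is an assumption, since the text does not say where that boundary falls.

```python
import pandas as pd

# Toy data (made up); the real analysis used the cleaned recipes dataset.
df = pd.DataFrame({
    "minutes": [40, 45, 20, 600, 480],
    "n_ingredients": [9, 11, 6, 4, 12],
    "calories": [338.4, 595.1, 150.0, 420.0, 810.0],
})

# Binary features for the two splits described above.
df["less_than_9_ingredients"] = df["n_ingredients"] < 9
# Assumption: a recipe of exactly 300 minutes falls in the lower group.
df["300_mins_or_less"] = df["minutes"] <= 300

# Median calories per group, the same comparison the boxplots make visually.
median_by_ingredients = df.groupby("less_than_9_ingredients")["calories"].median()
median_by_time = df.groupby("300_mins_or_less")["calories"].median()
```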

Interesting Aggregates

This grouped table shows, for each of the top 50 recipe contributors, the number of recipes they posted along with the average recipe details and average nutrition information for those recipes. This helps us understand the nature of the recipes that the top 50 authors post, and to determine which authors may be the most successful.

total_recipes average_minutes average_steps average_n_ingredients average_rating average_calories average_total_fat average_sugar average_sodium average_protein average_saturated_fat average_carbohydrates
3060 59.2291 10.9562 9.65065 4.7874 417.687 33.4444 74.702 28.9265 23.6765 34.1755 14.802
701 95.4779 11.5977 9.98146 4.65391 482.653 33.5264 63.3053 30.1954 40.4308 46.1683 16.174
751 56.5499 9.27963 8.78828 4.45075 420.367 31.3422 46.719 32.6724 43.9294 38.7763 11.9148
868 126.018 11.2442 10.8203 4.7285 236.902 14.826 52.0991 13.7638 17.5956 16.6233 8.74078
1795 43.429 8.26797 7.84345 4.78293 319.64 23.3939 39.7788 31.8396 31.1671 27.9426 9.11866
2436 97.3001 9.35591 9.57348 4.78781 437.114 38.8612 63.5739 28.9606 33.9901 42.2496 11.2278
802 35.793 5.92519 8.94638 4.75231 357.314 25.101 54.3017 20.8254 29.9601 27.7469 11.0249
699 73.4964 8.80114 8.90701 4.77697 312.614 21.3419 54.7239 23.8913 21.2432 25.0057 11.2947
1091 66.7525 7.50596 10.4088 4.84694 508.263 37.9982 98.5866 28.4876 30.7754 57.2374 17.8863
1614 72.3234 10.6945 9.0316 4.70194 376.489 29.9777 42.7187 25.4455 35.2534 43.2007 10.2169
2310 60.4697 9.65758 8.34545 4.82547 434.246 32.0208 93.1442 21.6277 25.8442 40.4463 15.4874
1134 50.8986 6.85626 7.34656 4.76256 277.613 23.94 29.4374 30.3598 17.7637 18.7725 8.13668
705 42.078 7.50355 8.8766 4.82602 380.924 25.9589 79.2766 16.7745 22.2681 34.6241 15.3901
1143 50.5459 9.78828 9.31584 4.78699 344.828 29.0236 39.8215 43.6089 30.7095 31.2441 8.97725
995 43.6151 8.36683 8.77588 4.74174 592.648 44.0814 58.2985 32.3206 55.4151 54.0844 17.8553
2754 112.537 12.5425 10.8951 4.84972 559.235 48.9299 86.459 30.6612 41.8217 64.8998 15.0298
1111 2078.74 13.0945 8.75788 4.81395 518.957 36.8911 51.0513 27.9793 37.2529 45.541 17.6562
994 52.5835 9.93662 10.1217 4.83196 441.355 34.994 60.8219 26.4799 37.7223 38.2887 13.3863
725 50.8193 9.79586 9.52 4.76371 267.761 17.7531 26.3697 20.9628 39.2566 19.1628 6.22207
788 75.302 9.28553 8.87944 4.50738 380.978 26.0279 58.5431 29.8668 32.1701 25.8553 13.1713
1193 44.7385 6.58592 8.69573 4.78718 381.246 26.2506 46.1928 19.6471 39.0444 31.3546 12.0075
1035 71.1208 10.4126 8.29565 4.77345 431.053 32.5556 50.57 28.7382 40.1411 42.3275 12.2464
741 74.2011 11.4062 9.88529 4.76122 554.227 42.0553 127.615 27.9433 37.6329 61.0189 19.0958
758 63.1636 12.2032 9.6372 4.61769 483.157 29.3087 81.9472 39.0989 42.8391 38.1253 18.6095
891 100.516 7.16611 7.51515 4.60748 383.474 29.2963 39.4714 31.7127 39.1762 38.5679 10.523
2368 41.0249 8.86318 8.64105 4.76776 359.764 26.5933 54.7382 18.5988 28.5722 33.0325 10.9848
1553 79.0547 11.5325 9.566 4.76684 562.079 44.3786 77.2968 38.0953 41.49 52 17.5274
849 82.6031 9.42167 9.9788 4.68005 470.633 39.8657 61.9847 39.4782 40.3498 52.4523 12.6207
2503 33.6336 6.0004 7.82621 4.80303 263.465 15.8326 74.1454 10.2089 15.1898 20.3947 11.4914
1016 57.3465 11.9409 9.20374 4.75181 440.324 30.2913 113.75 36.315 31.5974 43.6722 16.4429
1680 37.5167 10.3125 8.60417 4.85226 388.325 33.2821 50.0351 27.272 30.0405 45.4149 9.99107
1588 52.0082 7.5699 7.33312 4.70259 355.508 29.7469 47.1159 17.8778 34.8268 33.665 8.58501
969 53.7162 12.2425 9.95666 4.8307 488.985 38.0175 43.6264 37.1238 46.1868 44.3849 13.4974
863 60.5261 10.6477 9.73812 4.73609 382.113 30.4936 44.5017 20.4137 31.219 34.9143 11.102
1060 36.9934 8.65 8.03774 4.73159 394.951 28.9689 60.4943 23.084 30.5547 36.2594 12.5651
872 95.9771 8.07913 6.93693 4.8392 320.869 25.82 53.5665 13.0069 20.9622 28.4851 7.88188
1572 32.9275 7.97519 7.83842 4.77626 327.606 21.9364 63.4663 18.6202 20.9835 24.5344 12.9148
905 62.1293 9.64862 8.3768 4.76712 368.702 28.2575 54.0398 27.2608 32.3536 35.2287 10.8729
863 33.3453 8.6686 8.63963 4.75957 433.118 27.9571 54.9733 14.5203 24.3708 34.6107 18.3882
743 78.0754 9.1319 8.16555 4.79184 350.871 24.5976 67.179 9.66622 21.6393 21.852 13.5222
904 70.6925 8.68584 8.02655 4.74286 458.836 30.4524 50.8827 23.7113 33.5199 39.5221 17.0918
1711 135.907 6.80538 8.9287 4.7199 452.228 34.6803 61.6283 36.0158 44.8328 43.5786 12.7551
1236 55.962 8.04126 8.48301 4.76789 275.127 18.2856 55.5129 13.5049 19.2087 23.6545 10.4256
1019 55.4622 10.841 9.08342 4.71728 283.828 20.2777 31.0913 18.8744 27.577 19.7733 8.34249
762 32.5853 7.79265 7.75853 4.7038 256.524 17.4528 40.6732 13.9081 22.6745 21.2743 8.51706
759 48.8498 9.64032 9.82082 4.67647 339.621 27.5415 25.361 20.3953 36.0487 32.5652 8.01318
876 55.0719 8.9726 9.65183 4.7623 289.536 19.7203 41.5925 19.7637 29.6929 26.1313 8.71689
805 55.0422 8.38509 9.07205 4.78834 332.084 27.2099 29.636 45.8373 27.4646 32.1615 8.82484
1867 84.7381 11.045 9.74879 4.81743 463.962 35.5865 76.7017 42.8956 35.4253 44.4124 14.4649
738 32.393 10.4187 8.59485 4.77496 445.8 39.3008 59.1369 20.1125 32.5583 53.8482 12.1084
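
A table like this can be produced with a pandas groupby and named aggregation. The sketch below uses made-up toy rows and only three of the aggregate columns. It makes two assumptions that may differ from what we actually did: it counts each recipe once by dropping duplicate review rows first, and it ranks “top” contributors by recipe count.

```python
import pandas as pd

# Toy review-level rows (made up): recipe 1 has two reviews.
df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "contributor_id": [100, 100, 100, 200],
    "minutes": [40, 40, 20, 60],
    "calories": [100.0, 100.0, 300.0, 500.0],
})

# Assumption: keep one row per recipe so reviews do not inflate the counts.
per_recipe = df.drop_duplicates(subset="id")

by_contributor = (
    per_recipe.groupby("contributor_id")
    .agg(
        total_recipes=("id", "count"),
        average_minutes=("minutes", "mean"),
        average_calories=("calories", "mean"),
    )
    # Assumption: "top" contributors are those with the most recipes.
    .sort_values("total_recipes", ascending=False)
    .head(50)
)
```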

Missing Value Imputation

We did not fill in missing values. When we changed ratings of 0 to null, we learned that these reviews most likely had no rating at all, so there was no meaningful value to impute. We also did not use the ratings in our analysis, and ratings were the only column we worked with that had more than one missing value. For the columns we did use, we dropped the rows where there were missing values and moved forward with our analysis.
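
Dropping rows only where the modeling columns are missing, and leaving missing ratings alone, can be sketched as follows. The toy frame and the `model_cols` list are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing calorie value and one missing rating.
df = pd.DataFrame({
    "calories": [100.0, np.nan, 300.0],
    "sugar": [5.0, 7.0, 9.0],
    "rating": [np.nan, 4.0, 5.0],
})

# Only the columns used for modeling decide whether a row is dropped.
model_cols = ["calories", "sugar"]
complete = df.dropna(subset=model_cols)
```

Because “rating” is not in `model_cols`, the first row survives even though its rating is missing.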


Framing a Prediction Problem

We will be predicting the number of calories a recipe has based on its nutrition information. This is a regression problem. It is useful because it allows recipe contributors to estimate the calories in a recipe from the nutrition information, and to see which factors contribute most to the number of calories. To evaluate the performance of this model, we will be using MSE (Mean Squared Error).
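
For reference, MSE is the mean of the squared differences between actual and predicted values, and its square root (RMSE) is back in the original units, here calories. A small worked sketch with made-up numbers:

```python
import numpy as np

# Made-up actual and predicted calorie values.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Errors are 10, -10 and 30, so the squared errors are 100, 100 and 900.
squared_errors = (y_pred - y_true) ** 2

mse = squared_errors.mean()   # in squared calories
rmse = np.sqrt(mse)           # in calories
```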

Baseline Model

For this model, we fit a linear regression of the number of calories on total fat, sugar, sodium, protein, saturated fat, and carbohydrates, all of which are quantitative variables. We also standardized each feature so that the coefficients are on a common scale, which lets us compare their effects and see which is most important in predicting calories. The MSE of this model was 1471.969. Because MSE is measured in squared calories, its square root is the better guide to the typical size of a prediction error: an RMSE of about 38.4 calories, rather than an error of over 1000 calories. We treat this as a baseline to improve on.
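
A baseline of this shape can be sketched with a scikit-learn pipeline that standardizes the features and then fits a linear regression. The data below is synthetic, generated only so the example runs. The train/test split and the coefficients used to generate `y` are assumptions for illustration, not our actual setup or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["total fat", "sugar", "sodium", "protein",
            "saturated fat", "carbohydrates"]

# Synthetic stand-in data (not the recipes dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, len(features)))
y = X @ np.array([9.0, 1.0, 0.1, 4.0, 0.5, 4.0]) + rng.normal(0, 5, size=200)

# Assumption: evaluate on a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize, then fit ordinary least squares.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, baseline.predict(X_test))

# Coefficients on standardized features can be compared across features.
coefs = dict(zip(features, baseline[-1].coef_))
```

With standardized inputs, a larger absolute coefficient means a one-standard-deviation change in that feature moves the prediction more.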

Final Model

To improve upon our model, we added the two categorical features from our bivariate analysis and used a Lasso regression to help with variable selection. We included the variable 300_mins_or_less because we saw a difference in the distribution of calories based on this distinction. Similarly, we added the variable less_than_9_ingredients because we saw a difference in calorie distribution between recipes with fewer than 9 ingredients and recipes with 9 or more. We tuned our hyperparameters with cross-validation. The MSE of this model was 1469.696, a slight improvement over our baseline model (a reduction of about 2.3, or roughly 0.15%).
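
One way to set up a model like this is a scikit-learn pipeline with a Lasso step whose regularization strength is chosen by cross-validated grid search. The sketch below uses synthetic data. The alpha grid, the 5-fold setting, and the train/test split are assumptions for illustration, not the values we used.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the recipes dataset).
rng = np.random.default_rng(0)
n = 300
nutrition = rng.uniform(0, 100, size=(n, 6))
minutes = rng.integers(5, 600, size=n)
n_ingredients = rng.integers(2, 20, size=n)

# Six nutrition columns plus the two binary features from the text.
X = np.column_stack([
    nutrition,
    minutes <= 300,      # 300_mins_or_less
    n_ingredients < 9,   # less_than_9_ingredients
]).astype(float)
y = nutrition @ np.array([9.0, 1.0, 0.1, 4.0, 0.5, 4.0]) + rng.normal(0, 5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumption: a small illustrative grid of regularization strengths.
alphas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    param_grid={"lasso__alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["lasso__alpha"]
test_mse = mean_squared_error(y_test, search.predict(X_test))
```

Lasso can shrink coefficients of unhelpful features all the way to zero, which is why it doubles as a variable selection step.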