Whisking Through Data for Scrumptious Calorie Predictions

Authors: Souma Mitra (souma@umich.edu) and Abby VeCasey (avecasey@umich.edu)

Introduction

The “Recipes and Ratings” dataset contains thousands of recipes from food.com. Each row represents a review of a particular recipe, so a recipe with multiple reviews appears in multiple rows. Our cleaned dataset has 219393 rows and 16 columns. We use it to ask how accurately we can predict the number of calories in a recipe from other recipe details, and which factors best predict calories. To answer this, we use the following columns: “name”, the recipe name; “id”, a numerical id that identifies each recipe; “minutes”, the number of minutes the recipe takes to complete; “contributor_id”, the unique id of the person who submitted the recipe; “n_steps”, the number of steps in the recipe; “n_ingredients”, the number of ingredients in the recipe; “average_rating”, the average rating given to the recipe; and “calories”, “total fat”, “sugar”, “sodium”, “protein”, “saturated fat”, and “carbohydrates”, the amounts of each of these in the recipe. Answering this question can give recipe writers a better sense of how many calories are in their recipes without having to calculate the number themselves.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

To begin cleaning the data, we merged the recipes and ratings datasets, as the dataset suggested. We then replaced all ratings of 0 with np.nan: ratings normally run from 1 to 5, so a rating of 0 indicates that no rating was given. Next, we added an “average_rating” column holding the average rating per recipe. We then split the “nutrition” column into a separate column for each nutrition statistic, such as “calories” and “total fat”, which let us use these values as variables and investigate relationships we could not examine before. Because these values are numerical, we converted the new columns from type “object” to type “float”, which also let us visualize them with charts such as scatter plots and histograms. Finally, we dropped the columns that were unnecessary for our analysis: “recipe_id” (a duplicate of “id”), “date submitted”, “steps”, “description”, “ingredients”, “tags”, and “review”. Below are the first 5 rows of our cleaned dataset.

name	id	minutes	contributor_id	n_steps	n_ingredients	user_id	rating	average_rating	calories	total fat	sugar	sodium	protein	saturated fat	carbohydrates
1 brownies in the world best ever	333281	40	985201	10	9	386585	4	4	138.4	10	50	3	3	19	6
1 in canada chocolate chip cookies	453467	45	1848091	12	11	424680	5	5	595.1	46	211	22	13	51	26
412 broccoli casserole	306168	40	50969	6	9	29782	5	5	194.8	20	6	32	22	36	3
412 broccoli casserole	306168	40	50969	6	9	1.19628e+06	5	5	194.8	20	6	32	22	36	3
412 broccoli casserole	306168	40	50969	6	9	768828	5	5	194.8	20	6	32	22	36	3
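
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on two tiny made-up tables, not our exact notebook code: the left merge, the string format of the “nutrition” column, and the helper name `clean` are assumptions made for the example.

```python
import ast

import numpy as np
import pandas as pd

NUTRITION_COLS = ["calories", "total fat", "sugar", "sodium",
                  "protein", "saturated fat", "carbohydrates"]

# Toy stand-ins for the two raw tables (values are illustrative only).
recipes = pd.DataFrame({
    "name": ["toy brownies", "toy casserole"],
    "id": [1, 2],
    # Assumed format: the nutrition values stored as a list-like string.
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
                  "[194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0]"],
})
ratings = pd.DataFrame({
    "recipe_id": [1, 2, 2, 2],
    "rating": [4, 5, 0, 3],
})


def clean(recipes, ratings):
    """Hypothetical helper mirroring the cleaning steps in the text."""
    # Assumption: a left merge, so every recipe row is kept.
    merged = recipes.merge(ratings, how="left",
                           left_on="id", right_on="recipe_id")
    # A rating of 0 means "no rating", so treat it as missing.
    merged["rating"] = merged["rating"].astype(float).replace(0.0, np.nan)
    # Average rating per recipe, repeated on each of that recipe's rows.
    merged["average_rating"] = merged.groupby("id")["rating"].transform("mean")
    # Split the nutrition string into one float column per statistic.
    parsed = merged["nutrition"].apply(ast.literal_eval)
    nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS,
                             index=merged.index).astype(float)
    merged = pd.concat([merged, nutrition], axis=1)
    # Drop columns that are no longer needed.
    return merged.drop(columns=["recipe_id", "nutrition"])


cleaned = clean(recipes, ratings)
print(cleaned)
```

Computing the average with `transform("mean")` keeps one row per review while still attaching the recipe-level average to each row, which matches the layout of the table above.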

Univariate Analysis

Next, we began exploratory data analysis, creating a histogram of the distribution of calories for recipes with fewer than 1000 calories. From the graph below, we can see that the distribution of calories is skewed to the right, and most recipes in this range have between 150 and 300 calories.

Bivariate Analysis

We then explored the relationship between calories and the number of ingredients in a recipe, splitting recipes into those with fewer than 9 ingredients and those with 9 or more. From the boxplot below, we can see that the median number of calories for recipes with fewer than 9 ingredients is below the median for recipes in the other group, which suggests that recipes with more ingredients tend to have more calories.

We also explored the relationship between calories and the amount of time it took to complete the recipe as a categorical variable, splitting recipes into either a “less than 300 minutes” category, or a “more than 300 minutes” category. From the boxplot below, we can see that recipes that take over 300 minutes (5 hours) to complete tend to have more calories than recipes that take less than 5 hours to complete. This will be important later on for our prediction model.
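
The two groupings above can be expressed as boolean columns. The sketch below uses a small made-up table; the column names `300_mins_or_less` and `less_than_9_ingredients` are the ones used later in our final model, while the helper name and the choice to place recipes of exactly 300 minutes in the “or less” group are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "minutes": [40, 45, 360, 20],
    "n_ingredients": [9, 11, 6, 4],
    "calories": [138.4, 595.1, 700.0, 120.0],
})


def add_binary_features(df):
    """Hypothetical helper: add the two categorical splits as booleans."""
    out = df.copy()
    # Assumption: exactly 300 minutes counts as "300 minutes or less".
    out["300_mins_or_less"] = out["minutes"] <= 300
    out["less_than_9_ingredients"] = out["n_ingredients"] < 9
    return out


features = add_binary_features(toy)

# Median calories per group: the same comparison the boxplots show.
print(features.groupby("less_than_9_ingredients")["calories"].median())
print(features.groupby("300_mins_or_less")["calories"].median())
```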

Interesting Aggregates

This groupby table shows, for each of the top 50 recipe contributors, the number of recipes along with the average recipe details and average nutrition information. It helps us understand the kinds of recipes that the top 50 authors post and which authors may be the most successful.

total_recipes	average_minutes	average_steps	average_n_ingredients	average_rating	average_calories	average_total_fat	average_sugar	average_sodium	average_protein	average_saturated_fat	average_carbohydrates
3060	59.2291	10.9562	9.65065	4.7874	417.687	33.4444	74.702	28.9265	23.6765	34.1755	14.802
701	95.4779	11.5977	9.98146	4.65391	482.653	33.5264	63.3053	30.1954	40.4308	46.1683	16.174
751	56.5499	9.27963	8.78828	4.45075	420.367	31.3422	46.719	32.6724	43.9294	38.7763	11.9148
868	126.018	11.2442	10.8203	4.7285	236.902	14.826	52.0991	13.7638	17.5956	16.6233	8.74078
1795	43.429	8.26797	7.84345	4.78293	319.64	23.3939	39.7788	31.8396	31.1671	27.9426	9.11866
2436	97.3001	9.35591	9.57348	4.78781	437.114	38.8612	63.5739	28.9606	33.9901	42.2496	11.2278
802	35.793	5.92519	8.94638	4.75231	357.314	25.101	54.3017	20.8254	29.9601	27.7469	11.0249
699	73.4964	8.80114	8.90701	4.77697	312.614	21.3419	54.7239	23.8913	21.2432	25.0057	11.2947
1091	66.7525	7.50596	10.4088	4.84694	508.263	37.9982	98.5866	28.4876	30.7754	57.2374	17.8863
1614	72.3234	10.6945	9.0316	4.70194	376.489	29.9777	42.7187	25.4455	35.2534	43.2007	10.2169
2310	60.4697	9.65758	8.34545	4.82547	434.246	32.0208	93.1442	21.6277	25.8442	40.4463	15.4874
1134	50.8986	6.85626	7.34656	4.76256	277.613	23.94	29.4374	30.3598	17.7637	18.7725	8.13668
705	42.078	7.50355	8.8766	4.82602	380.924	25.9589	79.2766	16.7745	22.2681	34.6241	15.3901
1143	50.5459	9.78828	9.31584	4.78699	344.828	29.0236	39.8215	43.6089	30.7095	31.2441	8.97725
995	43.6151	8.36683	8.77588	4.74174	592.648	44.0814	58.2985	32.3206	55.4151	54.0844	17.8553
2754	112.537	12.5425	10.8951	4.84972	559.235	48.9299	86.459	30.6612	41.8217	64.8998	15.0298
1111	2078.74	13.0945	8.75788	4.81395	518.957	36.8911	51.0513	27.9793	37.2529	45.541	17.6562
994	52.5835	9.93662	10.1217	4.83196	441.355	34.994	60.8219	26.4799	37.7223	38.2887	13.3863
725	50.8193	9.79586	9.52	4.76371	267.761	17.7531	26.3697	20.9628	39.2566	19.1628	6.22207
788	75.302	9.28553	8.87944	4.50738	380.978	26.0279	58.5431	29.8668	32.1701	25.8553	13.1713
1193	44.7385	6.58592	8.69573	4.78718	381.246	26.2506	46.1928	19.6471	39.0444	31.3546	12.0075
1035	71.1208	10.4126	8.29565	4.77345	431.053	32.5556	50.57	28.7382	40.1411	42.3275	12.2464
741	74.2011	11.4062	9.88529	4.76122	554.227	42.0553	127.615	27.9433	37.6329	61.0189	19.0958
758	63.1636	12.2032	9.6372	4.61769	483.157	29.3087	81.9472	39.0989	42.8391	38.1253	18.6095
891	100.516	7.16611	7.51515	4.60748	383.474	29.2963	39.4714	31.7127	39.1762	38.5679	10.523
2368	41.0249	8.86318	8.64105	4.76776	359.764	26.5933	54.7382	18.5988	28.5722	33.0325	10.9848
1553	79.0547	11.5325	9.566	4.76684	562.079	44.3786	77.2968	38.0953	41.49	52	17.5274
849	82.6031	9.42167	9.9788	4.68005	470.633	39.8657	61.9847	39.4782	40.3498	52.4523	12.6207
2503	33.6336	6.0004	7.82621	4.80303	263.465	15.8326	74.1454	10.2089	15.1898	20.3947	11.4914
1016	57.3465	11.9409	9.20374	4.75181	440.324	30.2913	113.75	36.315	31.5974	43.6722	16.4429
1680	37.5167	10.3125	8.60417	4.85226	388.325	33.2821	50.0351	27.272	30.0405	45.4149	9.99107
1588	52.0082	7.5699	7.33312	4.70259	355.508	29.7469	47.1159	17.8778	34.8268	33.665	8.58501
969	53.7162	12.2425	9.95666	4.8307	488.985	38.0175	43.6264	37.1238	46.1868	44.3849	13.4974
863	60.5261	10.6477	9.73812	4.73609	382.113	30.4936	44.5017	20.4137	31.219	34.9143	11.102
1060	36.9934	8.65	8.03774	4.73159	394.951	28.9689	60.4943	23.084	30.5547	36.2594	12.5651
872	95.9771	8.07913	6.93693	4.8392	320.869	25.82	53.5665	13.0069	20.9622	28.4851	7.88188
1572	32.9275	7.97519	7.83842	4.77626	327.606	21.9364	63.4663	18.6202	20.9835	24.5344	12.9148
905	62.1293	9.64862	8.3768	4.76712	368.702	28.2575	54.0398	27.2608	32.3536	35.2287	10.8729
863	33.3453	8.6686	8.63963	4.75957	433.118	27.9571	54.9733	14.5203	24.3708	34.6107	18.3882
743	78.0754	9.1319	8.16555	4.79184	350.871	24.5976	67.179	9.66622	21.6393	21.852	13.5222
904	70.6925	8.68584	8.02655	4.74286	458.836	30.4524	50.8827	23.7113	33.5199	39.5221	17.0918
1711	135.907	6.80538	8.9287	4.7199	452.228	34.6803	61.6283	36.0158	44.8328	43.5786	12.7551
1236	55.962	8.04126	8.48301	4.76789	275.127	18.2856	55.5129	13.5049	19.2087	23.6545	10.4256
1019	55.4622	10.841	9.08342	4.71728	283.828	20.2777	31.0913	18.8744	27.577	19.7733	8.34249
762	32.5853	7.79265	7.75853	4.7038	256.524	17.4528	40.6732	13.9081	22.6745	21.2743	8.51706
759	48.8498	9.64032	9.82082	4.67647	339.621	27.5415	25.361	20.3953	36.0487	32.5652	8.01318
876	55.0719	8.9726	9.65183	4.7623	289.536	19.7203	41.5925	19.7637	29.6929	26.1313	8.71689
805	55.0422	8.38509	9.07205	4.78834	332.084	27.2099	29.636	45.8373	27.4646	32.1615	8.82484
1867	84.7381	11.045	9.74879	4.81743	463.962	35.5865	76.7017	42.8956	35.4253	44.4124	14.4649
738	32.393	10.4187	8.59485	4.77496	445.8	39.3008	59.1369	20.1125	32.5583	53.8482	12.1084
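
A table like this one can be built with a pandas groupby and named aggregation. The sketch below is illustrative only: it assumes that “top” means the contributors with the most recipes and that each recipe is counted once, neither of which is spelled out above, and the helper name `top_contributors` is hypothetical.

```python
import pandas as pd

# Made-up review-level rows for illustration only.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "contributor_id": [10, 10, 10, 20],
    "minutes": [40, 40, 20, 60],
    "calories": [100.0, 100.0, 300.0, 500.0],
})


def top_contributors(df, n=50):
    """Hypothetical helper: per-contributor averages for the top n."""
    # Assumption: count each recipe once, not once per review.
    per_recipe = df.drop_duplicates(subset="id")
    summary = per_recipe.groupby("contributor_id").agg(
        total_recipes=("id", "count"),
        average_minutes=("minutes", "mean"),
        average_calories=("calories", "mean"),
    )
    # Assumption: "top" means most recipes submitted.
    return summary.nlargest(n, "total_recipes")


summary = top_contributors(toy, n=50)
print(summary)
```

The remaining average columns (steps, ingredients, rating, and the other nutrition statistics) would follow the same `name=(column, "mean")` pattern.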

Missing Value Imputation

We did not fill in missing values, as we learned from changing ratings of 0 to null that these reviews probably did not have a rating. We did not use the ratings for our analysis, and this was the only column in the dataset that we used that had more than one missing value, so we dropped rows where there were missing values in the columns we were using and moved forward with our analysis.

Framing a Prediction Problem

We will predict the number of calories in a recipe from its nutrition information, which is a regression problem. This is useful because it allows recipe contributors to estimate the calories in a recipe from the nutrition information and to see which factors contribute most to the calorie count. To evaluate the performance of this model, we will use MSE (Mean Squared Error).
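
For reference, the metric can be written out as below. The symbols are notation introduced here for illustration: n is the number of recipes being evaluated, y_i is the true calorie count of recipe i, and \hat{y}_i is the model's prediction. The square root (RMSE) is included because it is in calories, while MSE is in squared calories.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```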

Baseline Model

For this model, we conducted a linear regression of the number of calories on total fat, sugar, sodium, protein, saturated fat, and carbohydrates, all of which are quantitative variables. We also standardized each of the features, which lets us compare their effects and see which is the most important in predicting calories. The MSE of this model was 1471.969. Note that MSE is measured in squared calories, so this corresponds to a root mean squared error of about 38.4 calories per recipe, not an average error of over 1000 calories. We still saw room to improve on this model.
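
A baseline of this kind can be sketched with scikit-learn as follows. The data here is synthetic (random features with made-up weights), so the numbers it prints are unrelated to the MSE reported above; the train/test split, the helper names, and the weights are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["total fat", "sugar", "sodium",
            "protein", "saturated fat", "carbohydrates"]


def make_synthetic(n=500, seed=0):
    """Hypothetical data generator; weights are invented, not estimated."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame(rng.uniform(0, 100, size=(n, len(FEATURES))),
                     columns=FEATURES)
    weights = np.array([6.0, 0.5, 0.1, 2.0, 0.5, 12.0])
    y = X.to_numpy() @ weights + rng.normal(0, 10, size=n)
    return X, pd.Series(y, name="calories")


def fit_baseline(X, y, seed=0):
    """Standardize the features, fit linear regression, return test MSE."""
    # Assumption: evaluate on a held-out 25% test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    return model, mse


X, y = make_synthetic()
baseline, baseline_mse = fit_baseline(X, y)
baseline_rmse = float(np.sqrt(baseline_mse))  # back in calorie units

# Standardized coefficients are directly comparable across features.
coefs = pd.Series(baseline[-1].coef_, index=FEATURES)
print(f"MSE={baseline_mse:.2f}  RMSE={baseline_rmse:.2f}")
print(coefs.sort_values(ascending=False))
```

Because every feature is standardized before fitting, the size of each coefficient indicates how much a one-standard-deviation change in that feature moves the predicted calories.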

Final Model

To improve upon our model, we added the features described above and used a Lasso regression to help with variable selection. We included the variable 300_mins_or_less because we saw a difference in the distribution of calories based on this distinction. Similarly, we added the variable less_than_9_ingredients because we saw a difference in calorie distribution between recipes with fewer than 9 ingredients and recipes with 9 or more. We tuned our hyperparameters with cross-validation. The MSE of this model was 1469.696, a slight improvement over our baseline model.
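
One way to set up such a model is sketched below, again on synthetic data. The text above does not say which hyperparameters were tuned or how, so the use of `GridSearchCV`, the grid of `alpha` values, and the 5-fold split are assumptions, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

NUTRITION = ["total fat", "sugar", "sodium",
             "protein", "saturated fat", "carbohydrates"]
ALPHAS = [0.01, 0.1, 1.0, 10.0]  # assumed search grid


def make_synthetic(n=400, seed=1):
    """Hypothetical data generator with the two added binary features."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame(rng.uniform(0, 100, size=(n, len(NUTRITION))),
                     columns=NUTRITION)
    X["300_mins_or_less"] = rng.integers(0, 2, size=n)
    X["less_than_9_ingredients"] = rng.integers(0, 2, size=n)
    # Invented relationship, used only to give the model something to fit.
    y = (6 * X["total fat"] + 2 * X["protein"]
         + 12 * X["carbohydrates"] + rng.normal(0, 10, size=n))
    return X, y


def tune_lasso(X, y):
    """Cross-validate the Lasso penalty strength over ALPHAS."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("lasso", Lasso(max_iter=10_000)),
    ])
    grid = GridSearchCV(pipe, {"lasso__alpha": ALPHAS},
                        scoring="neg_mean_squared_error", cv=5)
    grid.fit(X, y)
    return grid


X, y = make_synthetic()
search = tune_lasso(X, y)
best_alpha = search.best_params_["lasso__alpha"]
cv_mse = -search.best_score_  # scorer is negated MSE
lasso_coefs = pd.Series(
    search.best_estimator_.named_steps["lasso"].coef_, index=X.columns)
print(f"best alpha={best_alpha}  cross-validated MSE={cv_mse:.2f}")
print(lasso_coefs)
```

Coefficients that Lasso shrinks to exactly zero mark features the model has effectively dropped, which is how it helps with variable selection.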