Recipe Recommendation With Hierarchical Graph Attention Network

Recipe recommendation systems play an important role in helping people find recipes that are of their interest and fit their eating habits. Unlike what has been developed for recommending recipes using content-based or collaborative filtering approaches, the relational information among users, recipes, and food items is less explored. In this paper, we leverage the relational information into recipe recommendation and propose a graph learning approach to solve it. In particular, we propose HGAT, a novel hierarchical graph attention network for recipe recommendation. The proposed model can capture user history behavior, recipe content, and relational information through several neural network modules, including type-specific transformation, node-level attention, and relation-level attention. We further introduce a ranking-based objective function to optimize the model. Thorough experiments demonstrate that HGAT outperforms numerous baseline methods.


INTRODUCTION
Large-scale food data offers rich knowledge about food and can help tackle many central issues of human society (Mouritsen et al., 2017;Min et al., 2019;Tian et al., 2021). Recipe websites, in particular, contain a large volume of food data because individuals are eager to share their created recipes online (Teng et al., 2012). This provides an opportunity for other users to rate and comment, which helps people form the habit of referring to these websites when deciding what to eat (Ueda et al., 2011). Food.com 1 , one of the largest recipe-sharing websites in the world, collects over half a million recipes. This large volume of data also reflects the great demand for recipe-providing services (Ueda et al., 2011). Accordingly, digging into this overwhelming amount of online recipe resources to find a satisfying recipe is always hard (Britto et al., 2020), especially when recipes are associated with various heterogeneous content such us ingredients, instructions, nutrients, and user feedback. However, Recipe Recommendation Systems have the power to help users navigate through tons of online recipe data and recommend recipes that align with users' preferences and history behavior (Khan et al., 2019).
Existing recipe recommendation approaches are mostly based on the similarity between recipes (Yang et al., 2017;Chen et al., 2020). A few of the approaches tried to take the user information into account (Freyne and Berkovsky, 2010;Forbes and Zhu, 2011;Ge et al., 2015;Vivek et al., 2018;Khan et al., 2019;Gao et al., 2020), but they only defined similar users based on the overlapping rated recipes between users, while ignoring the relational information between users, recipes, or ingredients. Nevertheless, user preference toward food is complex. A user may decide to try a new recipe because of its ingredients, its flavor, or a friend's recommendation. Therefore, a thoughtful recipe recommendation should take all these factors into account. Thus, it is important to encode the relational information and deeply understand the relationship among users, recipes, and ingredients for recipe recommendation.
In this paper, we seek to leverage the relational information into recipe recommendation. We first construct a heterogeneous recipe graph to formulate the relationship among nodes. In particular, we start by collecting a corpus of recipes, where each recipe contains ingredients, instructions, and user ratings. We then transform this set of recipes into a recipe graph, with three types of nodes (i.e., ingredient, recipe, and user) and four types of relations connecting them (i.e., ingredientingredient, recipe-ingredient, recipe-recipe, and user-recipe relations). The illustration of the built recipe graph is shown in Figure 1. After constructing the recipe graph, we propose to solve the recipe recommendation problem using the graph learning approach, which naturally incorporates the relational information into recommendation.
In particular, we propose a novel heterogeneous recipe graph recommendation model, HGAT, which stands for Hierarchical Graph Attention Network. HGAT can recommend recipes to users that align with their history behavior and preferences. Specifically, we leverage several neural network modules to encode the recipe history, recipe content, and relational information. Specifically, we first apply a type-specific transformation matrix to model the heterogeneous content associated with each node (e.g., instructions and nutrients) and transform them into a shared embedding space. Then, we design a node-level attention module to encode the neighbor nodes with different weights. Considering we have multiple types of nodes and edges, the module individually runs for each relation to formulate relation-specific embeddings that contain each type of neighbor nodes information. For example, given a recipe node with three connected relations (i.e., recipe-ingredient, reciperecipe, and recipe-user), the node-level attention module encodes each type of node individually and learns three relation-specific embeddings. Next, we develop a relation-level attention module to combine all the generated relation-specific embeddings with different weights and obtain the updated embedding for each node. To illustrate with the same example, the relation-level attention module merge the learned three relation-specific embeddings into one to represent the final embedding of the given recipe node. Therefore, the learned embeddings contain not only the neighbor nodes' information but also the connected relation information. Finally, we introduce a score predictor and a ranking-based objective function based on the learned user and recipe embeddings to optimize the model. To summarize, our main contributions in this paper are as follows: • We argue that relational information is important in understanding user preference toward recipes. We further leverage this information into the recipe recommendation problem and proposed a graph learning approach to solve it.
• We develop HGAT, a hierarchical graph attention network for recipe recommendation. HGAT is able to capture both node content and relational information and make appropriate recommendations. HGAT comprises several neural network modules, including type-specific transformation, node-level attention, and relation-level attention. • We conduct extensive experiments to evaluate the performance of our model. The results show the superiority of HGAT by comparing with a number of baseline methods for recipe recommendation.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed model. Section 4 presents the experiments of different models on recipe recommendation, followed by the conclusion in section 5.

RELATED WORK
This work is closely related to the studies of food recommendation, recipe recommendation, and graph representation learning. Food Recommendation. Food recommendation aims to provide a list of food items for users that meet their preference and personalized needs, including restaurants, individual food items, meals, and recipes (Trattner and Elsweiler, 2017a;Min et al., 2020). Despite food recommendation being a comparatively understudied research problem, a decent body of literature exists (Trattner and Elsweiler, 2017a). For example, Sano et al. (2015) used the transaction data in a real grocery store to recommend grocery items. Trattner and Elsweiler (2017b) used nine prominent recommender algorithms from the LibRec 2 framework to recommend meal plans and recipes. In our work, we mainly focus on recipe recommendation since this is the one that is most relevant to our daily life. Existing recipe recommendation approaches are mostly content-based, namely, recommending recipes based on the similarity between recipes (Yang et al., 2017;Chen et al., 2020). A few of the approaches proposed include user information into the recommendation procedures (i.e., collaborative filtering). Still, they only considered similar users based on the overlapping rated recipes, ignoring the relational information among users, recipes, or ingredients (Freyne and Berkovsky, 2010;Forbes and Zhu, 2011;Ge et al., 2015;Vivek et al., 2018;Khan et al., 2019;Gao et al., 2020). For example, Yang et al. (2017) developed a framework to learn food preference based on the itemwise and pairwise recipe image comparisons. Ge et al. (2015) utilized a matrix factorization approach that fuses user ratings and tags for recipe recommendation. On the other side, several works tried to recommend recipes based on a built graph, but the user information is not included (Li et al., 2010;Teng et al., 2012;Adaji et al., 2018). For example, Adaji et al. (2018) recommended recipes to users based on a graph where 2 recipes are connected if the same person has reviewed them. Li et al. (2010) constructed a cooking graph where the nodes are cooking actions or ingredients Recipe examples with ingredients, instructions, and user reviews are shown on the (left). The Recipe Graph (right) includes three types of nodes (i.e., ingredient, recipe, and user) and four types of relations which connect these nodes (i.e., ingredient-ingredient, recipe-ingredient, recipe-recipe, and user-recipe relations). and recommend recipes based on their similarity. Teng et al. (2012) build two types of ingredient graphs to predict recipe pairs based on the substitution or complement of ingredients. Haussmann et al. (2019) leveraged a knowledge base question answering approach to recommend recipes based on the ingredients. In our work, we try to model the relational information through a heterogeneous recipe graph with user information included. Therefore, the graph learning approach could automatically encode the relational information and make considerate recommendations accordingly. User Behavior Modeling. User Behavior Modeling is widely studied in the domain of recommendation. For example, Zhou et al. (2018) proposed an attention based user behavior modeling framework that projects all types of behaviors into multiple latent semantic spaces for recommendation. Elkahky et al. (2015) used a rich feature set to represent users, including their web browsing history and search queries to propose a content-based recommendation system. Abel et al. (2011) analyzed how user profiles benefit from semantic enrichment and compared different user modeling strategies in a personalized news recommendation system. In the field of food recommendation, Zhang et al. (2016) adopted user feedback of dining behavior to recommend restaurants. Musto et al. (2020) developed a recommendation strategy based on knowledge about food and user health-related characteristics by focusing on personal factors such as the BMI of users and dietary constraints. Our work incorporates user history behavior and user feedback such as the ratings toward recipes, which make our recommendation accounts for user interest and preferences. Graph Representation Learning. Graph representation learning has become one of the most popular data mining topics in the past few years (Wu et al., 2021). Many graph representation learning approaches (Perozzi et al., 2014;Dong et al., 2017;Hamilton et al., 2017;Kipf and Welling, 2017;Schlichtkrull et al., 2018;Velickovic et al., 2018) were proposed to encode the graph-structure data. They take advantage of the content information associated with each node and the relational information in graph to learn vectorized embeddings, which are used in various graph mining tasks such as recommendation. For example, DeepWalk (Perozzi et al., 2014) learned node embeddings by feeding a set of random walks into a SkipGram model (Mikolov et al., 2013). Nodes in the graph were trained on each walk simultaneously where the neighbor nodes served as the contextual information. matapath2vec (Dong et al., 2017) conducted meta-path based random walks and utilized a SkipGram model to embed the heterogeneous graph. GAT (Velickovic et al., 2018) utilized an attention mechanism into message passing to learn node embeddings on homogeneous graphs by aggregating neighbor nodes' features with different attentions. In our work, we propose a hierarchical graph attention network that leverages attention mechanisms on different levels (i.e., node-level and relationlevel), which can be applied on heterogeneous graphs and achieve outstanding performance.

PROPOSED MODEL
In this section, we describe our novel hierarchical graph attention network model, HGAT. As illustrated in Figure 2, our model contains several major components. We first apply a type-specific transformation matrix to take various input features and project them into a shared embedding space. We then design a nodelevel attention module to encode the neighbor nodes that are connected by each relation to learn relation-specific embeddings. Next, we develop a relation-level attention module to combine all relation-specific embeddings and obtain the updated embedding for each node. Finally, we introduce a score predictor and a ranking-based objective function to optimize the model.

Type-Specific Transformation
Previous works (Salvador et al., 2017;Marin et al., 2019) have shown that cooking instructions are necessary for providing a discriminative understanding of the cooking process. We follow Marin et al.'s work (Marin et al., 2019) to use the pretrained skipinstruction embeddings as the instruction representations, and then average these representations to get one input raw feature for each recipe. For example, given a recipe with M sentence instruction embeddings {x ins,1 , x ins,2 , . . . , x ins,M }, we calculate the average instruction embeddings x ins as the input feature for recipe nodes: In addition, we formulate the nutrients of each ingredient into a vector and then use this nutrient vector as the content information for ingredient nodes. We denote this content information as the input feature x ing for ingredient nodes. To represent the user nodes, since most of the users in Food.com have not provided any detailed information about themselves (e.g., description, location, preference, or demographic information) and it might violate the privacy policy by crawling this individual information, we use the xavier normal random initialized feature (Glorot and Bengio, 2010) as the input feature for user nodes, denoted as x user .
Due to the heterogeneity of input features for different node types (i.e., recipe, ingredient, and user), given a node v i with type φ i , we introduce a type-specific transformation matrix W φ i to project the input features into the same embedding space. Specifically, for each node v i , we have the projection process formulated as follows: In other words, with this type-specific transformation operation, the instruction, ingredient, and user embeddings would be in shared embedding space, and the model can therefore take arbitrary types of input features.

Node-Level Attention
To encode and fuse the neighbor nodes information, we propose the node-level attention module, which is based on the attention aggregator as shown in Figure 3A. Compared to the mean pooling aggregator ( Figure 3B) and concatenation aggregator ( Figure 3C) that have been used widely (Hamilton et al., 2017) but simply combine the features, attention aggregator can learn the importance of each neighbor node and fuse them wisely. Specifically, for a node v i , we first use the node-level attention to calculate each relation-specific embedding h i,r for each relation r that connects to v i . To explain how we get the relation-specific embeddings h i,r , we will start by describing a single node-level attention layer, as to calculate the node-level attention for each node. To compute the h i,r in layer l + 1, the input of the nodelevel attention layer is a set of node embeddings from layer where N i,r denotes the number of neighbor nodes that connect to v i through relation r, and d l is the dimension of embeddings in layer l. In order to acquire sufficient expressive power to transform the input features into higher-level features z i ∈ R d l+1 , where d l+1 is the dimension of embeddings in layer l + 1, a shared linear transformation weight matrix W r ∈ R d l ×d l+1 for relation r is applied: With the intermediary features z i , z j for nodes v i and v j , respectively, we calculate the unnormalized attention score e ij between v i and v j to indicate the importance of v j to v i . The calculation process is defined as follows: where indicates the concatenation operator and W ij ∈ R 2d l+1 is a weight vector that represents the attention between v i and v j . Theoretically, our model can calculate the attention of every node to every other node without considering the relational information. Taking the message passing protocol into consideration, we acknowledge the graph structure and perform masked attention (Velickovic et al., 2018), which only computes e ij if there exists an edge between v i and v j in the graph. In other words, we focus only on the first-order neighbor nodes of v i (including v i ). We further normalize the e ij using the softmax function to make the coefficients easily comparable across different nodes. The normalized node-level attention vector α ij is computed as: After that, we use α ij as coefficients to linearly combine the neighbor nodes features and generate the relation-specific embedding h i,r : The process is formulated as follows: where σ is the nonlinear activation function (we use ReLU in our experiment). Instead of simply performing a single attention function, inspired by previous work (Vaswani et al., 2017), we extend the node-level attention to multi-head attention so that the model and the training process are more stable. In particular, we compute the node-level attention M times in parallel, and then concatenate the output and project them into a final learned relation-specific embedding h i,r . The computation process is formulated as follows: where W O ∈ R Kd m ×d l+1 is a learnable weight matrix, and d m is the dimension of attention heads with d m = d l+1 /M. Therefore, with the reduced dimension d k of each head, the total cost of computation for multi-head attention is similar to that of single-head attention with dimension d l+1 .

Relation-Level Attention
In our graph, nodes are connected to other nodes through multiple types of relations (e.g., recipe to user, recipe to ingredient, and recipe to recipe), while each relation-specific node embedding can only represent the node information from one perspective. Therefore, the relation-specific node embeddings around a node need to be fused to learn a more comprehensive node embedding. To address the challenge of selecting relation and fusing multiple relation-specific node embeddings, we propose a relation-level attention module to learn the importance of different relations and automatically fuse them. Specifically, we use the relation-level attention module to combine all relation-specific embeddings h i,r and generate the final node embedding h i . We first use a shared nonlinear weight matrix W R to transform the relation-specific node embeddings. Then we use a relationlevel intermediary vector q to calculate the similarity with transformed embeddings, which is also used as the importance of each relation-specific node embedding. After that, we average the importance of all relation-specific node embeddings for relation r to generate the importance score w i,r for node v i . The process is shown as follows: where W R ∈ R d l+1 ×d l+1 is a nonlinear weight matrix, b ∈ R d l+1 is the bias vector, q ∈ R d l+1 is the relation-level intermediary vector, and V r denotes the set of nodes under relation r. To make the coefficients comparable across different relations, we normalize w i,r to get the relation-level attention vector β i,r for each relation r using the softmax function. The normalization process is formulated as: where R i indicates the associated relations of node v i . Here, the generated attention vector β i,r can be explained as the contribution of relation r to node v i . Apparently, the higher the β i,r , the more important the relation r is. Since different relations may contribute differently to the training objective, the relation-level attention vector for each relation could have different weights accordingly. Therefore, we fuse the relationspecific node embeddings h i,r with the relation-level attention to obtain the final node embedding h i . The process is demonstrated as follows: Here, the final node embedding h i can be interpreted as the optimally weighted combination of relation-specific node embeddings, while each relation-specific node embedding is an optimally weighted combination of the node embeddings that share the same relation.

Recipe Recommendation
Above we discuss how to propagate and learn node embedding h i in layer l + 1. However, how to take advantage of these informational node embeddings and make recipe recommendations remains a challenge. In this section, we introduce how we leverage these embeddings to make proper recommendations. Specifically, suppose we propagate and update the node embeddings through L layers of GNN with both node-level and relation-level attentions encoded, where L is a hyperparameter, we obtain L representations generated by each layer for each node. For example, given a node v i with type φ i , the learned node embeddingĥ i can be denoted as The process can be formulated as follows: Since the representations generated by different layers underline the combination of messages passed over different orders of connections, they represent the node information from different perspectives. As such, they have different contributions in reflecting the node information. Therefore, we concatenate them to develop the final embedding for each node. The concatenation process can be shown as follows: where l indicates the layer and is the concatenation operation. Accordingly, we not only enrich the last layer of node embedding with the embedding from the former propagation layers but also enable the power to supervise the range of propagation by controlling the parameter L. Here, we only apply concatenation to combine these different layers of embeddings out of simplicity and effectiveness (Xu et al., 2018). Still, other operations can also be leveraged such as max pooling, LSTM (Hochreiter and Schmidhuber, 1997), or weighted average. These aggregators suggest different assumptions in aggregating the embeddings. We left this to explore in future work. Finally, we leverage a score predictor to make recommendations based on the learned user embeddings and recipe embeddings. In particular, given a user u with embedding h u and a recipe with embedding h r , the score predictor can take them as the input and generate a score to indicate if the model should recommend this recipe to the user. The score is ranged between 0 and 1. When the score is closer to 1, it means the recipe should be recommended and vice versa. The recommendation process is demonstrated as follows: where sp is the score predictor and s u,r is the recommendation score of recipe r to user u. We further compare multiple functions for sp including inner product, cosine similarity, and multi-layer perceptron. The one that renders the highest performance is selected for our model (i.e., inner product). Further details are illustrated in section 4.7.3. To learn the model parameters, we employ a ranking-based objective function to train the model. In particular, the objective involves comparing the recommendation scores between nodes connected by a user-recipe relation against the scores between an arbitrary pair of user and recipe nodes. For example, given an edge connecting a user node u and a recipe node r, we encourage the recommendation score between u and r to be higher than the score between u and a randomly sampled negative recipe node r ′ . We formulate the objective L as follows: where s u,r and s u,r ′ are the recommendation score between user u and correct recipe r as well as incorrect recipe r ′ , respectively, U denotes the user set, and N u indicates the recipe neighbors of user u.

EXPERIMENTS
In this section, we conduct extensive experiments with the aim of answering the following research questions: • RQ1: How does HGAT perform compared to various baseline methods on recipe recommendation? • RQ2: How do different components, e.g., node-level attention or relation-level attention affect the model performance? • RQ3: How do various hyper-parameters, e.g., number of embedding dimensions and propagation layers, impact the model performance?

Dataset
We extract the recipes from Reciptor (Salvador et al., 2017;Li and Zaki, 2020) and only use those recipes that have comprehensive information (i.e., with the quantity and unit indicated for each ingredient). All these recipes are collected from Food.com, which is one of the largest recipe-sharing platforms online. To formulate the real-life user-recipe interactions, we crawl the user ratings for each recipe from the platform and leverage this information for the recipe recommendation task. For each ingredient that appeared in the dataset, we further match them to the USDA nutritional dataset (U.S. Department of Agriculture, 2019) to get the nutritional information. Next, we construct our recipe graph by transforming the recipes, users, and ingredients into nodes with the type "recipe, " "user, " and "ingredient, " respectively. After that, we build four types of edges among these nodes to connect them. In particular, we first connect each recipe and its ingredients with an edge, denoted as recipe-ingredient relation, while the weight of each ingredient is used as the edge weight. We then connect recipe nodes by their similarity from FoodKG (Haussmann et al., 2019) and the score is used as the edge weight, as shown in Reciptor (Li and Zaki, 2020). We further connect ingredient nodes by the co-occurring probabilities using Normalized Pointwise Mutual Information (NPMI) (Bouma, 2009) from FlavorGraph (Park et al., 2021), as shown in the KitcheNette (Park et al., 2019). Moreover, we construct edges between users and recipes based on the interactions, while the ratings are treated as the edge weight. The statistics of the constructed recipe graph are provided in Table 1.

Experimental Setup
We employ the leave-one-out method to evaluate the model performance, which is widely utilized in existing recommendation studies (He et al., 2016Bayer et al., 2017;Jiang et al., 2019). Specifically, for each user, we leave one positive recipe out as the test data, one positive recipe out for validation, and used the remaining positive recipes for training. In the testing period, we randomly sampled 100 negative recipes for each user and evaluated the model performance using Recall and Mean Reciprocal Rank (MRR) metrics. We reported the performance under top@K, while K ranges from 1 to 10. The definitions of these two metrics are illustrated as follows: • Recall@K. It shows the ratio of correct recipes being retrieved in the top@K recommendation list, which is computed by: where U test is the set of users in test data for evaluation, RE(u) indicates the top@K recommendation list for user u, and GT (u) denotes the ground truth recipe set for user u. • MRR@K. It measures the ranking quality of the recommendation list, whis is defined as: where GT K(u) denotes the ground truth recipes that appear in the top@K recommendation list for user u, andr(v) represents the ranking position of the recipe in the recommendation list.

Baseline Methods
We compare HGAT with seven baseline methods, including classic recommendation approaches, recipe representation learning methods, and graph embedding models.
• BPR (Rendle et al., 2009): A competitive pairwise matrix factorization model for recommendation, which is also one of the state-of-the-art algorithms used widely in recipe recommendation task (Trattner and Elsweiler, 2019). • IngreNet (Teng et al., 2012): A recipe recommendation approach by replacing the popular ingredient list with the co-occurrence count extracted from the constructed ingredient network. A GCN layer is applied on the user-recipe graph to learn the embeddings jointly for a fair comparison. • NeuMF : One of the state-of-the-art neural collaborative filtering models that use neural networks on user and item embeddings to capture their nonlinear feature interactions. • matapath2vec (Dong et al., 2017): A heterogeneous graph embedding method based on random walk guided by metapath. Here, we use meta-path user-recipe-ingredient-recipeuser. • GraphSAGE (Hamilton et al., 2017): A graph neural network model that learns embeddings by aggregating and sampling the features of local neighbors. • GAT (Velickovic et al., 2018): A graph attention network model that leverages the attention mechanism to aggregate neighbor information on the homogeneous graphs. • Reciptor (Li and Zaki, 2020): One of the state-of-the-art recipe embedding models based on the set transformer and optimized by an instructions-ingredients similarity loss and a knowledge graph based triplet loss. A GCN layer is applied to each relation to jointly train user and recipe embedding for a fair comparison.

Implementation Details
For the proposed model HGAT, we set the learning rate to 0.005, the number of node-level attention heads to 4, the hidden size to 128, the input dimension of skip-instruction embeddings to 1,024, the input dimension of ingredient embeddings to 46, batch size to 1,024, and the training epochs to 100. We optimize the model with Adam (Kingma and Ba, 2014) and decay the learning rate exponentially by γ = 0.95 every epoch. For random walk based graph representation learning algorithms including DeepWalk and metapath2vec, we set the window size to 5, walk length to 30, the number of walks rooted at each node to 5, and the number of negative samples to 5. For homogeneous graph representation learning approaches including DeepWalk, GAT, and GraphSage, we ignore the heterogeneity of nodes and perform the algorithm on the whole graph. For a fair comparison, we set the embedding dimension to 128 for all above models except for Reciptor as we follow the original setup and use 600 as the embedding dimension.

Performance Comparison (RQ1)
We use the Recall and MRR as the evaluation metrics. The performances of all models are reported in Table 2. The best results are highlighted in bold. According to the table, we can find that our model HGAT outperforms all the baselines in all cases. Specifically, traditional collaborative filtering recommendation approaches such as BPR and NeuMF perform poorly because they neither consider the hidden relational information nor the ingredients associated with each recipe. Recipe recommendation model IngreNet obtains decent performance after incorporating the GCN to leverage the relational information. Reciptor fails to perform well because the model only learns representations for recipes while overlooking the user information. Even applying a GCN layer to learn the user embeddings jointly cannot fully encode the information. However, homogeneous graph representation learning approaches (i.e., GraphSage and GAT) achieve satisfactory results for Recall but perform poorly for MRR. This is because the node type information is important in modeling the graph structure. Ignoring this information prevents the model from learning comprehensive node embeddings and failing to rank the correct recipe higher within the returned recommendation list. On the contrary, metapath2vec, as a heterogeneous graph representation learning algorithm, performs well for both Recall and MRR. Finally, the proposed model, HGAT, achieves the best performance compared to all the baseline methods by incorporating recipe content, high-order interactions, relational information, and leveraging attention mechanisms to encode different types of nodes and relations. In general, compared to the best baseline, HGAT improves the recall score by +5.42%, +5.32%, +6.40%, +6.41%, +7.01%, +7.18%, +7.88%, +8.42%, +9.06%, and +9.53% on k ranges from 1 to 10, respectively. When it comes to MRR score,

Ablation Studies (RQ2)
HGAT is a joint learning framework composed of several neural network modules. How do different components impact the model performance? To answer this question, we conduct ablation studies to evaluate the performances of several model variants including: • HGAT mean : a model variant that uses neither node-level attention nor relation-level attention. Instead, it uses a mean operator to combine the neighbor node features, and relation features. • HGAT pool : a model variant that uses neither node-level attention nor relation-level attention. Instead, it uses a pooling operator to combine the neighbor node features and a mean operator to combine the relation features. • HGAT nAtt : a model variant that uses node-level attention to fuse neighbor node features, and mean operator to combine the relation features. • HGAT: the proposed model that leverages both node-level attention and relation-level attention.
The results are reported in Table 3. From this table: • HGAT pool has better performance than HGAT mean , indicating the way how to aggregate neighbor nodes information is important, and simply using the mean operator to combine the neighbor node messages could lose some information. • HGAT nAtt performs better than HGAT pool and HGAT mean , demonstrating the effectiveness of node-level attention and illustrating using attention mechanism advances using mean or pooling operator in aggregating the neighbor nodes information.

Parameter Sensitivity (RQ3)
To estimate the proposed models' sensitivity to hyperparameters, we conducted many contrast experiments to measure the performance of HGAT under different hyperparameter settings. We start by exploring the influence of embedding dimensions, as it usually plays a pivotal role in data modeling. We then analyze the impact of propagation layer numbers to show the importance of modeling relational information. Moreover, we study how different score predictors affect recommendation performance.

Impact of Different Embedding Dimensions
We report the performance of Recall@5, MRR@5, Recall@10, and MRR@10 with respect to the number of embedding dimensions in Figure 4. Specifically, we search the number of embedding dimensions within {16, 32, 64, 128, 256, 512} and evaluate the performance of the proposed HGAT and two best baselines (i.e., metapath2vec and GAT). From the figure: • Increasing the number of embedding dimensions improves the model performance. Clearly, all models achieve the highest score on the 2 metrics when using the 256 dimensions. This is because more dimensions could have more capacity to represent the node content. • Further increasing the number of embedding dimensions to 512 leads to overfitting. This might be projecting the representation into a higher-dimensional space introduced noise. • When varying the number of embedding dimensions, HGAT is consistently superior to other methods on different setups. This demonstrates the capability of HGAT compared to other approaches.

Impact of Different Propagation Layers
We vary the model depth and test performance to investigate whether the proposed HGAT can benefit from multiple embedding propagation layers. Specifically, we search the layer numbers in {1, 2, 3, 4}. Experimental results are reported in   Table 4, wherein HGAT-2 indicates the model with 2 embedding propagation layers and similar notations for others. By analyzing the Table 4, we have the following observations: • Equipping HGAT with more propagation layers substantially enhances the recommendation performance. Clearly, HGAT-2 and HGAT-3 achieve consistent improvement over HGAT-1 in all cases, given HGAT-1 only considers the first-order neighbor nodes. We attribute this improvement to the effective modeling of relational information: relational information are carried by second-order and third-order connectivities, and relational information can be modeled by encoding these interactions. • When further stacking propagation layer on the top of HGAT-3, we find that HGAT-4 leads to overfitting on the dataset. This is because applying a too deep architecture might introduce noises into modeling. Similar results to HGAT-3 verifies that conducting three propagation layers is sufficient to capture the relational information. • By comparing the results in Table 4 to Table 2 (which report the HGAT-2 performance), we can find HGAT is consistently superior to other methods. This again verifies the effectiveness of HGAT, empirically showing that explicit modeling of highorder interactions and relational information can greatly facilitate the modeling and further improve the performance in recommendation tasks.

Impact of Different Score Predictors
To show the effectiveness of the used score predictor in our model, we testify the effect of different score predictors and report the performance in Table 5. In particular, we FIGURE 5 | Visualization of the user and recipe embeddings generated by different models. Red nodes represent the user while blue nodes represent the recipe. An edge connecting a user and a recipe indicates the interaction between them.
use different similarity functions, namely, cosine similarity, multi-layer perceptron (MLP), and inner product. We evaluate the performance of different functions on Top@K recommendations where K ranges from 1 to 10. From the table: • Inner product is the best function to calculate the similarity score and cosine is the worst. This might be because the cosine similarity only cares about the angle difference between 2 vectors, which fails to capture the complexity of the learned embeddings, while the inner product considers both the angle and the magnitude. • MLP achieves similar performance on inner product under Recall but still performs worse when using MRR to evaluate, which is consistent with the findings in paper (Rendle et al., 2020). This further shows the compatibility of the inner product.

Case Study: Embedding Visualization
For a more intuitive understanding and comparison, we randomly select 8 user-recipe interaction pairs and generate the visualization of their embeddings using t-SNE (van der Maaten and Hinton, 2008). As shown in Figure 5, we can find that IngreNet does not perform well. The model can roughly separate the users and recipes into the left and right parts, but there are gaps within each cluster. Also, the lines between clusters are disordered. Ideally, there should develop a clear mapping between each user-recipe pair. In other words, if we connect for each pair, the connecting lines should be parallel to each other [similar to the "king-man=queen-woman" relationship (Mikolov et al., 2013)]. GAT can successfully form the users and recipes into two condense clusters, but fail to construct the parallel lines between them. Metapath2vec builds two clusters for users and recipes and establish parallel lines between clusters in some sense. Examples are the "1404-57276" and "1427-50599" userrecipe pairs. However, the formed clusters are not condensed, and only a few of the lines are parallel to each other. Finally, our model HGAT can easily form the users and recipes into 2 condensed clusters and obtain parallel lines for almost all user-recipe pairs. This further demonstrates the superiority of our model.

CONCLUSION
In this paper, we propose to leverage the relational information into recipe recommendation. To achieve this, we design HGAT, a novel hierarchical graph attention network for solving the problem. HGAT is able to capture user history behaviors, recipe content, and relational information through several neural network modules. We further introduce a score predictor and a ranking-based objective function to optimize the model. Extensive experiments demonstrate that HGAT outperforms numerous baseline approaches. In the future, we plan to incorporate more information and improve HGAT. We observe that there are still plenty of useful information components that we can use such as user reviews and recipe health factors. One promising direction is to investigate how to make recipe recommendation that fit user preferences and health concerns.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
YT, CZ, RM, and NC contributed to the overall design of the study. YT conducted the experiments. CZ performed the interpretation of results. YT wrote the first draft of the manuscript. All authors contributed to manuscript revision and approved the submitted version.