Deep Semantic Understanding and Sequence Relevance Learning for Question Routing in Community Question Answering

Question routing (QR) aims to route newly submitted questions to the potential experts most likely to provide answers. Many previous works formalize the question routing task as a text matching and ranking problem between questions and user profiles, focusing on text representation and semantic similarity computation. However, these works often fail to extract matching features efficiently and lack deep contextual understanding of the text. Moreover, we argue that in addition to the semantic similarity between terms, the interactive relationship between question sequences and user profile sequences also plays an important role in matching. In this paper, we propose two BERT-based models, called QR-BERTrep and QR-tBERTint, that address these issues from different perspectives. QR-BERTrep is a representation-based feature ensemble model in which we integrate a weighted sum of BERT layer outputs as an extra feature into a Siamese deep matching network, aiming to overcome non-context-aware word embeddings and limited semantic understanding. QR-tBERTint is an interaction-based model that explores the interactive relationships between sequences as well as the semantic similarity of terms through a topic-enhanced BERT model. Specifically, it fuses a short-text-friendly topic model to capture corpus-level semantic information. Experimental results on real-world data demonstrate that QR-BERTrep significantly outperforms other traditional representation-based models. Meanwhile, QR-tBERTint exceeds QR-BERTrep and QR-BERTint by up to 17.26% and 11.52% in MAP, respectively, showing that combining global topic information and exploring interactive relationships between sequences is quite effective for question routing tasks.


Introduction
Community Question Answering (CQA) is an online service that enables users to post questions and obtain answers from other users, and it has proven to be a very effective way of sharing knowledge and experience. Recently, with the increasing number of questions that cannot be answered in time, much concern has arisen over the efficiency and answer quality of CQA services [28]. Therefore, routing a newly posted question to the right user for a quick and accurate answer is an important strategy for maintaining user engagement and the vibrancy of a CQA platform. Modeling the similarity and relevance between user profiles and questions is critical in textual content-based question routing approaches. User modeling is generally based on the user's historical answer record: all the answers provided by the user in the past are collected to form the user's profile. When we treat user profiles as documents and questions as queries, the question routing task can be viewed as a classic text matching and ranking problem [24]. Finding and sorting the documents that match the query is equivalent to finding the best expert who can answer the question.
Text understanding plays a vital role in matching and ranking. Traditional methods mainly include language models and topic models, which rely heavily on lexical overlap or word co-occurrence [4,38]. However, these methods have very limited text understanding ability and are usually unable to capture deep and complex semantics efficiently, leading to unsatisfactory results. Recently, along with the rapid development of distributed word embeddings and deep learning, neural ranking networks have been applied to question routing tasks [25,36,1,31,15]. Most of these neural models are representation-based: they first turn the question and the user profile into vectors using word embeddings (Word2Vec [21], GloVe [24]), then use a typical neural network (e.g., a CNN or RNN) to extract patterns and construct dense, meaningful feature vectors separately. Finally, the semantic similarity is calculated for ranking.
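As a minimal sketch of this representation-based pipeline, with toy 4-dimensional vectors standing in for real word embeddings and mean pooling standing in for a learned CNN/RNN encoder:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def represent(token_vectors: np.ndarray) -> np.ndarray:
    """Trivial stand-in encoder: mean-pool the token embeddings.
    Real representation-based models learn this with a CNN or RNN."""
    return token_vectors.mean(axis=0)

# Toy "word embeddings" for a question and a user profile (hypothetical values).
question_tokens = np.array([[0.9, 0.1, 0.0, 0.2],
                            [0.8, 0.2, 0.1, 0.1]])
profile_tokens = np.array([[0.7, 0.3, 0.0, 0.3],
                           [0.9, 0.0, 0.2, 0.1],
                           [0.6, 0.4, 0.1, 0.2]])

# Encode each sequence separately, then score their semantic similarity.
score = cosine_similarity(represent(question_tokens), represent(profile_tokens))
```

The two sequences never interact until the final similarity computation, which is exactly the limitation the interaction-based model below targets.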
Although existing deep neural representation-based methods have achieved promising performance, they have several shortcomings. First, traditional word embeddings are static, which means they fail to distinguish a term's meaning in different scenarios [9]. Second, the feature extractors are mainly based on CNNs or LSTMs; however, CNN-based methods [36,37] usually have a limited receptive field for capturing long-distance dependencies, and LSTM-based methods [7] are difficult to parallelize. Third, they mainly focus on semantic similarity while ignoring sequence interactions. In fact, text matching is very complicated in CQA; the relationships between the two sequences are also important factors in matching. For the specific task of question routing, matching questions and users can be seen as matching questions and answers, since a user's profile is composed of answers. In general, questions and answers not only share terms and topics but are often logically connected, so exploring the semantic similarity of terms alone is not enough to achieve good matching performance.
Recently, the pre-trained bidirectional contextual language model BERT [8] has brought unprecedented performance gains in text understanding tasks, and we expect it to improve the performance of question routing in community Q&A as well. However, some special challenges need to be addressed. First, the low average participation rate of users and the relatively short length of questions in Q&A communities lead to severe data sparsity, which results in insufficient textual content for user modeling and question modeling. According to previous studies [18,20], most answers come from very few users: on Quora, a well-known community Q&A website, 90% of questions received fewer than 10 answers, more than 30% of users did not answer any question, and only 16.74% of users answered more than 4 times [30]. Second, modeling the similarity and relevance between question-user pairs is challenging due to the large number of domain-specific terms and the fact that there is little direct lexical overlap between question sequences and user profile sequences.
Based on the above analysis, in this paper we propose two novel models that address the question routing task from different perspectives: a tag-word topic-enhanced interaction-based method called QR-tBERTint and a combined representation-based model called QR-BERTrep. Specifically, in QR-tBERTint, we take questions and user profiles as query-document pairs and concatenate them into a longer sequence as input. By fine-tuning BERT on our task-specific dataset, contextual semantic learning and question-profile pair relationship exploration are integrated into a unified model. In addition, we innovatively incorporate a tag-word topic model into QR-tBERTint to handle domain-specific terms. In the other model, QR-BERTrep, we incorporate the contextualized embeddings learned from the pre-trained model into an existing Siamese deep learning-based matching model to enhance semantic understanding. The main contributions of this paper are as follows:
- We propose two novel BERT-based deep neural models that solve the question routing task from different perspectives: representation-focused and interaction-focused. Specifically, we adopt different strategies for modeling the similarity between questions and user profiles in the two models and propose to explore the interactive relationships between question sequences and user profile sequences. Our research can provide experience and references for other domain-specific text understanding tasks in CQA.
- We incorporate a corpus-level tag-word topic model to learn global matching features and topic semantic information in QR-tBERTint to help handle domain-specific cases, and we combine contextualized embeddings with traditional word embeddings to construct more meaningful representations in QR-BERTrep to enhance matching performance.
- We conducted a detailed experimental study using a real-world dataset from Stack Overflow. We evaluated the performance of our two methods and compared them with several baseline approaches. The experimental results show that our approaches yield satisfactory performance and significantly outperform the baselines.
The rest of this paper is organized as follows. In Section 2, we review the related work. Section 3 details the two proposed models. In Section 4, we present our dataset and experimental settings. Finally, we present our experimental results and discussion in Section 5. Section 6 concludes this paper.

Question Routing
Question routing is a fundamental task that has been widely studied in social communities and is also referred to as expert finding or expert recommendation in many studies. Statistical language models [4,3,40] and topic models [39,32] played an essential role for a long time. Although they can solve question routing tasks, they all lack deep textual understanding and fail to capture complex semantic features. Recently, deep learning technologies have brought a revolutionary way to solve question routing with more concise and efficient architectures [25,36,1,31,15]. A method that directly applies deep neural networks to question routing was proposed by Azzam et al. [1] based on the Deep Semantic Similarity Model [11]. In this model, questions and user profiles are mapped to a low-dimensional semantic space through a deep neural network, and the similarity score is computed with the cosine similarity function. Later, CNNs and LSTMs were gradually introduced into the NLP field, bringing significant improvements. Wang et al. [31] designed a variant of the CNN architecture to capture the semantics of text for expert recommendation. Chen et al. [37] described an effective convolutional neural network with three filters of different sizes that learns representations of questions and answers to identify experts. In another work [36], an LSTM (long short-term memory) network [10] was employed instead of a CNN to learn question embeddings on the Quora dataset. More recently, Li et al. [15] proposed combining the embedding of the question raiser, learned by a heterogeneous information network representation algorithm, with the embedding of the question content to enhance the characterization of the question. However, the performance of these representation-based deep learning methods often suffers from data sparsity and inefficient feature extraction. In addition, they encode the question and the user profile as two separate sequences, running the risk that the interaction between the text sequences is ignored. Different from the above studies, in this paper we not only incorporate contextualized embeddings obtained from an efficient self-attention-based feature extractor into traditional word embeddings to improve representation performance, but also propose to explore the interactive relationships between question and user sequences in addition to focusing on text semantic learning.

Pre-training Language Models
A pre-trained language model aims to learn word embeddings or representations with prior semantic knowledge by performing pre-training tasks on large unlabeled corpora. Researchers from Google released the bidirectional language representation model BERT [8], which removes the unidirectional constraint of GPT [26] and extends the model to multi-layer bidirectional Transformer [29] blocks, achieving the best performance on many NLP tasks such as machine translation, text classification, and question retrieval.
Many recent works have also introduced BERT to solve question routing or expert recommendation tasks and achieved relatively good results beyond traditional approaches. However, these works mainly use pre-trained BERT models as encoders and feature extractors, so the potential of BERT is not fully exploited, which limits the improvement in overall question routing performance. For example, Zhang et al. [35] proposed a Temporal Context-aware Question Routing model in which BERT is only used to encode the question content. Peinelt et al. [23] proposed a semantic enhancement approach that combines BERT embeddings with LDA-based topics for semantic similarity prediction on the Quora dataset, achieving better performance than vanilla BERT. However, this approach cannot be directly applied to our task because our dataset requires learning programming-specific terms, and the questions are too short for reliable topic derivation. Therefore, we use a more targeted topic model, the tag-word topic model, to learn domain-specific terms and provide corpus-level semantic information.

Our Proposed Models for Question Routing
In this section, we describe two BERT-based models, QR-BERTrep and QR-tBERTint, for the question routing task. In brief, QR-BERTrep is a feature ensemble method in which each text sequence goes through the pre-trained BERT network separately. The outputs of the last four Transformer layers corresponding to each input token are then extracted as additional textual features and incorporated into a Siamese neural matching network. Compared with QR-BERTrep, QR-tBERTint concatenates the two text sequences into a longer sequence and adopts a more flexible fine-tuning approach to learn the interaction between questions and users from the beginning.

Community Question Answering: Stack Overflow
First, we introduce the necessary background on Stack Overflow, including its main characteristics and question-answering mechanism. There are several important components in a Q&A thread:
1. Questions are the central element of Stack Overflow and include a title, body, and tags. The life cycle of a question begins in an open state in which any user can provide an answer. Once the questioner chooses the best answer, or other users choose the best answer by voting, the question is considered solved and receives no more answers.
2. Answers are provided by different users and can be voted on by other users. The more votes an answer receives, the higher its approval and the better its quality. Apart from the best answer, all other answers in a thread are sorted in descending order of votes.
3. The Best (Accepted) Answer is selected by the questioner, or by other users as the answer that received the largest number of votes. Each question has only one best answer, and it sits at the top of the answer list.
4. Users include questioners and respondents, whose basic information is displayed under the questions or answers related to them.
5. Tags are assigned by the questioner and represent the knowledge areas relevant to the question.
A CQA website accumulates an enormous number of question-answer threads that provide a plethora of textual information for us to explore.

Problem Statement
As detailed above, a CQA dataset is built upon the static archive of a CQA website, which preserves all the question-answer threads accumulated over time. Let Q = {q_1, q_2, ..., q_n} be a question set (n is the number of questions) and U = {u_1, u_2, ..., u_m} be an answerer set (m is the number of users). For each answerer u_i, a document p_i is the combination of all the best answers provided by u_i, and p_i is referred to as the profile of that answerer.
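The profile construction described above can be sketched as follows; the (user_id, answer_text) input format is an assumption for illustration:

```python
from collections import defaultdict

def build_profiles(best_answers):
    """Build each answerer's profile p_u by combining all the best
    answers that user u provided. `best_answers` is a list of
    (user_id, answer_text) pairs (hypothetical input format)."""
    grouped = defaultdict(list)
    for user_id, answer_text in best_answers:
        grouped[user_id].append(answer_text)
    # A profile is the concatenation of all of the user's best answers.
    return {u: " ".join(texts) for u, texts in grouped.items()}

records = [("u1", "Use a list comprehension."),
           ("u2", "Check the stack trace first."),
           ("u1", "Prefer pathlib over os.path.")]
profiles = build_profiles(records)
```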
Using the above notation, we formalize the question routing task as a text matching and ranking problem, defined as follows. Given a newly posted question q, let C ⊆ U be a candidate set C = {c_1, c_2, ..., c_k} (k is the number of candidates) with the corresponding candidate profiles P = {p_1, p_2, ..., p_k}. We need to rank the users in C and route q to the highly ranked users, who are the most suitable to answer q with the required knowledge. An essential part of this task is learning the match patterns and capturing the relationships between the question q and the profile p_i ∈ P of each candidate. Specifically, in our work, we need to estimate a score r_i of how relevant a candidate's profile p_i is to the newly posted question q, or equivalently, a probability Pr of how likely a user's profile p_i is given the question q.
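A minimal sketch of this ranking formalization, with a simple word-overlap score standing in for the learned relevance model r_i:

```python
def route_question(question, candidates, score_fn, top_n=3):
    """Rank candidate answerers by a relevance score r_i = score_fn(q, p_i)
    and return the top-n user ids. `score_fn` can be any matching model."""
    scored = [(cid, score_fn(question, profile))
              for cid, profile in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored[:top_n]]

# Toy relevance score: word overlap between the question and a profile.
def overlap_score(q, p):
    return len(set(q.lower().split()) & set(p.lower().split()))

candidates = {"u1": "python list sorting tips",
              "u2": "java garbage collection tuning",
              "u3": "sorting a python dict by value"}
top = route_question("how to sort a python list", candidates, overlap_score, top_n=2)
```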

Tag-word Topic Enhanced Interaction-based Approach: QR-tBERTint
In this section, we design a topic-enhanced BERT-based model named QR-tBERTint for the question routing task, the overall framework of which is illustrated in Figure 1. To learn the structural and textual relevance, we assemble the question sequence and the user profile sequence into a longer text sequence and encode it with stacked Transformer blocks. We take the special embedding of the first token in the last layer as the fusion relevance representation of the combined sequence. Meanwhile, a tag-word topic model TTM [6] is adopted to derive high-quality topics by building tag-word co-occurrence at the corpus level, thereby helping to enhance domain-specific knowledge understanding and relieve the data sparsity problem. Based on previous studies that successfully combined corpus-level topics with neural networks [23,33,22], we take the concatenation of the sequence pair fusion representation S obtained from BERT and the sequence-level topic representations T_Q and T_P obtained from the tag-word topic model as the final representation F and send it to the subsequent task-specific ranking layers:

F = [S; T_Q; T_P],

where S ∈ ℝ^e, T_Q ∈ ℝ^K, and T_P ∈ ℝ^K. K denotes the number of topics.
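The fusion step can be sketched as a plain concatenation, using toy dimensions (in practice e would be BERT's hidden size of 768 and the vectors would come from the trained models):

```python
import numpy as np

e, K = 8, 4              # toy hidden size e and topic count K
S = np.random.rand(e)    # sequence pair fusion representation from BERT ([CLS])
T_Q = np.random.rand(K)  # topic representation of the question
T_P = np.random.rand(K)  # topic representation of the user profile

# Final representation F = [S; T_Q; T_P] sent to the ranking layers.
F = np.concatenate([S, T_Q, T_P])
```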

Corpus-level topic representation module. Topic models have been shown to provide additional information that enhances text understanding and matching in earlier feature engineering-based models, and they are particularly effective for dealing with domain-specific terms [38,32]. In recent years, many deep neural methods have achieved impressive performance on NLP tasks such as domain recommendation [33], semantic analysis [23,22], and machine translation [5] by incorporating topic models. However, extracting topics from the relatively short texts in CQA and constructing an efficient fusion model to combine corpus-level topic information is very challenging in the question routing task. According to the characteristics of Q&A threads described in Section 3.1, we use the unsupervised tag-word topic model [6] to derive corpus-level topical representations T_Q ∈ ℝ^K and T_P ∈ ℝ^K for questions and users.
First, we construct a tag-word pool by pairing each word with each tag. For example, a question with two tags (t_1, t_2) and three words (w_1, w_2, w_3) generates six tag-words: {(t_1, w_1), (t_1, w_2), (t_1, w_3), (t_2, w_1), (t_2, w_2), (t_2, w_3)}. Second, we consider the entire corpus as a mixture of topics whose distribution over topics comes from a Dirichlet allocation with prior α, and assume there are K topics whose distributions over tags and words are sampled from Dirichlet allocations with priors γ and β, respectively. The joint probability of a tag-word can be formulated as:

P(t_i, w_j) = Σ_{k=1}^{K} P(k) P(t_i | k) P(w_j | k),

where k ∈ [1, K] denotes a topic, θ ∼ Dir(α) denotes the topic distribution for the whole collection, φ_k ∼ Dir(γ) denotes a topic-specific tag distribution, and ϕ_k ∼ Dir(β) denotes a topic-specific word distribution. We take the tag-word as the basic unit of the topic model and aggregate all tag-words from the whole corpus for training. After obtaining the tag-word topic model, the question sequence and the user profile sequence are passed to the topic model to infer a topic-level embedding per sequence:

T_Q = TTM([QT_1, QT_2, ..., QT_N]),  T_P = TTM([PT_1, PT_2, ..., PT_M]),

where K denotes the number of topics, [QT_1, QT_2, ..., QT_N] denotes the question sequence, and [PT_1, PT_2, ..., PT_M] denotes the user profile sequence.
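The tag-word pool construction is simply the Cartesian product of a question's tags and words, which can be sketched as:

```python
from itertools import product

def tag_word_pool(tags, words):
    """Generate the tag-word pairs for one question: the Cartesian
    product of its tags and its words, as described for TTM."""
    return list(product(tags, words))

# Two tags and three words yield the six tag-words from the example above.
pairs = tag_word_pool(["t1", "t2"], ["w1", "w2", "w3"])
```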

Sequence Pair Fusion Representation Module based on BERT.
We take the linear concatenation of the question tokens and the profile tokens as input. For a given token v_i, its final input representation h_i^0 ∈ ℝ^e is constructed by summing the word piece embedding v_i, the segment embedding s_i, and the position embedding p_i of the same dimension:

h_i^0 = v_i + s_i + p_i.

We truncate the question to have at most 64 tokens, and the user profile is truncated to ensure that the concatenation of the question, profile, and separator tokens has a maximum length of 512 tokens.
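A sketch of this truncation policy, assuming three special tokens ([CLS] and two [SEP]) in the concatenated input; the exact bookkeeping of special tokens is an assumption of this sketch:

```python
def truncate_pair(question_tokens, profile_tokens,
                  max_question=64, max_total=512, num_special=3):
    """Truncate the question to at most 64 tokens, then truncate the
    profile so that [CLS] question [SEP] profile [SEP] stays within
    512 tokens. num_special counts the assumed special tokens."""
    q = question_tokens[:max_question]
    budget = max_total - num_special - len(q)
    p = profile_tokens[:max(budget, 0)]
    return q, p

# A 100-token question and a 600-token profile fit the 512-token budget.
q, p = truncate_pair(["q"] * 100, ["p"] * 600)
```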
As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer encoder blocks, the tokens of the entire sequence are read by each Transformer encoder at once and learned by the self-attention mechanism, which produces contextualized embeddings at each position in each layer. Specifically, each Transformer layer Trm has two sublayers: MultiSelf and PFFN. The former is a multi-head self-attention network, while the latter is a position-wise fully connected feed-forward network consisting of two linear transformations with a Gaussian Error Linear Unit (GELU) activation in between. In our task, we believe that the multi-head attention mechanism can capture different types of token relationships by using different attention matrices, and the self-attention mechanism spans the entire sequence of the question and the user profile, so that question-profile interactions are learned. The specific formulations of these two sublayers can be found in [29] and will not be repeated here. Around each of the two sublayers, a residual connection is applied, and dropout is applied to the output of each sublayer [2]. In summary, the hidden representation of each layer is computed as follows:

h^l = Trm(h^{l-1}), l ∈ [1, L].

After L layers that hierarchically exchange information across all positions in the previous layer, we obtain the final output h^L for all tokens of the input sequence. Next, we perform candidate answerer ranking using this feature embedding and route the newly posted question to the candidate answerers that are ranked higher.
Output. We use the softmax function to obtain the probability of the profile being relevant:

Pr = softmax(W_o F + b_o),

where W_o ∈ ℝ^{C×(e+2K)} is a learnable projection matrix and b_o is a bias term. C is the number of labels. We compute this probability for each candidate independently and obtain the final list of experts (profiles) by ranking them with respect to these probabilities.
Fine-tuning and Training. We use a BERT-BASE model
(hidden size of 768, 12 Transformer blocks, and 12 selfattention heads) as a binary classification model.We start training from it and fine-tune it to our question routing task using the cross-entropy loss.Specifically, limited by the size of our training corpus, we freeze the weights of the first few layers of the pre-trained network during fine-tuning.We believe that a well-trained BERT model fully incorporates context information at each token position and contains sentence relationship information in the embedding of [CLS].The loss is shown in Equation (10).(10) where   denotes the relevant score of the question and user,  + is the set of indexed answerers (positive label) and  − is the set of indexed random non-answerers (negative label).The model is fine-tuned by minimizing the cross-entropy loss. (5) We truncate the question to have at most 64 tokens, and the user profile is truncated to ensure that the concatenation of the question, profile, and separator token has a maximum length of 512 tokens.
As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer encoder blocks, the tokens of the entire sequence are read by each Transformer encoder at once and learned by the self-attention mechanism that results in contextualized embeddings at different positions in each layer.Specifically, each Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multi-head self-attention mechanism-based network, while the latter is a position-wise fully connected feed-forward network which consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation in between.In our task, we believe that the multi-head attention mechanism can capture different types of token relationships by using different attention matrices, and the self-attention mechanism spans the entire sequence of questions and user profiles so that question-profile interactions are learned.The specific formulations of these two sublayers can be found in the [29] and will not be repeated here.Based on the two sublayers, a residual connection around each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In summary, the hidden representation of each layer is shown as follows: where  denotes the number of topics, [ 1 ,  2 , … ,   ] denotes the questions sequence, and denotes the user profile sequence.

Sequence Pair Fusion Representation Module based on BERT.
We take the linear concatenation of the question tokens and the profile tokens as input.For a given token   , its final input representation ℎ  0 ∈ ℝ  is constructed by summing word piece embedding   , the segment embedding   , and position embedding   of the same dimension: We truncate the question to have at most 64 tokens, and the user profile is truncated to ensure that the concatenation of the question, profile, and separator token has a maximum length of 512 tokens.As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer encoder blocks, the tokens of the entire sequence are read by each Transformer encoder at once and learned by the self-attention mechanism that results in contextualized embeddings at different positions in each layer.Specifically, each Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multi-head self-attention mechanism-based network, while the latter is a position-wise fully connected feed-forward network which consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation in between.In our task, we believe that the multi-head attention mechanism can capture different types of token relationships by using different attention matrices, and the self-attention mechanism spans the entire sequence of questions and user profiles so that question-profile interactions are learned.The specific formulations of these two sublayers can be found in the [4] and will not be repeated here.Based on the two sublayers, a residual connection around each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In summary, the hidden representation of each layer is shown as follows: −1 = LayerNorm � −1 + Dropout�   ( −1 )��.(8) After  layers that hierarchically exchange information across all positions in the previous layer, we obtain the final output   for all tokens of the input sequence.And next, we should perform candidate answerer ranking by 
using this feature embedding and then route the newly posted question to the candidate answerers that are ranked higher.Output.We use the softmax function to obtain the probability of the profile being relevant: = softmax(  •  +   ), (9) where   ∈ ℝ ×(+2) is the learnable projection matrix and   is bias terms.C is the number of labels.We compute this probability for each candidate independently and obtain the final list of experts (profiles) by ranking them with respect to these probabilities.Fine-tune and Training.We use a BERTBASE model (hidden size of 768, 12 Transformer blocks, and 12 selfattention heads) as a binary classification model.We start training from it and fine-tune it to our question routing task using the cross-entropy loss.Specifically, limited by the size of our training corpus, we freeze the weights of the first few layers of the pre-trained network during fine-tuning.We believe that a well-trained BERT model fully incorporates context information at each token position and contains sentence relationship information in the embedding of [CLS].The loss is shown in Equation (10)., where  denotes the number of topics, [ 1 ,  2 , … ,   ] denotes the questions seque [ 1 ,  2 , … ,   ] denotes the user profile sequence.

Sequence Pair Fusion Representation Module based on BERT.
We take the linear co question tokens and the profile tokens as input.For a given token   , its final input represe constructed by summing word piece embedding   , the segment embedding   , and positio the same dimension: We truncate the question to have at most 64 tokens, and the user profile is truncated concatenation of the question, profile, and separator token has a maximum length of 512 to As illustrated in Figure 1, when the input sequence passes through the multi-layer Transform the tokens of the entire sequence are read by each Transformer encoder at once and learned b mechanism that results in contextualized embeddings at different positions in each layer.
Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multimechanism-based network, while the latter is a position-wise fully connected feed-forwa consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation task, we believe that the multi-head attention mechanism can capture different types of toke using different attention matrices, and the self-attention mechanism spans the entire sequenc user profiles so that question-profile interactions are learned.The specific formulations of t can be found in the [4] and will not be repeated here.Based on the two sublayers, a residual each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In su representation of each layer is shown as follows: −1 = LayerNorm � −1 + Dropout�   ( −1 )��.(8) After  layers that hierarchically exchange information across all positions in the previous l final output   for all tokens of the input sequence.And next, we should perform candidate by using this feature embedding and then route the newly posted question to the candidate ranked higher.Output.We use the softmax function to obtain the probability of the profile being relevant   = softmax(  •  +   ), (9) where   ∈ ℝ ×(+2) is the learnable projection matrix and   is bias terms.C is the num compute this probability for each candidate independently and obtain the final list of ex ranking them with respect to these probabilities.Fine-tune and Training.We use a BERTBASE model (hidden size of 768, 12 Transformer b attention heads) as a binary classification model.We start training from it and fine-tune routing task using the cross-entropy loss.Specifically, limited by the size of our training co weights of the first few layers of the pre-trained network during fine-tuning.We believe BERT model fully incorporates context information at each token position and contains sen where  denotes the number of topics, [ 1 ,  2 , … ,   ] denotes the questions sequence, and [ 1 
,  2 , … ,   ] denotes the user profile sequence.

Sequence Pair Fusion Representation Module based on BERT.
We take the linear concatenation of question tokens and the profile tokens as input.For a given token   , its final input representation ℎ  0 ∈ ℝ constructed by summing word piece embedding   , the segment embedding   , and position embedding  the same dimension: We truncate the question to have at most 64 tokens, and the user profile is truncated to ensure that concatenation of the question, profile, and separator token has a maximum length of 512 tokens.As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer encoder bloc the tokens of the entire sequence are read by each Transformer encoder at once and learned by the self-attent mechanism that results in contextualized embeddings at different positions in each layer.Specifically, e Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multi-head self-attent mechanism-based network, while the latter is a position-wise fully connected feed-forward network wh consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation in between.In task, we believe that the multi-head attention mechanism can capture different types of token relationships using different attention matrices, and the self-attention mechanism spans the entire sequence of questions user profiles so that question-profile interactions are learned.The specific formulations of these two sublay can be found in the [4] and will not be repeated here.Based on the two sublayers, a residual connection arou each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In summary, the hid representation of each layer is shown as follows: −1 = LayerNorm � −1 + Dropout�    ( −1 )��.(8) After  layers that hierarchically exchange information across all positions in the previous layer, we obtain final output   for all tokens of the input sequence.And next, we should perform candidate answerer rank by using this feature embedding and then route the newly 
posted question to the candidate answerers that ranked higher.Output.We use the softmax function to obtain the probability of the profile being relevant: = softmax(  •  +   ), (9) where   ∈ ℝ ×(+2) is the learnable projection matrix and   is bias terms.C is the number of labels.compute this probability for each candidate independently and obtain the final list of experts (profiles) ranking them with respect to these probabilities.Fine-tune and Training.We use a BERTBASE model (hidden size of 768, 12 Transformer blocks, and 12 s attention heads) as a binary classification model.We start training from it and fine-tune it to our quest routing task using the cross-entropy loss.Specifically, limited by the size of our training corpus, we freeze weights of the first few layers of the pre-trained network during fine-tuning.We believe that a well-trai , where  denotes the number of topics, [ 1 ,  2 , … ,   ] denotes the questions sequenc [ 1 ,  2 , … ,   ] denotes the user profile sequence.

Sequence Pair Fusion Representation Module based on BERT.
We take the linear conc question tokens and the profile tokens as input.For a given token   , its final input represent constructed by summing word piece embedding   , the segment embedding   , and position the same dimension: ℎ  0 =   +   +   .
We truncate the question to have at most 64 tokens, and the user profile is truncated to concatenation of the question, profile, and separator token has a maximum length of 512 toke As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer the tokens of the entire sequence are read by each Transformer encoder at once and learned by t mechanism that results in contextualized embeddings at different positions in each layer.S Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multi-he mechanism-based network, while the latter is a position-wise fully connected feed-forward consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation in task, we believe that the multi-head attention mechanism can capture different types of token using different attention matrices, and the self-attention mechanism spans the entire sequence user profiles so that question-profile interactions are learned.The specific formulations of the can be found in the [4] and will not be repeated here.Based on the two sublayers, a residual co each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In summ representation of each layer is shown as follows: Trm( −1 ) = LayerNorm � −1 + Dropout�( −1 )��, −1 = LayerNorm � −1 + Dropout�   ( −1 )��.(8) After  layers that hierarchically exchange information across all positions in the previous lay final output   for all tokens of the input sequence.And next, we should perform candidate a by using this feature embedding and then route the newly posted question to the candidate an ranked higher.Output.We use the softmax function to obtain the probability of the profile being relevant: = softmax(  •  +   ), (9) where   ∈ ℝ ×(+2) is the learnable projection matrix and   is bias terms.C is the numb compute this probability for each candidate independently and obtain the final list of expe ranking them with respect to these 
probabilities.Fine-tune and Training.We use a BERTBASE model (hidden size of 768, 12 Transformer blo attention heads) as a binary classification model.We start training from it and fine-tune it routing task using the cross-entropy loss.Specifically, limited by the size of our training corp where  denotes the number of topics, [ 1 ,  2 , … ,   ] denotes the questions sequence, and [ 1 ,  2 , … ,   ] denotes the user profile sequence.

Sequence Pair Fusion Representation Module based on BERT.
We take the linear concatenation of question tokens and the profile tokens as input.For a given token   , its final input representation ℎ  0 ∈ ℝ constructed by summing word piece embedding   , the segment embedding   , and position embedding  the same dimension: We truncate the question to have at most 64 tokens, and the user profile is truncated to ensure that concatenation of the question, profile, and separator token has a maximum length of 512 tokens.As illustrated in Figure 1, when the input sequence passes through the multi-layer Transformer encoder blo the tokens of the entire sequence are read by each Transformer encoder at once and learned by the self-atten mechanism that results in contextualized embeddings at different positions in each layer.Specifically, Transformer layer Trm has two sublayers: MultiSelf and PFFN.The former is a multi-head self-atten mechanism-based network, while the latter is a position-wise fully connected feed-forward network w consists of two linear transformations with Gaussian Error Linear Unit (GELU) activation in between.In task, we believe that the multi-head attention mechanism can capture different types of token relationship using different attention matrices, and the self-attention mechanism spans the entire sequence of questions user profiles so that question-profile interactions are learned.The specific formulations of these two subla can be found in the [4] and will not be repeated here.Based on the two sublayers, a residual connection aro each of the two sub-layers and dropouts to the output of each sub-layer is applied [2].In summary, the hid representation of each layer is shown as follows: Trm( −1 ) = LayerNorm � −1 + Dropout�( −1 )��, −1 = LayerNorm � −1 + Dropout�    ( −1 )��.(8) After  layers that hierarchically exchange information across all positions in the previous layer, we obtain final output   for all tokens of the input sequence.And next, we should perform candidate answerer ran by using this feature 
embedding and then route the newly posted question to the candidate answerers tha ranked higher.Output.We use the softmax function to obtain the probability of the profile being relevant: = softmax(  •  +   ), (9) where   ∈ ℝ ×(+2) is the learnable projection matrix and   is bias terms.C is the number of labels.compute this probability for each candidate independently and obtain the final list of experts (profiles ranking them with respect to these probabilities.Fine-tune and Training.We use a BERTBASE model (hidden size of 768, 12 Transformer blocks, and 12 attention heads) as a binary classification model.We start training from it and fine-tune it to our ques (8) After L layers that hierarchically exchange information across all positions in the previous layer, we obtain the final output H L for all tokens of the input sequence.And next, we should perform candidate answerer ranking by using this feature embedding and then route the newly posted question to the candidate answerers that are ranked higher.
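The steps above — scoring each candidate independently with a softmax and ranking by the resulting probabilities — can be sketched as follows. This is a minimal numpy illustration with toy shapes; the function name `rank_candidates` and the tiny dimensions are ours, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rank_candidates(features, W_f, b_f):
    """Score each (question, profile) feature vector independently and
    rank candidates by the probability of the 'relevant' label.

    features: (num_candidates, e + 2k) fused feature vectors
    W_f:      (C, e + 2k) learnable projection, with C = 2 labels here
    b_f:      (C,) bias term
    """
    logits = features @ W_f.T + b_f   # (num_candidates, C)
    probs = softmax(logits)           # per-candidate softmax over labels
    relevance = probs[:, 1]           # P(relevant) for each candidate
    order = np.argsort(-relevance)    # descending ranking of candidates
    return order, relevance

# toy usage: 3 candidates, feature size e + 2k = 4, C = 2 labels
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))
b = np.zeros(2)
order, rel = rank_candidates(feats, W, b)
```

The key point is that each candidate is scored independently; the ordering, not the absolute probability, drives routing.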
Fine-tune and Training. We use a BERT_BASE model (hidden size of 768, 12 Transformer blocks, and 12 self-attention heads) as a binary classification model. We start training from it and fine-tune it on our question routing task using the cross-entropy loss. Specifically, limited by the size of our training corpus, we freeze the weights of the first few layers of the pre-trained network during fine-tuning. We believe that a well-trained BERT model fully incorporates context information at each token position and encodes sentence relationship information in the embedding of [CLS]. The loss is shown in Equation (10):

L = − Σ_{i ∈ I+} log(r_i) − Σ_{j ∈ I−} log(1 − r_j), (10)

where r_i denotes the relevance score of the question and user, I+ is the set of indexed answerers (positive label), and I− is the set of indexed random non-answerers (negative label). The model is fine-tuned by minimizing this cross-entropy loss.
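A minimal numpy sketch of this training objective, assuming `r` holds the predicted relevance scores and `labels` marks indexed answerers (I+) versus sampled non-answerers (I−):

```python
import numpy as np

def qr_cross_entropy(r, labels):
    """Binary cross-entropy over candidate answerers.

    r:      (n,) predicted relevance scores in (0, 1)
    labels: (n,) 1 for indexed answerers (I+), 0 for random non-answerers (I-)
    """
    r = np.clip(r, 1e-12, 1 - 1e-12)          # avoid log(0)
    pos = -np.log(r[labels == 1]).sum()        # answerers: push r_i toward 1
    neg = -np.log(1 - r[labels == 0]).sum()    # non-answerers: push r_j toward 0
    return pos + neg

# toy usage: one answerer scored 0.9, one sampled non-answerer scored 0.2
loss = qr_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0]))
```

Confident, correct predictions drive the loss toward zero; confident mistakes are penalized heavily.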

Contextual Representation-based Approach: QR-BERT rep
Different from the above method that focuses on exploring interactions between sequences through BERT and incorporating tag-word topic models to enhance understanding of corpus-level semantic information, QR-BERT rep incorporates the weighted sum of the outputs of different layers of BERT as an additional feature into a traditional Siamese deep matching model.By combining contextualized embeddings with word embeddings, the representations of question sequences and user profile sequences can imply richer semantic knowledge and patterns, helping to improve the expert discovery effect obtained by similarity computation.The overall framework of the contextual representation-based model QR-BERT rep is shown in Figure 2.
BERT contextualized embedding.Instead of concatenating the question tokens and the profile tokens into a single sequence as input, in this method, the question tokens and the profile tokens are fed into the pre-trained BERT BASE model separately to obtain the contextualized embedding layer by layer.
Encoding layer.Since BERT generates L-layer hidden states for all BPE tokens in a sequence, and each hidden layer captures different features and information, we employ a weighted sum of these hidden states to obtain a finer-grained embedding.Specifically, we take the hidden states of the last four layers in BERT.Suppose a word w is tokenized into n BPE tokens w = {b_1, b_2, ..., b_n}, and h_i^l represents the embedding of token b_i in the l-th layer of BERT, 1 ≤ l ≤ L, 1 ≤ i ≤ n.Then, the contextualized embedding of word w, ConEM_w, is calculated as the weighted sum of the layer-averaged embeddings of the last four layers.

ConEM_w = Σ_{l=L−3}^{L} δ_l · (1/n) Σ_{i=1}^{n} h_i^l, (11)

where δ_l denotes the weight for each layer.Then, we concatenate the 300-dim GloVe embedding and the contextualized embedding ConEM_w together to build a richer representation for each word.Therefore, the input vector for each word in the question sequence and profile sequence is w = [GloVe(w); ConEM_w].
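As an illustration of this encoding layer, the sketch below averages a word's BPE-token embeddings within each of the last four BERT layers and then takes a weighted sum over those layers; the array shapes and weight values are hypothetical:

```python
import numpy as np

def contextual_embedding(layer_states, deltas):
    """ConEM_w: weighted sum, over the last four BERT layers, of the
    average embedding of a word's BPE tokens.

    layer_states: (L, n, d) hidden states for the word's n BPE tokens
    deltas:       (4,) weights delta_l for the last four layers
    """
    last4 = layer_states[-4:]        # (4, n, d) hidden states of the last 4 layers
    token_avg = last4.mean(axis=1)   # (4, d) average over the word's BPE tokens
    return np.tensordot(deltas, token_avg, axes=1)  # (d,) weighted layer sum

# toy usage: L = 12 layers, n = 3 BPE tokens, hidden size d = 8
rng = np.random.default_rng(1)
states = rng.normal(size=(12, 3, 8))
conem = contextual_embedding(states, np.array([0.1, 0.2, 0.3, 0.4]))
glove = rng.normal(size=(300,))            # stand-in for a 300-dim GloVe vector
word_vec = np.concatenate([glove, conem])  # w = [GloVe(w); ConEM_w]
```

The final word vector simply concatenates the static GloVe embedding with the contextualized ConEM_w feature.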
Siamese neural ranking model.After encoding each word into a fixed-length fusion vector, we represent the question sequence and profile sequence by the fusion embeddings and feed them into a Siamese neural ranking model, which consists of two fully connected hidden layers with 300 nodes.This model is used to map word vectors to their semantic concept vectors for further similarity calculation.In detail, if we denote x as the input word vector, y as the output vector, h_i as the hidden layer vector, W_i as the i-th weight matrix, and b_i as the i-th bias term, the mathematical formulas for each layer are described as follows:

Contextual Representation-based Approach: QR-BERTrep
Different from the above method that focuses on exploring interactions between sequences through BERT and incorporating tag-word topic models to enhance understanding of corpus-level semantic information, QR-BERTrep incorporates the weighted sum of the outputs of different layers of BERT as an additional feature into a traditional Siamese deep matching model.By combining contextualized embeddings with word embeddings, the representations of question sequences and user profile sequences can imply richer semantic knowledge and patterns, helping to improve the expert discovery effect obtained by similarity computation.The overall framework of the contextual representation-based model QR-BERTrep is shown in Figure 2. BERT contextualized embedding.Instead of concatenating the question tokens and the profile tokens into a single sequence as input, in this method, the question tokens and the profile tokens are fed into the pre-trained BERTBASE model separately to obtain the contextualized embedding layer by layer.Encoding layer.Since BERT generates -layer hidden states for all BPE tokens in a sequence, and each hidden layer contains different features and information, we employ a weighted sum of these hidden states to obtain more delicate embedding.Specifically, we take the hidden states of the last four layers in BERT.Suppose a word  is tokenized to  BPE tokens  = { 1 ,  2 , … ,   }, and ℎ   represents the token embedding in the -th layer of BERT, 1 ≤  ≤ L, 1 ≤  ≤ .Then, the contextualized embedding of word , ConEM  , is calculated as the weighted sum average of the embedding of the last four layers.
where   denotes the weight for each layer.Then, we concatenate the 300-dim GloVe embedding and contextualized embedding ConEM  together to build a richer representation for each word.Therefore, the input vector for each word in the question sequence and profile sequence is  = [GloVe(w);ConEM  ].Siamese neural ranking model.After encoding each word into a fixed-length fusion vector, we represent the question sequence and profile sequence by the fusion embeddings and feed them into a Siamese neural ranking model, which consists of two fully connected hidden layers with 300 nodes.This model is used to map word vectors to their semantic concept vectors for further similarity calculation.In detail, if we denote  as the input word vector,  as the output vector, ℎ  as the hidden layer vector,   as the  ℎ weight matrix, and   as the  ℎ bias term, the mathematical formulas for each layer are described as follows: where the  value goes from the first hidden layer  = 2 to the output layer  = , and we use the tanh as the activation function: (15) Output layer.The output layer consists of 128 nodes.We measure the semantic similarity between question q and profile document p as:

Contextual Representation-based Approach: QR-BERTrep
Different from the above method, which focuses on exploring interactions between sequences through BERT and incorporating tag-word topic models to enhance understanding of corpus-level semantic information, QR-BERTrep incorporates the weighted sum of the outputs of different layers of BERT as an additional feature into a traditional Siamese deep matching model. By combining contextualized embeddings with word embeddings, the representations of question sequences and user profile sequences can carry richer semantic knowledge and patterns, helping to improve the expert discovery effect obtained by similarity computation. The overall framework of the contextual representation-based model QR-BERTrep is shown in Figure 2.

BERT contextualized embedding. Instead of concatenating the question tokens and the profile tokens into a single sequence as input, in this method the question tokens and the profile tokens are fed into the pre-trained BERT-Base model separately to obtain the contextualized embeddings layer by layer.

Encoding layer. Since BERT generates L-layer hidden states for all BPE tokens in a sequence, and each hidden layer contains different features and information, we employ a weighted sum of these hidden states to obtain a more delicate embedding. Specifically, we take the hidden states of the last four layers in BERT. Suppose a word w is tokenized into m BPE tokens t = {t_1, t_2, ..., t_m}, and h_j^i represents the embedding of token t_j in the i-th layer of BERT, 1 ≤ i ≤ L, 1 ≤ j ≤ m. Then, the contextualized embedding of word w, ConEM_w, is calculated as the weighted sum average of the embeddings of the last four layers:

ConEM_w = (1/m) Σ_{j=1}^{m} Σ_{i=L-3}^{L} α_i h_j^i
where α_i denotes the weight for layer i. Then, we concatenate the 300-dim GloVe embedding and the contextualized embedding ConEM_w to build a richer representation for each word. Therefore, the input vector for each word in the question sequence and profile sequence is x = [GloVe(w); ConEM_w].

Siamese neural ranking model. After encoding each word into a fixed-length fusion vector, we represent the question sequence and profile sequence by the fusion embeddings and feed them into a Siamese neural ranking model, which consists of two fully connected hidden layers with 300 nodes. This model is used to map word vectors to their semantic concept vectors for further similarity calculation. In detail, if we denote x as the input word vector, y as the output vector, h_i as the i-th hidden layer vector, W_i as the i-th weight matrix, and b_i as the i-th bias term, the mathematical formulas for each layer are described as follows:

h_1 = W_1 x
h_i = tanh(W_i h_{i-1} + b_i)
y = tanh(W_N h_{N-1} + b_N)

where the i value goes from the first hidden layer i = 2 to the output layer i = N, and we use tanh as the activation function:

tanh(z) = (1 - e^{-2z}) / (1 + e^{-2z})   (15)

Output layer. The output layer consists of 128 nodes. We measure the semantic similarity between question q and profile document p as:

R(q, p) = cos(y_q, y_p) = y_q^T y_p / (||y_q|| ||y_p||)

where y_q and y_p are the concept vectors of the question and the user's profile, respectively. We apply the softmax function on the output to convert the similarity relevance score into a probability of the user's profile given the question, as shown below:

P(p | q) = exp(R(q, p)) / Σ_{p' ∈ K} exp(R(q, p'))

where K denotes the number of candidates to be ranked. We approximate K to be the list of the answerers, which includes the actual answerers and three randomly selected non-answerers. We detail the construction method of negative examples and positive examples for training in Section 4.1.
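The forward pass described above can be illustrated with a minimal numpy sketch. All parameters here are random placeholders, and mean pooling over the word vectors is our simplifying assumption (the paper does not state the pooling scheme); the real model uses BERT hidden states and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D_BERT, D_GLOVE, D_HID, D_OUT = 768, 300, 300, 128

def conem(last4, alpha):
    """Contextualized embedding of one word: a weighted sum of the
    last four BERT layers, averaged over the word's m BPE tokens.
    last4: (4, m, 768) hidden states; alpha: (4,) layer weights."""
    return np.tensordot(alpha, last4, axes=1).mean(axis=0)  # (768,)

def concept_vector(seq, W1, W2, b2, W3, b3):
    """Map a sequence of fusion vectors [GloVe(w); ConEM_w] to its
    128-dim concept vector: h1 = W1 x, h2 = tanh(W2 h1 + b2),
    y = tanh(W3 h2 + b3)."""
    x = seq.mean(axis=0)          # pooled sequence representation
    h1 = W1 @ x
    h2 = np.tanh(W2 @ h1 + b2)
    return np.tanh(W3 @ h2 + b3)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Per-word fusion vector for a word split into 3 BPE tokens.
alpha = np.full(4, 0.25)
ctx = conem(rng.normal(size=(4, 3, D_BERT)), alpha)
fusion = np.concatenate([rng.normal(size=D_GLOVE), ctx])  # 1068-dim

# Toy Siamese forward pass with random, untrained parameters.
W1 = rng.normal(scale=0.05, size=(D_HID, D_GLOVE + D_BERT))
W2 = rng.normal(scale=0.05, size=(D_HID, D_HID)); b2 = np.zeros(D_HID)
W3 = rng.normal(scale=0.05, size=(D_OUT, D_HID)); b3 = np.zeros(D_OUT)
q_seq = rng.normal(size=(12, D_GLOVE + D_BERT))   # question words
p_seq = rng.normal(size=(80, D_GLOVE + D_BERT))   # profile words
y_q = concept_vector(q_seq, W1, W2, b2, W3, b3)
y_p = concept_vector(p_seq, W1, W2, b2, W3, b3)
score = cosine(y_q, y_p)                          # R(q, p)
```

Both towers share the same weights, which is what makes the network Siamese: the question and profile are projected into a common 128-dim concept space before the cosine comparison.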

Training.
In training, the model parameters are estimated to maximize the likelihood of the positive answerers given the questions across the training set. Put another way, we need to minimize the loss function, as shown in Equation (18):

L = -log Π_{(q, p+) ∈ I+} P(p+ | q)   (18)

where I+ is the set of indexed answerers (positive labels).
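A hedged sketch of this objective for one question: the labeled answerer's softmax probability over one positive and three sampled negatives, turned into a negative log-likelihood (the function name and the absence of a softmax smoothing factor are our assumptions).

```python
import numpy as np

def routing_loss(pos_scores, neg_scores):
    """Mean negative log-likelihood of the labeled answerers (Eq. 18).
    pos_scores[i]: R(q_i, p+) for the positive profile of question i;
    neg_scores[i]: array of R(q_i, p-) for its sampled non-answerers."""
    total = 0.0
    for sp, sn in zip(pos_scores, neg_scores):
        logits = np.concatenate([[sp], sn])
        # softmax probability of the positive profile over all candidates
        p_pos = np.exp(sp) / np.exp(logits).sum()
        total -= np.log(p_pos)
    return total / len(pos_scores)

# One question: one answerer ranked against three random non-answerers.
loss = routing_loss([0.9], [np.array([0.1, -0.2, 0.3])])
```

Raising the positive score relative to the negatives lowers the loss, which is exactly what gradient descent on this objective pushes the similarity scores to do.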

Dataset
We constructed our dataset from a Stack Overflow snapshot by following all conditions mentioned in [12] and [27]. To reduce the size of the dataset while preserving the properties of the original dataset, we use the 21 tags reported in [12] to create a subset that is mainly selected according to the following criteria: each selected question is an archived question with an accepted answer (i.e., best answer), it has at least 2 answers, and at least one of its tags matches the selected 21 specific tags. All questions are lowercased, and we only keep the questions with at least 2 words left after removing the stop words. The purpose of these selection operations is to filter out low-quality posts. As a result, the final subset contains 92,411 CQA sessions. According to the posted timestamps, the first 12 months of data are used as the training data, and the remaining data are used for testing. Therefore, the training and testing data do not overlap. There were 81,295 sessions in the training set and 11,116 sessions in the test set. Given the need to predict the best answerer and the reality that only a few users are responsible for the vast majority of answers in CQA, three user sets D_X were constructed based on the number of answers X provided by users in the training set (X = 10, 15, and 20 in this work). As can be seen from Table 1, set D20 includes 2,977 users, indicating that these users provided at least 20 answers in this training set. Moreover, for each of the 8,371 training questions, the questioner, the best answerer, and at least one other answerer are among these 2,977 users. The 517 test questions were routed to these 2,977 users.
Following [1], if user u_i is in the list of answerers of q (the list includes the best answerer and other answerers of one thread), we consider (q, u_i) as a positive example; otherwise, we consider (q, u_j) as a negative example.

(q, answerer_1): Positive
(q, answerer_2): Positive
...
(q, answerer_n): Positive
(q, random-non-answerer_1): Negative
(q, random-non-answerer_2): Negative
(q, random-non-answerer_3): Negative
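The pair construction above can be sketched as follows. Uniform sampling stands in for the NCE-based sampling strategy mentioned later in the paper, and all names here are illustrative.

```python
import random

def build_pairs(question_id, answerers, all_users, k_neg=3, seed=0):
    """Training pairs for one question: every answerer becomes a
    positive example, and k_neg users sampled from the non-answerers
    become negatives (uniform sampling as a stand-in for NCE)."""
    rnd = random.Random(seed)
    non_answerers = [u for u in all_users if u not in set(answerers)]
    pairs = [(question_id, u, 1) for u in answerers]
    pairs += [(question_id, u, 0) for u in rnd.sample(non_answerers, k_neg)]
    return pairs

pairs = build_pairs("q42", ["alice", "bob"],
                    ["alice", "bob", "carol", "dave", "erin", "frank"])
# two positive pairs and three negative pairs for question "q42"
```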

Baseline Methods and Experimental Setting
Baseline Methods. To evaluate the performance of our proposed models, we use the following three different types of baselines for comparison: the traditional information retrieval model, the topic-based model, and the deep learning-based model.
1 Traditional IR model TF-IDF: TF-IDF [27] is a standard measure of the importance and relevance of a word to a document, based on the frequency of that word in the document and the inverse proportion of documents containing the word over the entire document corpus. For the question routing and expert finding task, we represent the posted question and user profile as vectors of their TF-IDF weights and then calculate the cosine similarity between each user profile vector and the question vector.
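A minimal sketch of this baseline, using raw term frequency and a smoothed idf; the exact weighting variant used in [27] may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf*idf} dicts."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}   # smoothed idf
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cos(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two toy user profiles and one question; rank profiles by similarity.
profiles = [["python", "list", "sort"], ["sql", "join", "index"]]
question = ["how", "sort", "python", "list"]
vecs = tfidf_vectors(profiles + [question])
ranked = sorted(range(len(profiles)),
                key=lambda i: cos(vecs[-1], vecs[i]), reverse=True)
# the profile sharing terms with the question ranks first
```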
2 Topic-based model LDA: LDA [17] is a three-level hierarchical Bayesian model that has been widely applied to address the term mismatch problem in IR. It mainly relies on word co-occurrence relationships and takes semantic information into account. In our experiments, all questions answered by a user are concatenated to build the user profile. We use GibbsLDA++ [14] with topic size K = 100 to conduct LDA training. We set the LDA hyperparameters α = 0.5 and β = 0.1, respectively.
MLQR: MLQR [6] is a multi-objective learning-to-rank approach in which a tag-word topic model was proposed and applied to address the question routing problem. In this experiment, we set the number of topics K = 80, α = 0.7, β = 0.01, and γ = 0.01. Gibbs sampling is run for 1000 iterations.
3 Deep learning-based model QR-DSSM: QR-DSSM [1] is a typical deep Siamese neural network based on DSSM [11] that captures the semantic similarity between the profiles of the candidates and the posted question. In our experiment, to facilitate subsequent comparisons, we use GloVe embeddings to represent the sequence instead of the word hashing embedding method. The code blocks are removed from the dataset. The number of iterations of the neural network is 100, and the learning rate is 0.02.

CNN-based method: A CNN-based method [31] treats question routing as a classification problem and takes the best answerer of each question as a positive training example as well as the ground truth. We adopt CNN-non-static [13] to capture the semantics of the text for best answerer prediction, which uses filter windows of 3, 4, and 5 with 100 feature maps each. The dropout rate is 0.5, and the mini-batch size is 50.

Experimental Setting. We use the English uncased BERT-Base model released by Google, which has 12 layers, 768 hidden states, and 12 heads. Models are implemented with TensorFlow using TPUs. Regarding the selection of hyperparameters, we fixed some empirically, such as choosing the Adam weight decay optimizer for the optimization with an L2 weight decay of 0.01, β1 = 0.9, and β2 = 0.999. The dropout probability is always kept at 0.1. Some hyperparameters were set to different values during training and were chosen according to their impact on performance. The initial learning rate and batch size are selected from [1e-3, 2e-5, 1e-7] and [16, 32, 64], respectively. In the tag-word topic model, we set α = 0.8, β = 0.01, and γ = 0.01. The number of topics varies from 20 to 90. In addition, since the randomness of the parameter initialization leads to different results each time, we averaged the results over 10 runs.

Evaluation Metrics
The evaluation criteria measure how well the system ranks the correct pair (q, answerer_i) against the other random candidates for the same question (q, random-non-answerer_j). Therefore, we adapt several standard metrics for expert finding and question routing to evaluate the performance as follows.
1 Precision at N (P@N): The precision at N reports the percentage of predicted positive users/experts observed in the top N retrieved results. In other words, it is the ratio of the number of positive users to the total number of candidates up to N. For example, Precision@1 (P@1) computes the percentage of times the system ranks a correct answerer as the top item. More specifically, if our model returns 10 users for a given question and the relevant users are ranked at 1, 2, 4, 6, and 9, then P@5 is 3/5 and P@10 is 5/10 in this case.

2 Mean Reciprocal Rank (MRR): The MRR computes the inverse of the rank of the correct answerer among the candidates, averaged over all queries. Alternatively, we can describe it as reflecting the average ranking of the correct answerer's first appearance for a given test set question. For a given query set Q, we use the following formula to compute the MRR:

MRR = (1/N) Σ_{j=1}^{N} 1/rank_j
where N is the number of queries and rank j is the position of the correct answerer.
3 Mean Average Precision (MAP): The MAP shows the overall retrieval quality score, which is the arithmetic mean of the average precision score for each test set question.
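The three metrics above can be sketched over ranked lists of binary relevance flags; the worked P@N example from the text (relevant users at ranks 1, 2, 4, 6, and 9) is reused as the demo input.

```python
def precision_at_n(ranked_rel, n):
    """ranked_rel: list of 0/1 relevance flags in ranked order."""
    return sum(ranked_rel[:n]) / n

def mrr(all_ranked):
    """Mean reciprocal rank of the first relevant candidate per query."""
    total = 0.0
    for rel in all_ranked:
        rank = next((i + 1 for i, r in enumerate(rel) if r), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(all_ranked)

def average_precision(rel):
    """Average of precision values at each relevant position."""
    hits, s = 0, 0.0
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            s += hits / i
    return s / hits if hits else 0.0

def mean_ap(all_ranked):
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)

# Relevant users at ranks 1, 2, 4, 6, and 9 out of 10 returned users.
rel = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_n(rel, 5), precision_at_n(rel, 10))  # 0.6 0.5
```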

Experiment Results and Analysis
This section presents the effectiveness of our proposed models on question routing tasks, compared with three different types of models over our dataset.

Performance Analysis of Our Proposed Model Compared to Baseline Models
The results are summarized in Tables 3-5. We can see that all BERT-based models perform much better than the existing traditional retrieval models and the recently proposed neural network-based models. In detail, several main observations can be drawn from these tables.
1 Topic-based models exhibit much better performance than the traditional information retrieval approaches. This finding suggests that semantic understanding is important in the question routing task of text-based analysis. Approaches that rely on lexical matching without any text semantics have significant limitations. Moreover, MLQR consistently performs better than the LDA model.

4 From Tables 3-5, we can observe that the D20 set achieves better results than the D15 set, and the D15 set achieves better results than the D10 set. This indicates that fewer negative samples can lead to better results in our dataset.
5 In addition, we note that the absolute values of model performance are relatively low in all three tables. We summarize the main reasons for this as follows.

First, CQA faces a serious data sparsity problem, which leads to insufficient text for question modeling and user modeling. Not only are the text lengths of the questions and answers short, but we can also see from Table 1 that the number of questions is much larger than the number of answerers. With an average of only a few user comments per question and a very low average number of answers posted per user, the reality is that most users are not active in CQA. Second, we constructed our dataset by including the 21 most frequent tags, rather than including only a few tags. This makes our dataset more generalizable, but also more diverse in terms of the topics for which information is searched. As a result, finding the right expert to answer a specific question can be very challenging. Third, Stack Overflow is a vertical community Q&A with a complex composition of data, including code snippets, tables, domain-specific terms, and a few other discrete pieces of text. All of these factors contribute to the low absolute values of the performance figures.


Analysis of Representation-based Methods and Interaction-based Methods
As mentioned before, our two proposed BERT-based models, QR-BERTrep and QR-tBERTint, are both very effective on our dataset compared with the three types of baselines. In this section, we analyze their differences in more depth; the performance comparison is shown in Figure 3.
As indicated in Figure 3, the topic-enhanced interaction-based model QR-tBERTint performs much better than the representation-based models QR-DSSM and QR-BERTrep. In addition, QR-tBERTint significantly exceeds QR-BERTrep, with maximum increases of 31.17% in P@5 and 44.54% in P@10, respectively. The main reasons can be summarized as follows. First, QR-tBERTint takes into account the interaction between sequences by connecting questions and user profiles in pairs as input, so that the hierarchical relationship between questions and profiles can be learned as an essential feature of matching. In contrast, QR-BERTrep encodes the question sequence and profile sequence separately, so that the interaction between the two sequences is deferred to the end of the matching process, risking the loss of details important for matching. Second, QR-tBERTint takes the fine-tuning strategy to learn cross-attention between terms by directly using the Transformer layers in BERT. In contrast, QR-BERTrep only uses the pre-trained network to construct sequence representations. Third, the tag-word topic model can provide corpus-level information to enhance the understanding of the semantic relevance of the text.

Analysis of Tag-word Topic Representation Module
From Tables 3-5, we can see that combining tag-word topics consistently improves the question routing performance across all metrics for all datasets. Specifically, without the tag-word topic model, the performance of QR-BERTint decreases by 11.41% and 10.33% in MRR and MAP, respectively, compared to QR-tBERTint. The main reason, we surmise, is that Stack Overflow is a programming-specific Q&A community, where the ability to detect domain-specific terms is crucial for text semantic understanding and matching. However, the pre-training of BERT is based on general domain knowledge and is likely to fail to learn domain-specific words related to programming. Here, the tag-word topic model can serve as an additional source of dataset-specific information. Our findings are consistent with much previous work that also confirms the effectiveness of incorporating topic models when dealing with semantics-related tasks in specific knowledge domains, such as sentiment analysis in microblogs [22] and machine translation [5].


Contextualized Embedding vs. Traditional Word Embedding
In this section, the proposed QR-BERTrep model is compared with the baselines in terms of word embedding representation, which is a crucial part that affects the performance of representation-based models. To explore the respective effects of contextualized embedding and word embedding, we conducted an ablation experiment called QR-BERTrep(WG), in which the GloVe embedding was removed. The performance comparison of different representation-based models is shown in Figure 4.
It can be seen that the methods using distributed word representations perform much better than the methods that represent the words in a sentence as a "bag of words". Therefore, TF-IDF has the lowest MRR and MAP.
In addition, we can see that neural rankers such as QR-DSSM and the CNN-based model are greatly facilitated by using pre-trained word embeddings (e.g., Word2Vec or GloVe) for sequence representation. In QR-BERTrep, where we concatenate the GloVe embedding and the contextualized embedding together, the performance is dramatically boosted, almost doubling that of QR-DSSM. This result is consistent with previous observations in [19, 16], indicating that using contextualized language term embeddings for text understanding and matching is very effective.

Conclusions
In this paper, we explore two different ways to address the question routing task for CQA based on a pre-trained contextual language model. QR-tBERTint is an interaction-based model that takes question-profile pairs as input and fine-tunes BERT to capture the relationship between sequence pairs. In addition, a tag-word topic model is incorporated as an additional source of dataset-specific information. QR-BERTrep is a representation-based model that combines contextualized embeddings with traditional static word embeddings to enhance the representation for semantic understanding and matching.
Experimental results on real-world data demonstrate that both of our proposed models greatly exceed state-of-the-art baselines. The best result indicates that a question is likely to be answered if it is routed to the top 2 candidates. QR-BERTrep exceeds all representation-based baselines discussed in this paper, showing that contextualized word embeddings can carry richer semantic information to enhance the representation in our task. Meanwhile, QR-tBERTint performs much better than QR-BERTrep, which indicates that the question routing task benefits from sequence relationship learning and corpus-level topical semantic information.
Although we have made some progress in this work, in future work we would like to introduce more QA features (e.g., reputation and the willingness of experts) or non-QA features (e.g., the number of followers and connected accounts on social networking sites) to enhance the performance. Moreover, taking advantage of knowledge graphs to improve the effectiveness of question routing is an interesting direction for future work.

Figure 1 The overall framework of the topic-enhanced interaction-based model QR-tBERTint

Figure 4
Figure 4 The performance comparison of different representation-based models

Given a newly posted question q, let U = {u_1, u_2, …, u_k} be a candidate set (k candidates), and let P = {p_1, p_2, …, p_k} be the set of candidate profiles. We need to rank the users in U and return the highly ranked users, who are most suitable to answer the question q. The core part of this task is learning the match patterns and capturing the relationships between the question q and the profile of each candidate u_i ∈ U. Specifically, in our work, we need to estimate a score s_i of how suitable the user's profile p_i is for a newly posted question q, or we need to calculate a probability P of how likely profile p_i is given the question q.
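The scoring-and-ranking step described in this formulation can be sketched generically: given any matcher that scores a (question, profile) pair, rank the candidates and return the top-n user ids. The toy word-overlap scorer below is our own placeholder, not the paper's matching model.

```python
def route_question(score_fn, question, profiles, top_n=2):
    """Rank candidate profiles by match score and return the top-n
    user ids (sketch of the routing step; score_fn is any matcher)."""
    scored = [(uid, score_fn(question, prof)) for uid, prof in profiles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [uid for uid, _ in scored[:top_n]]

# Toy scorer: word overlap between question and profile
overlap = lambda q, p: len(set(q.split()) & set(p.split()))
profiles = {"u1": "python pandas", "u2": "java spring", "u3": "python flask web"}
top = route_question(overlap, "python web question", profiles)
```

In the paper's models, score_fn would be the output of QR-BERTrep or QR-tBERTint for each question-profile pair.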

Table 1
The summary of three datasets

We obtained 54,218 positive training pairs for D10, 36,238 positive training pairs for D15, and 26,354 positive training pairs for D20. To train efficiently and reduce the training scale, we randomly select three non-answerers based on the NCE sampling strategy to construct the negative samples. The definitions of negative and positive examples are listed in Table 2.
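The negative-sample construction described above can be sketched as follows. This is a simplified stand-in for the NCE-style sampling used in the paper: each actual answerer yields one positive pair, and three randomly drawn non-answerers yield negatives. The function name and the uniform sampling are our assumptions.

```python
import random

def build_training_pairs(question_id, answerers, all_users, n_neg=3, seed=0):
    """For one question, emit (question, user, label) triples: each actual
    answerer is a positive example (label 1), plus n_neg randomly sampled
    non-answerers per positive as negatives (label 0)."""
    rng = random.Random(seed)
    non_answerers = [u for u in all_users if u not in answerers]
    pairs = []
    for user in answerers:
        pairs.append((question_id, user, 1))
        for neg in rng.sample(non_answerers, n_neg):
            pairs.append((question_id, neg, 0))
    return pairs

pairs = build_training_pairs("q1", {"u1"}, [f"u{i}" for i in range(10)])
```

With one answerer per question, this yields four training pairs per question, matching the 1:3 positive-to-negative ratio stated above.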

Table 2
Negative and positive examples for training

The initial learning rate and batch size are selected from [1e-3, 2e-5, 1e-7] and [16, 32, …], respectively. In the tag-word topic model, we set α=0.8, β=0.01, and γ=0.01. The number of topics varies. In addition, since the randomness of the parameter initialization leads to different results each time, we average the results over 10 runs.

Table 3
Comparison of different methods for question routing (X=10)

Table 4
Comparison of different methods for question routing (X=15)

Table 5
Comparison of different methods for question routing (X=20)