GRAM: Graph-based Attention Model. A better learning representation?

The paper I review today will be featured at next year's ICLR 2017, the 5th International Conference on Learning Representations. It appears to be an important contribution to representation learning in the context of machine and deep learning systems and methods. The proper representation of data is crucial, and increasingly so, to the successful implementation of deep learning systems. Representation learning, which is developing rapidly, deals with how such systems can best learn meaningful and useful representations of data.

Healthcare is one area of major economic and social importance where deep learning systems have been deployed, but experienced professionals have been challenged by some of their shortcomings. The paper I am reviewing here analyses precisely these issues and then proposes a novel method to address them. The issues mostly concern the insufficiency of data for good predictive modeling in healthcare settings, and the question of which data representation best aligns with medical knowledge while allowing the most faithful interpretation of the information the data provides. These points are made explicitly clear by the abstract:

GRAM: GRAPH-BASED ATTENTION MODEL FOR HEALTHCARE REPRESENTATION LEARNING

ABSTRACT

Deep learning methods exhibit promising performance for predictive modeling in healthcare, but two important challenges remain:

• Data insufficiency: Often in healthcare predictive modeling, the sample size is insufficient for deep learning methods to achieve satisfactory results.

• Interpretation: The representations learned by deep learning models should align with medical knowledge.

To address these challenges, we propose a GRaph-based Attention Model, GRAM, that supplements electronic health records (EHR) with hierarchical information inherent to medical ontologies. Based on the data volume and the ontology structure, GRAM represents a medical concept as a combination of its ancestors in the ontology via an attention mechanism.

We compared predictive performance (i.e. accuracy, data needs, interpretability) of GRAM to various methods including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure prediction task. Compared to the basic RNN, GRAM achieved 10% higher accuracy for predicting diseases rarely observed in the training data and 3% improved area under the ROC curve for predicting heart failure using an order of magnitude less training data. Additionally, unlike other methods, the medical concept representations learned by GRAM are well aligned with the medical ontology. Finally, GRAM exhibits intuitive attention behaviors by adaptively generalizing to higher level concepts when facing data insufficiency at the lower level concepts.

The first striking impression after reading the abstract is that the method proposed by the authors, from the Georgia Institute of Technology and the healthcare provider Sutter Health, delivers a significant improvement in predictive accuracy while at the same time aligning the medical concept representations it learns more closely with the overall medical ontology. GRAM, standing for graph-based attention model, works alongside electronic health records (EHR) to provide a representation of the hierarchical information inherent in medical ontologies. The method was specifically compared with earlier approaches such as recurrent neural networks (RNNs) and yielded improved results.

Introduction

Prediction in a highly specialized field such as healthcare is interesting from a scientific and technological perspective for several reasons. Over the years the problem was approached with a mindset in which the input to a predictive model should primarily consist of expert-engineered features, and techniques such as logistic regression or the multilayer perceptron (MLP) were the default choice. Recently, however, advanced deep learning methods and systems have become an increasingly valid or even better alternative, and the main reason for this development has been the rapid growth in the volume, diversity and complexity of healthcare data from electronic health records and other sources. Some of these new techniques are nicely listed in the very first paragraph of the paper:

The rapid growth in volume and diversity of health care data from electronic health records (EHR) and other sources is motivating the use of predictive modeling to improve care for individual patients. In particular, novel applications are emerging that use deep learning methods such as word embedding (Choi et al., 2016c;d), recurrent neural networks (RNN) (Che et al., 2016; Choi et al., 2016a;b; Lipton et al., 2016), convolutional neural networks (CNN) (Nguyen et al., 2016) or stacked denoising autoencoders (SDA) (Che et al., 2015; Miotto et al., 2016), demonstrating significant performance enhancement for diverse prediction tasks. Deep learning models appear to perform significantly better than logistic regression or multilayer perceptron (MLP) models that depend, to some degree, on expert feature construction (Lipton et al., 2015; Razavian et al., 2016).

But this is where the challenge begins. Deep learning models need large amounts of data, typically well beyond what a single health organization or system can provide, in order for the results to be satisfactory. This comes from the combinatorial nature of their data requirement: the models have to assess an exponential number of combinations of input features. The hierarchical organization of medical ontologies helps alleviate this pressure, with the concepts from those records treated as if they were a massive network:

The data requirement of deep learning models comes from having to assess exponential number of combinations of input features. This can be alleviated by exploiting medical ontologies that encodes hierarchical clinical constructs and relationships among medical concepts. Fortunately, there are many well-organized ontologies in healthcare such as the International Classification of Diseases (ICD), Clinical Classifications Software (CCS) (Stearns et al., 2001) or Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) (Project et al., 2010). Nodes (i.e. medical concepts) close to one another in medical ontologies are likely to be associated with similar patients, allowing us to transfer knowledge among them. Therefore, proper use of medical ontologies will be helpful when we lack enough data for the nodes in the ontology to train deep learning models.

And if a system like this is analogous to a massive network of nodes, it lends itself to being modelled well by a neural network. What is innovative in this paper is the use of a neural attention mechanism that performs prediction by adaptively combining each medical concept with its ancestor concepts in the ontology:

In this work, we propose GRAM, a method that infuses information from medical ontologies into deep learning models via neural attention. Considering the frequency of a medical concept in the EHR data and its ancestors in the ontology, GRAM decides the representation of the medical concept by adaptively combining its ancestors via attention mechanism. This will not only support deep learning models to learn robust representations without large amount of data, but also learn interpretable representations that align well with the knowledge from the ontology. The attention mechanism is trained in an end-to-end fashion with the neural network model that predicts the onset of disease(s). We also propose an effective initialization technique in addition to the ontological knowledge to better guide the representation learning process.

The attention mechanism itself learns the medical records well enough to provide scalable, intuitive representations of the data in the ontologies, grouping similar medical concepts close to one another. It even learns to deploy its attention resources selectively and effectively, depending on data availability and the ontology structure:

We compared predictive performance (i.e. accuracy, data needs, interpretability) of GRAM to various models including the recurrent neural network (RNN) in two sequential diagnoses prediction tasks and one heart failure (HF) prediction task. We demonstrate that GRAM is up to 10% more accurate than the basic RNN for predicting diseases less observed in the training data. After discussing GRAM’s scalability, we visualize the representations learned from various models where GRAM provides more intuitive representations by grouping similar medical concepts close to one another. Finally, we show GRAM’s attention mechanism can be interpreted to understand how it assigns the right amount of attention to the ancestors of each medical concept by considering the data availability and the ontology structure.

Methodology

In the following I will briefly present the methodology and the description of GRAM. I will not give away too much in this review, as this will hopefully encourage the reader to read the full paper for him/herself. This is, as is almost always the norm in the information age, a paper worth delving into in full detail.

The electronic health records (EHR) mentioned earlier are essentially sequences of medical codes, and they are treated as such conceptually by the authors. Each visit in the dataset is vectorized as a binary vector over a vocabulary of size |C|, and the codes themselves are organized by a medical ontology:

We assume that a given medical ontology G typically expresses the hierarchy of various medical concepts in the form of a parent-child relationship, where the medical codes C form the leaf nodes. Ontology G is represented as a directed acyclic graph (DAG) whose nodes form a set D = C + C′. Here C′ = {c_{|C|+1}, c_{|C|+2}, . . . , c_{|C|+|C′|}} defines the set of all non-leaf nodes (i.e. ancestors of the leaf nodes), where |C′| represents the number of all non-leaf nodes. We use knowledge DAG to refer to G. A parent in the knowledge DAG G represents a related but more general concept over its children. Therefore, G provides a multi-resolution view of medical concepts with different degrees of specificity. While some ontologies are exclusively expressed as parent-child hierarchies (e.g. ICD-9, CCS), others are not. For example, in some instances SNOMED-CT also links medical concepts via causal or treatment relationships, but the majority of relationships in SNOMED-CT are still parent-child. Therefore, we focus on the parent-child relationships in this work.

It is worth noting here how the medical ontology is represented as a directed acyclic graph (DAG). These mathematical and computer science objects play an important role in machine and statistical learning algorithms, thanks to their properties. For instance, they are used to represent collections of events and their influence on one another in probabilistic settings such as Bayesian networks, or as a record of historical data in family trees and distributed revision control systems. The implementation here is clearly of the family-tree kind, with parents standing for more general concepts than their children.
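As a minimal sketch of this structure (the toy codes, the parent map and the helper function below are my own illustration, not taken from the paper or its code), the parent-child hierarchy can be stored as a dictionary, walked to collect the ancestors of any leaf code, and the leaf codes of a visit encoded as the binary vector mentioned above:

# Toy knowledge DAG: child -> parent; leaf nodes are EHR codes, upper nodes are ontology categories.
parent = {
    "dx_428.0": "heart_failure",          # hypothetical leaf code
    "dx_428.1": "heart_failure",
    "heart_failure": "circulatory_diseases",
    "circulatory_diseases": "root",
}

def ancestors(code):
    """Return the code itself followed by all of its ancestors up to the root."""
    chain = [code]
    while code in parent:
        code = parent[code]
        chain.append(code)
    return chain

# Binary (multi-hot) encoding of one visit over the leaf vocabulary C, here with |C| = 3.
leaf_vocab = ["dx_428.0", "dx_428.1", "dx_401.9"]
visit = {"dx_428.0", "dx_401.9"}
x = [1 if c in visit else 0 for c in leaf_vocab]   # -> [1, 0, 1]

print(ancestors("dx_428.0"))   # ['dx_428.0', 'heart_failure', 'circulatory_diseases', 'root']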

[Figure 1 of the paper: an illustration of GRAM, showing how a medical code's basic embedding is combined with the embeddings of its ancestors in the knowledge DAG via the attention mechanism, and how the resulting representation feeds the end-to-end trained predictive neural network.]

GRAM leverages the parent-child relationship of G to learn robust representations when data volume is constrained. GRAM balances the use of ontology information in relation to data volume in determining the level of specificity for a medical concept. When a medical concept is less observed in the data, more weight is given to its ancestors as they can be learned more accurately and offer general (coarse-grained) information about their children. The process of resorting to the parent concepts can be automated via the attention mechanism and the end-to-end training as described in Figure 1.

(…)

In the knowledge DAG, each node c_i is assigned a basic embedding vector e_i ∈ R^m, where m represents the dimensionality. Then e_1, . . . , e_{|C|} are the basic embeddings of the codes c_1, . . . , c_{|C|}, while e_{|C|+1}, . . . , e_{|C|+|C′|} represent the basic embeddings of the internal nodes c_{|C|+1}, . . . , c_{|C|+|C′|}. The initialization of these basic embeddings is described in Section 2.4. We formulate a leaf node's final representation as a convex combination of the basic embeddings of itself and its ancestors:

$$\mathbf{g}_i = \sum_{j \in \mathcal{A}(i)} \alpha_{ij}\,\mathbf{e}_j, \qquad \sum_{j \in \mathcal{A}(i)} \alpha_{ij} = 1, \quad \alpha_{ij} \geq 0, \tag{1}$$

where $\mathcal{A}(i)$ denotes the indices of the code $c_i$ and its ancestors.
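To make Eq. (1) concrete, here is a small worked example (the weights are illustrative numbers of my own, not taken from the paper). Suppose the leaf code $c_i$ is rarely observed and has a parent and a grandparent in the ontology; the attention mechanism might then assign

$$\mathbf{g}_i = 0.2\,\mathbf{e}_i + 0.5\,\mathbf{e}_{\text{parent}} + 0.3\,\mathbf{e}_{\text{grandparent}},$$

shifting most of the weight onto the ancestors, whereas for a frequently observed code the weights would concentrate on $\mathbf{e}_i$ itself.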

These figures and formulas are somewhat involved mathematically, a challenge to our brains, but worth it. Now look at how the attention weight is obtained with a softmax function, whose input is computed by a multilayer perceptron (MLP) with a hyperbolic tangent activation:

The attention weight α_{ij} in Eq. (1) is calculated by the following softmax function,

$$\alpha_{ij} = \frac{\exp\big(f(\mathbf{e}_i, \mathbf{e}_j)\big)}{\sum_{k \in \mathcal{A}(i)} \exp\big(f(\mathbf{e}_i, \mathbf{e}_k)\big)} \tag{2}$$

f(e_i, e_j) is a scalar value representing the compatibility between the basic embeddings e_i and e_j. We compute f(e_i, e_j) via the following feed-forward network with a single hidden layer (MLP),

$$f(\mathbf{e}_i, \mathbf{e}_j) = \mathbf{u}_a^{\top} \tanh\!\left(\mathbf{W}_a \begin{bmatrix} \mathbf{e}_i \\ \mathbf{e}_j \end{bmatrix} + \mathbf{b}\right) \tag{3}$$

where W_a ∈ R^{l×2m} is the weight matrix for the concatenation of e_i and e_j, b ∈ R^l is the bias vector, and u_a ∈ R^l is the weight vector for generating the scalar value. The constant l represents the dimension of the hidden layer of f(·, ·). Note that we always concatenate e_i and e_j in the child-ancestor order.
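As a minimal sketch of Eqs. (1)–(3) put together (plain NumPy, with my own variable names and randomly initialized toy parameters, not the authors' released implementation), the attention step for a single leaf code and its two ancestors could look roughly like this:

import numpy as np

m, l = 4, 3                        # embedding size and attention hidden size (toy values)
rng = np.random.default_rng(0)

# basic embeddings of the leaf code and its ancestors, stacked: row 0 is e_i itself
E_anc = rng.normal(size=(3, m))    # e_i, e_parent, e_grandparent

# attention MLP parameters (Eq. 3)
W_a = rng.normal(size=(l, 2 * m))
b = rng.normal(size=(l,))
u_a = rng.normal(size=(l,))

def compat(e_i, e_j):
    """Scalar compatibility f(e_i, e_j), concatenated in child-ancestor order (Eq. 3)."""
    return u_a @ np.tanh(W_a @ np.concatenate([e_i, e_j]) + b)

# softmax over the code itself and its ancestors (Eq. 2)
scores = np.array([compat(E_anc[0], e_j) for e_j in E_anc])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# final representation g_i as a convex combination of the basic embeddings (Eq. 1)
g_i = alpha @ E_anc
print(alpha, g_i)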

As is widely acknowledged in the community of machine and deep learning practitioners and developers, one of the recurring problems is how to initialize an algorithm or framework when some of the entities it must represent are rarely or never observed in the data. Here the authors were confronted with exactly this issue, and they dealt with it in the following way:

The attention generation mechanism in Section 2.2 requires basic embeddings ei of each node in the knowledge DAG. The basic embeddings of ancestors, however, pose a difficulty because they are often not observed in the data. To better initialize them, we use co-occurrence information to learn the basic embeddings of medical codes and their ancestors. Co-occurrence has proven to be an important source of information when learning representations of words or medical concepts (Mikolov & Dean, 2013; Choi et al., 2016c;d). To train the basic embeddings, we employ GloVe (Pennington et al., 2014), which uses the global co-occurrence matrix of words to learn their representations. In our case, the co-occurrence matrix of the codes and the ancestors was generated by counting the co-occurrences within each visit Vt, where we augment each visit with the ancestors of the codes in the visit. Details of training the basic embeddings are described in the Appendix A.

The reader may have noticed that the setup above not only initializes the GRAM framework but also augments each visit with the ancestors of its codes before counting co-occurrences. This, in the end, should help the deep learning system maximize its performance.
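A rough, self-contained sketch of that co-occurrence counting (again my own simplification with toy codes; the actual GloVe fitting on the resulting matrix is omitted) could look like this:

from collections import Counter
from itertools import combinations

# toy child -> parent map and ancestor lookup, as in the earlier snippet
parent = {"dx_428.0": "heart_failure", "dx_401.9": "hypertension",
          "heart_failure": "circulatory_diseases", "hypertension": "circulatory_diseases"}

def ancestors(code):
    chain = [code]
    while code in parent:
        code = parent[code]
        chain.append(code)
    return chain

visits = [["dx_428.0", "dx_401.9"], ["dx_428.0"], ["dx_401.9"]]

cooc = Counter()
for visit in visits:
    # augment the visit with the ancestors of its codes, as described in the quote above
    augmented = set()
    for code in visit:
        augmented.update(ancestors(code))
    # count co-occurrences of every unordered pair within the augmented visit
    for a, b in combinations(sorted(augmented), 2):
        cooc[(a, b)] += 1

print(cooc.most_common(3))
# the resulting co-occurrence matrix would then be fed to GloVe to initialize the basic embeddings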

Conclusion

As remarked earlier, I encourage every reader to follow through the rest of the paper, where the experiments are described in full detail, a link to the GitHub repository of the framework is provided (a common practice in our times of open-source software: not necessarily easy software, but being open is better than being closed…), and everything else there is to know can be found in the references and appendices. It is well worth the full read, as the paper provides compelling, well-captioned graphs and tables. For now I will leave you with the authors' own conclusion, in their last paragraph:

Data insufficiency, either due to less common diseases or small datasets, is one of the key hurdles in healthcare analytics, especially when we apply deep neural networks models. To overcome this hurdle, we leverage the knowledge DAG, which provides a multi-resolution view of medical concepts. We propose GRAM, a graph-based attention model using both a knowledge DAG and EHR to learn an accurate and interpretable representations for medical concepts. GRAM chooses a weighted average of ancestors of a medical concept and train the entire process with a predictive model in an end-to-end fashion. We conducted three predictive modeling experiments on real EHR datasets and showed significant improvement in the prediction performance, especially on low-frequency diseases and small datasets. Analysis of the attention behavior provided intuitive insight of GRAM.

Featured image: Wikipedia page on Directed acyclic graph

 
