This post discusses how we use BERT and similar self-attention architectures to address various text crunching tasks at Ether Labs.

Self-attention architectures have caught the attention of NLP practitioners in recent years. They were first proposed in Vaswani et al., where the authors used a multi-headed self-attention architecture for machine translation tasks.

Figure: BERT’s attention for the word ‘It’ in one of the layers

Multi-headed attention enhances the network by giving the attention layer multiple representation subspaces. The weights of each head are randomly initialised and, after training, each set of weights is used to project the input embeddings into a different representation subspace.

BERT Architecture Overview

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al.
  • Each word in BERT gets n_layers * num_heads attention vectors, each of size attn.vector, that capture the representation of the word in its current context
  • For example, in BERT base: n_layers = 12, num_heads = 12, attn.vector = dim(64)
  • In this case, we have 12 x 12 = 144 representation subspaces, each of dimension 64, for each word to leverage
  • This leaves us with both a challenge and an opportunity: leveraging representations far richer than those of the LM architectures proposed earlier
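To make the counts concrete, here is the arithmetic for the BERT base figures quoted above. The variable names are ours and purely illustrative; head_dim stands for attn.vector.

```python
# Figures for BERT base, as quoted in the list above.
n_layers = 12   # Transformer encoder layers
num_heads = 12  # attention heads per layer
head_dim = 64   # size of each head's vector (attn.vector)

# One subspace per (layer, head) combination.
subspaces_per_word = n_layers * num_heads

# Total number of values available per word across all layers and heads.
values_per_word = subspaces_per_word * head_dim

# Concatenating the heads of a single layer gives that layer's hidden width.
hidden_size = num_heads * head_dim

print(subspaces_per_word, values_per_word, hidden_size)
```
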

Sentence relatedness with BERT

BERT representations can be a double-edged sword given their richness. In our experiments with BERT, we have observed that they can often be misleading when compared with conventional similarity metrics like cosine similarity. For example, consider the pair-wise cosine similarities in the case below (from a BERT model fine-tuned on HR-related discussions):

text1: Performance appraisals are both one of the most crucial parts of a successful business, and one of the most ignored.

text2: On the other, actual HR and business team leaders sometimes have a lackadaisical “I just do it because I have to” attitude.

text3: If your organization still sees employee appraisals as a concept they need to showcase just so they can “fit in” with other companies who do the same thing, change is the order of the day. How can you do that in a way that everyone likes?

text1<>text2: 0.613270938396454

text1<>text3: 0.634544332325459

text2<>text3: 0.772294402122498

A metric that ranks text1<>text3 higher than any other pair would be desirable. How do we get there?
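For reference, the pair-wise cosine similarity used as the baseline can be sketched as follows. This is a minimal, dependency-free version that assumes each text has already been reduced to a single feature vector; how that vector is pooled from BERT is not shown here, and the function names are our own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pairwise_similarities(features):
    """features: dict mapping a text name to its vector.

    Returns a dict mapping each unordered pair of names to its score.
    """
    names = sorted(features)
    return {
        (a, b): cosine_similarity(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```
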

Out of the box, BERT is pre-trained using two unsupervised tasks: Masked LM and Next Sentence Prediction (NSP).

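Masked LM is a variant of the conventional language model training setup (the next word prediction task): instead of predicting the next word, the model predicts randomly masked tokens from their surrounding context. For more details, please refer to section 3.1 in the original paper.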

The Next Sentence Prediction (NSP) task is an approach proposed by the authors to capture the relationship between sentences, beyond similarity.

For the above text pair relatedness challenge, NSP seems to be an obvious fit. To extend its abilities beyond a single sentence, we have formulated a new training task.

From NSP to Context window

In a context window setup, we label each pair of sentences occurring within a window of n sentences as 1, and as 0 otherwise. For example, consider the following paragraph:

As a manager, it is important to develop several soft skills to keep your team charged. Invest time outside of work in developing effective communication skills and time management skills. Skills like these make it easier for your team to understand what you expect of them in a precise manner. Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Encourage them to give you feedback and ask any questions as well. Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems.

For context window n=3, we generate the following training examples:

Invest time outside of work in developing effective communication skills and time management skills. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1

As a manager, it is important to develop several soft skills to keep your team charged. <SEP> Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. Label: 0

Effective communications can help you identify issues and nip them in the bud before they escalate into bigger problems. <SEP> Check in with your team members regularly to address any issues and to give feedback about their work to make it easier to do their job better. Label: 1
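The labelling rule can be sketched in a few lines. We assume here that "within a window of n sentences" means both sentences fit inside some run of n consecutive sentences (index distance below n), which is consistent with the three examples above. The sentence splitter is a deliberately naive stand-in, and the function names are hypothetical. The sketch always emits pairs in document order, whereas the examples above also include reversed pairs.

```python
import re
from itertools import combinations


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def context_window_pairs(sentences, n):
    """Return (sentence_a, sentence_b, label) for every pair of sentences.

    Assumption: label is 1 when the two sentences fit in a window of n
    consecutive sentences (index distance < n), and 0 otherwise.
    """
    examples = []
    for i, j in combinations(range(len(sentences)), 2):
        label = 1 if j - i < n else 0
        examples.append((sentences[i], sentences[j], label))
    return examples


if __name__ == "__main__":
    demo = "A one. B two. C three. D four. E five. F six."
    for first, second, label in context_window_pairs(split_sentences(demo), n=3):
        print(f"{first} <SEP> {second} Label: {label}")
```

In practice the negative pairs heavily outnumber the positive ones for long documents, so some form of sampling would likely be needed; the source does not say how this was handled.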

This training paradigm enables the model to learn the relationship between sentences beyond pair-wise proximity. After fine-tuning BERT on HR data with the context window task, we got the following pair-wise relatedness scores:

text1<>text2: 0.1215614

text1<>text3: 0.899943

text2<>text3: 0.480266

This captures sentence relatedness beyond similarity. In practice, we use a weighted combination of cosine similarity and context window score to measure the relationship between two sentences.
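A minimal sketch of such a weighted combination is shown below. The post does not state the weights actually used, so alpha (the weight on the context window score) and its default of 0.7 are assumptions for illustration only. The scores are the ones reported above, rounded to four decimals.

```python
def combined_relatedness(cosine_score, context_score, alpha=0.7):
    """Convex combination of the two scores.

    alpha is a hypothetical weight on the context window score;
    (1 - alpha) goes to cosine similarity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * context_score + (1.0 - alpha) * cosine_score


# Scores reported in this post, rounded to four decimals.
cosine = {"text1<>text2": 0.6133, "text1<>text3": 0.6345, "text2<>text3": 0.7723}
context = {"text1<>text2": 0.1216, "text1<>text3": 0.8999, "text2<>text3": 0.4803}

combined = {pair: combined_relatedness(cosine[pair], context[pair]) for pair in cosine}
best_pair = max(combined, key=combined.get)
print(best_pair)
```

With this assumed weight, text1<>text3 ranks highest, which is the ordering asked for earlier.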

Document Embeddings

Generating feature representations for large documents (for retrieval tasks) has always been a challenge for the NLP community. Approaches like concatenating sentence representations yield document representations that are impractical for downstream tasks, and averaging or other aggregation approaches (like p-means word embeddings) fail beyond a certain document length. We have explored several ways to address these problems and found the following approaches to be effective:

BERT+RNN Encoder

We have set up a supervised task to encode document representations, taking inspiration from RNN/LSTM-based sequence prediction tasks.

[step-1] extract BERT features for each sentence in the document

[step-2] train RNN/LSTM encoder to predict the next sentence feature vector in each time step

[step-3] use final hidden state of the RNN/LSTM as the encoded representation of the document

This approach works well for smaller documents but is not effective for larger ones due to the limitations of RNN/LSTM architectures.

Distributed Document Representations

Generating a single feature vector for an entire document fails to capture the whole essence of the document, even when using BERT-like architectures. We have reformulated the problem of document embedding as identifying the candidate text segments within the document which, in combination, capture the maximum information content of the document. We use the following approaches to get the distributed representations: feature clustering and feature graph partitioning.

Feature clustering

[step-1] split the candidate document into text chunks

[step-2] extract BERT features for each text chunk

[step-3] run the k-means clustering algorithm on the chunks of the candidate document until convergence, with the relatedness score (discussed in the previous section) as the similarity metric

[step-4] use the text segments closest to each centroid as the document embedding candidates

A general rule of thumb is to have a large chunk size and a small number of clusters. In practice, these values can be fixed for a specific problem type.
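The clustering steps can be sketched as follows. Plain k-means needs a vector space in which to average points into centroids, while the relatedness score is only defined between pairs of chunks. This sketch therefore swaps in k-medoids, a close relative in which every cluster centre is itself a chunk, so that "the segment closest to each centroid" is simply the medoid. It assumes a precomputed, symmetric similarity matrix (higher means more related). The function names and the initialisation scheme are our own choices, not the production implementation.

```python
def assign_to_medoids(similarity, medoids):
    """Map every chunk index to its most related medoid (medoids map to themselves)."""
    return [
        i if i in medoids else max(medoids, key=lambda m: similarity[i][m])
        for i in range(len(similarity))
    ]


def k_medoids(similarity, k, max_iter=100):
    """Cluster chunks from a symmetric similarity matrix (list of lists).

    Returns (medoids, assignment), where assignment[i] is the medoid of chunk i.
    """
    n = len(similarity)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of chunks")

    # Deterministic initialisation: start from the chunk most related to all
    # others, then repeatedly add the chunk least related to current medoids.
    medoids = [max(range(n), key=lambda i: sum(similarity[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(
            min(candidates, key=lambda i: max(similarity[i][m] for m in medoids))
        )

    for _ in range(max_iter):
        assignment = assign_to_medoids(similarity, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assignment[i] == m]
            # New medoid: the member most related to the rest of its cluster.
            new_medoids.append(
                max(members, key=lambda c: sum(similarity[c][j] for j in members))
            )
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids

    return medoids, assign_to_medoids(similarity, medoids)
```

The returned medoid indices point back at the text chunks, which then serve as the document embedding candidates.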

Feature Graph Partitioning

[step-1] split the candidate document into text chunks

[step-2] extract BERT features for each text chunk

[step-3] build a graph with text chunks as nodes and the relatedness score between nodes as edge weights

[step-4] run a community detection algorithm (e.g. the Louvain algorithm) to extract community subgraphs

[step-5] use graph metrics like node/edge centrality or PageRank to identify the most influential node in each sub-graph, which is used as a document embedding candidate
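The shape of this pipeline can be sketched without external dependencies, using two deliberate simplifications. Connected components of a thresholded graph stand in for Louvain community detection, and weighted degree stands in for PageRank or other centrality measures. The threshold parameter and the function names are hypothetical. A graph library with real Louvain and PageRank implementations would be the natural replacement in practice.

```python
def build_graph(similarity, threshold):
    """Adjacency dict node -> {neighbour: weight}, keeping edges >= threshold."""
    n = len(similarity)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                graph[i][j] = similarity[i][j]
                graph[j][i] = similarity[i][j]
    return graph


def connected_components(graph):
    """Simplified stand-in for community detection: depth-first components."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def influential_nodes(graph, components):
    """Pick the node with the highest weighted degree in each component."""
    return [
        max(component, key=lambda node: sum(graph[node].values()))
        for component in components
    ]
```

Each returned node index refers to a text chunk, and those chunks together form the distributed representation of the document.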

Conclusion

This post highlights some of the novel approaches to using BERT for various text tasks. These approaches can be adapted to various use cases with minimal effort. More to come on Language Models, NLP, Geometric Deep Learning, Knowledge Graphs, contextual search and recommendations. Stay tuned!!
