What is Text Similarity and How to Implement it?

Introduction to Text Similarity

Text Similarity is the process of comparing one piece of text with another and measuring how similar they are. It's essentially about determining the degree of closeness between two texts. Dealing with text, sentences or words brings us into the realm of Natural Language Processing (NLP), where we use different NLP techniques to process the raw text and help the model detect similarity more quickly and reliably.

Now this is all cool, but let's understand why we need to find the similarity of text in the first place, and why machine learning concepts are needed for the purpose. In our daily life, we constantly need to compute the similarity in meaning between texts:

  1. Search engines need to model the relevance of a document to a query, beyond the overlap in words between the two. For instance, question-and-answer sites such as Quora or Stack Overflow need to determine whether a question has already been asked before.
  2. Selecting the most similar product for a customer shopping on an online platform when the exact product is unavailable.
  3. Checking similarity of multiple documents or letters.
  4. Choosing the most appropriate or closest job role or profile for a person’s resume.

How does text similarity work?

To start with our text similarity task, we need to convert the input text into a more machine-readable form by transforming it into embeddings: numeric vectors that the machine can use to calculate similarity.

Algorithms to transform the text into embeddings

Term Frequency-Inverse Document Frequency (TF-IDF)

Before looking at the TF-IDF approach, consider the crudest way of converting words into embeddings: the document-term matrix. In a document-term matrix, each row is a phrase, each column is a token, and the value of a cell is the number of times that word appears in the phrase.

Now, the major problem with this approach is that all words in the phrase are treated with the same importance; there is no way to prioritize certain words. To overcome this problem, TF-IDF was introduced as a numeric approach that assigns an importance measure to the word embeddings.

TF-IDF obtains this importance score by taking a term's frequency (TF) and multiplying it by the term's inverse document frequency (IDF). The higher the TF-IDF score, the more frequent the term is within the document and the rarer it is across the corpus, and hence the higher its importance.

How to calculate TF-IDF?

The term frequency is how many times a term or word appears in the document. The IDF is the logarithm of the total number of documents divided by the number of documents that contain the term. Multiplied together, these two scores give the TF-IDF weight of the word in that document.
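As a small, hedged illustration, scikit-learn's TfidfVectorizer turns a handful of documents into TF-IDF weighted vectors (note that scikit-learn applies a slightly smoothed variant of the IDF formula described above):

```python
# TF-IDF embeddings for a toy corpus using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)    # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())        # the vocabulary (columns of the matrix)
print(tfidf_matrix.toarray())                    # TF-IDF weight of each term in each document
```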

Word2Vec

Word2Vec uses a neural network to capture the contextual meaning of words in a corpus and build embeddings, using either Continuous Bag of Words (CBOW) or Skip-Gram. Both methods take one-hot encoded representations of the input words, and the weights of the hidden layer of the trained Word2Vec model are the embeddings we want.

The CBOW model is a neural network trained to predict the center word from its surrounding context words. It takes the one-hot encodings of the context words as input, and the output is the target word that best fits those neighboring words.

On the other hand, the Skip-Gram model works in just the opposite way of the CBOW model. It takes the target word as input and predicts the neighbouring words.
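As a minimal sketch, the gensim library can train Word2Vec embeddings on a toy corpus; sg=0 selects CBOW and sg=1 selects Skip-Gram (the two-sentence corpus below is purely illustrative, real embeddings need far more data):

```python
# Training a tiny Word2Vec model with gensim and reading out word embeddings.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "ate", "the", "mouse"],
    ["the", "mouse", "ran", "from", "the", "cat"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: Skip-Gram

vector = model.wv["cat"]                      # the learned embedding for "cat"
print(model.wv.similarity("cat", "mouse"))    # cosine similarity of two word vectors
```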

However, the problem with Word2Vec is that it assigns a single representation to each word, regardless of context. Words with multiple meanings, such as 'clean', 'back' or 'light', convey different things in different situations, but the model ends up with a representation that is an average of those senses and does not capture any one of them well.

Contextual Word Embeddings (ELMo)

This approach was developed to assign embeddings to words based on their context, which is achieved by training a recurrent language model on unlabeled text. ELMo builds a two-layer bidirectional LSTM language model and forms a word's embedding by combining the hidden states of the layers, with the combination weights tuned for the downstream task, so the same word can receive different vectors in different sentences. ELMo also ditches Word2Vec-style lookup tables entirely, relying on character-level CNNs for the initial word representation.

Transformers

The Transformer architecture has revolutionized NLP and set state-of-the-art performance with models such as BERT and RoBERTa. The main feature of the transformer is attention, the concept that originally helped with alignment in seq2seq translation architectures, which it uses to capture relationships between the words of a sentence. The attention block captures similarity between words in the spaces into which the weight matrices of the different heads project their representations.

For example, BERT uses transformer blocks to train a language model with a masking technique: instead of guessing the next word, the system has to predict words that have been masked out of the sentence. This way it can use the entire context for prediction, not only the left context.

Metrics to calculate the similarity between embeddings

To compute similarity scores we have multiple metrics that measure distances between vectors; below we discuss the most commonly used ones.

Cosine Similarity

Cosine similarity measures the similarity between two vectors of an inner product space. It measures the cosine of the angle between two embeddings and determines whether they point in roughly the same direction. When the embeddings point in the same direction, the angle between them is zero, so their cosine similarity is 1. When the embeddings are perpendicular to each other, the angle is 90 degrees and the cosine similarity is 0. Finally, when the angle between them is 180 degrees, the cosine similarity is -1.

Cosine similarity ranges from 1 to -1, where 1 represents most similar and -1 represents least similar.

For example:

cosθ = v1.v2 / ( |v1| * |v2| )

where v1.v2 is the dot product and |v1| is the magnitude of v1

Let us say v1 and v2 are vectors.

            v1 = [ 1 3 2 ]

            v2 = [ 5 0 -3]

In this case, the dot product of v1 and v2 will be:

           v1 . v2 = 1 * 5 + 3 * 0 + 2 * -3 = 5 + 0 + -6 = -1

The magnitudes will be:

           |v1| = sqrt( 1^2 + 3^2 + 2^2 ) = sqrt( 1 + 9 + 4 ) = sqrt(14)

           |v2| = sqrt( 5^2 + 0^2 + (-3)^2 ) = sqrt( 25 + 0 + 9 ) = sqrt(34)

So the cosine similarity is -1 / ( sqrt(14) * sqrt(34) ) ≈ -0.046, meaning the two vectors are nearly orthogonal.
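The same calculation takes only a couple of lines of NumPy, which also completes the final division for us:

```python
# Cosine similarity of the two example vectors, computed with NumPy.
import numpy as np

v1 = np.array([1, 3, 2])
v2 = np.array([5, 0, -3])

cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)   # ≈ -0.046: the vectors are nearly orthogonal
```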

 

Jaccard Similarity

Jaccard similarity is also known as the Jaccard index or Intersection over Union. It is the intersection of the two sentences/texts being compared divided by their union, i.e. the number of common words over the total number of unique words. The Jaccard similarity score ranges from 0 to 1, where 1 represents most similar and 0 least similar.

For example:

Jaccard Similarity(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|, which means the number of things common to d1 and d2 divided by the number of things in d1 and d2 together

Let us say d1 and d2 are vectors.

     d1 = [ 1 3 2 ]

     d2 = [ 5 0 3]

In this case, d1 ∩ d2 is: [3] and d1 ∪ d2 is [1 3 2 5 0]

Jaccard similarity between d1 and d2 is 1/5 = 0.2
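A minimal sketch of the same computation in Python, using sets; the helper also works directly on tokenized sentences:

```python
# Jaccard similarity as intersection over union of two sets.
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_similarity([1, 3, 2], [5, 0, 3]))              # 0.2, as computed above
print(jaccard_similarity("the cat ate the mouse".split(),
                         "the mouse ate the cat".split()))   # 1.0: identical word sets
```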

Euclidean Distance

The Euclidean distance formula is the most common way to calculate the distance between two points. We subtract the two vectors component-wise, square the differences, add them up and take the square root. The same formula is used to measure the distance between the embedding vectors of words, treating each vector as a set of coordinates.

   Euclidean Distance = sqrt( ∑ (xi − yi)^2 ), where i runs from 1 to n (the number of vector components)

For Example:

Let us say v1 and v2 are vectors.

     v1 = [ 1 3 2 ]

     v2 = [ 5 0 -3]

The Euclidean distance will be:

Euclidean distance = sqrt( (1-5)^2 + (3-0)^2 + (2-(-3))^2 )
                   = sqrt( 16 + 9 + 25 )
                   = sqrt( 50 )
                   = 5 * sqrt(2)
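The same result with NumPy:

```python
# Euclidean distance between the two example vectors.
import numpy as np

v1 = np.array([1, 3, 2])
v2 = np.array([5, 0, -3])

print(np.linalg.norm(v1 - v2))   # sqrt(50) ≈ 7.071
```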

 

In my honest opinion, Jaccard similarity works on the set of unique words, whereas cosine similarity works on the full sentence vectors. If duplicated words do not matter, Jaccard similarity is a good choice; otherwise cosine similarity is better at measuring the similarity between two vectors even when words are repeated.

Euclidean distance takes a bit more time and computing power than the other two. Choosing between Euclidean distance and cosine similarity comes down to whether magnitude matters: cosine similarity only cares about orientation, while Euclidean distance is also affected by the vectors' magnitudes.

How to get ready for the text similarity task?

Well, now that we know how text similarity actually works, we need to do some trimming and dressing up of our input data (the texts) to make it more ‘presentable’ to the model. This is where Natural Language Processing comes into play, doing the crucial work of crunching the data. These steps help deliver faster and more efficient results by making it easier for the machine learning model to predict similarity.

Now, let’s have a look at how NLP makes the text more machine-readable. 

Text Preprocessing using NLP

The preprocessing involves the following steps, as required:

  • Normalization – This step involves removing special characters, links, hashtags and punctuation, and transforming the text to lower case. Numbers can also be removed to avoid giving them undue weight in the embeddings.


  • Tokenization – After the above step, depending on the task, we can build a corpus of the whole text in a single list. This corpus/text is then split into lists of tokens (smaller phrases/parts). NLTK tokenizers can be used to tokenize the text.

  • Removing Stopwords – ‘Stopwords’ are the words that occur most frequently in a language and do not add much meaning to the text; they pad out the vocabulary but are useless for our purpose, such as articles or prepositions. We can also add our own list of custom stopwords depending on the task. Examples: ‘the’, ‘a’, ‘an’, ‘or’, ‘will’, ‘and’, ‘can’, etc.


  • Stemming – Stemming cuts a word down to its ‘stem’. The stemmed word sometimes does not match the morphology or meaning of the original word, but the goal of stemming is to map related words to the same stem without considering context. We can use the Snowball stemmer or Porter stemmer for this operation. Example: ‘runs’ and ‘running’ become ‘run’.

  • Lemmatization – Lemmatization maps a word to its root/base form, known as the lemma. It is similar to stemming, except that the lemma is a dictionary word and the context (part of speech) is also considered; as a result it is slower but more accurate than stemming. Once lemmatization is done, we can extract the concentrated keys (lemmas) using the POS tags and append them to a list as the processed input. A short preprocessing sketch follows this list.

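Below is a rough preprocessing sketch using NLTK, covering normalization, tokenization, stopword removal, stemming and lemmatization. The example sentence is made up, and the exact NLTK resources needed can vary between versions (newer releases use punkt_tab for tokenization):

```python
# Text preprocessing sketch with NLTK: normalize, tokenize, remove stopwords,
# then stem or lemmatize the remaining tokens.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the resources used below
nltk.download("punkt")
nltk.download("punkt_tab")   # used by word_tokenize on newer NLTK versions
nltk.download("stopwords")
nltk.download("wordnet")

text = "The 2 cats were running across the garden!"

# Normalization: lower-case and keep only letters and spaces
normalized = re.sub(r"[^a-z\s]", " ", text.lower())

# Tokenization with an NLTK tokenizer
tokens = nltk.word_tokenize(normalized)

# Stopword removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Stemming and lemmatization (in practice you would usually pick one)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                     # e.g. 'running' -> 'run'
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])    # verb lemmas, e.g. 'running' -> 'run'
```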

 

Understanding Lexical Similarity and Semantic Similarity

When we work on similarity projects, we mainly need to compute how ‘close’ two pieces of text are, either in meaning (the semantic approach) or in surface closeness (the lexical approach).

Lexical Similarity

When we look for similarity between two texts superficially, considering only word-to-word comparison and no context, it is called the lexical approach. Here the overlap between the two sets of words is used to compute the similarity score.

For example, if we compare the sentence ‘The cat ate the mouse.’ with the sentence ‘The mouse ate the cat.’, the similarity score will be very high, because the words in both sentences are almost the same, whatever the sentences actually mean.

Lexical similarity can be computed at various granularities: at the character level (string similarity), at the word level, or at the phrase level (lexical chain level) using the concentrated lemmatised phrases.

Where is lexical similarity used ?

  • Clustering – Grouping similar texts together based on their similarity.
  • Keywords matching – Selecting texts based on given keywords like finding resumes with similar skills set keywords or job role keywords.

Semantic Similarity

Semantic similarity refers to the similarity of two pieces of text when their contextual meaning is considered. It takes into account not only which words occur but also the context and order in which they occur.

Types of Semantic similarity:

  • Knowledge-Based Similarity – Determining semantic similarity between the concepts in the corpus.
  • Statistical-Based Similarity – Determining semantic similarity based on feature vectors learned from the corpus.
  • String-Based Similarity – Combines the above two approaches to find the similarity between non-zero vectors.

Semantic similarity is often used to address NLP tasks such as paraphrase identification, automatic question answering and removing similar sentences (redundancy removal).

For example, take the two sentences ‘The horse is carrying the load’ and ‘The load is carrying the horse’. If we compute the semantic similarity score between them, the score will be very low, because the context of the sentences is evaluated. If the lexical similarity approach were used instead, the score would be comparatively higher, since the surface-level words are the same.

Different open source Text Similarity algorithms available today!

GloVe model

GloVe is an unsupervised learning algorithm for obtaining vector representations for words.

These vector representations are achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.

Pre-processed sentences, in the form of lists of tokens, are fed to the GloVe model, which provides their vector representations. The vector representations of two sentences can then be passed to a similarity metric (such as cosine, Euclidean or Jaccard) to get a similarity score.

GloVe stands for Global Vectors for word representation. It builds on two concepts: global matrix factorization and local context windows.

Now, let us understand how this works:

  1. Global matrix factorization: In NLP, global matrix factorization is the process of using matrix factorization methods from linear algebra to reduce large term-frequency matrices, which record the occurrence or absence of words in documents. Applied to term-frequency matrices, this is known as Latent Semantic Analysis (LSA). The output of this step is a reduced-rank matrix in which the jᵗʰ vector is the representation of word Wⱼ.

  2. Local context window: Local context window methods are CBOW and Skip-Gram, which we discussed earlier. Skip-Gram works well with small amounts of training data and represents even rare words well, whereas CBOW trains several times faster and has slightly better accuracy for frequent words.

Instead of extracting the embeddings from a neural network that is designed to perform a single task, such as predicting the neighboring words (Skip-Gram) or predicting the focus word (CBOW), GloVe optimizes the embeddings directly so that the dot product of two word vectors equals the log of the number of times the two words co-occur near each other. This forces the model to encode the global co-occurrence statistics of words.
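As a hedged sketch of the pipeline described earlier, the snippet below assumes a locally downloaded GloVe file (glove.6B.100d.txt is an example filename), averages the word vectors of each preprocessed sentence, and compares the two sentence vectors with cosine similarity:

```python
# Sentence similarity with pretrained GloVe vectors: load vectors, average per sentence,
# then compare with cosine similarity.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def sentence_vector(tokens, vectors, dim=100):
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

glove = load_glove("glove.6B.100d.txt")   # assumed local copy of a pretrained GloVe file

s1 = sentence_vector("the cat ate the mouse".split(), glove)
s2 = sentence_vector("the mouse ate the cat".split(), glove)

cosine = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(cosine)
```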

Disadvantages of the GloVe model:

  • The model is trained on the co-occurrence matrix of words, which takes a lot of memory for storage, especially if hyperparameter tuning is done.
  • Cannot handle or learn the representation for out-of-vocabulary words (not used in training).
  • It struggles to separate some opposite word pairs. For example, it fails to differentiate between ‘good’ and ‘bad’, as they are usually located very close to each other in the vector space.

 Read the official paper published by Stanford on GloVe for more details.


fastText model

The fastText model is another word embedding method, developed by the Facebook NLP research team. It is an extension of the word2vec model and works similarly to the GloVe model. Instead of learning vectors for words directly, fastText represents each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once a word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings. As a result, fastText works well with rare words: even if a word wasn’t seen during training, it can be broken down into n-grams to get its embedding.
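A minimal sketch of this with gensim's FastText implementation (the two-sentence corpus is purely illustrative); note how a vector can be built even for a word that never appeared in training:

```python
# Training a tiny fastText model with gensim; character n-grams give vectors for unseen words.
from gensim.models import FastText

sentences = [
    ["the", "cat", "ate", "the", "mouse"],
    ["the", "mouse", "ran", "from", "the", "cat"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1)

print(model.wv["cat"][:5])        # embedding of an in-vocabulary word
print(model.wv["catlike"][:5])    # built from character n-grams, even though never seen in training
```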

As for the disadvantages, since it is based on word2vec, its embeddings are still static and context-independent, so contextual models such as BERT tend to outperform it. The vectors it builds for out-of-vocabulary words from character n-grams are also only approximations and can be weaker than those of words seen during training.

Have a look at the official Github repository of the fastText model.

BERT model

BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model by Google. It uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks, and thanks to its WordPiece subword tokenizer it has no difficulty handling out-of-vocabulary words.

In fact, BERT is not a conventional ‘bag of words’ approach where we convert words into term-frequency matrices and then embeddings and feed them into an LSTM or RNN model. BERT is based on the Transformer architecture, in which multiple layers of Transformer encoders are stacked. It embeds the meaning of words into densely packed vectors, known as dense vectors: every value within the vector carries information, unlike sparse vectors such as one-hot encoded vectors where the majority of values are 0.

Explore the Github repository of BERT to understand how to use it in real code.
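As a rough sketch of how BERT embeddings can be used for sentence similarity with the HuggingFace transformers library (the model name and the mean-pooling strategy are common choices, not the only way to do this):

```python
# Sentence similarity with a pretrained BERT model: encode both sentences,
# mean-pool the token embeddings, then compare them with cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The horse is carrying the load.", "The load is carrying the horse."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded)

# Mean-pool the token embeddings, ignoring padding positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```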

The best part about BERT is that we can fine-tune the base model with more suitable hyperparameters to adapt it to our own text. Let’s have a look at the BERT architecture:

BERT architecture

BERT has outperformed most language processing models that follow traditional approaches, thanks to the revolutionary Transformer approach to handling text and words. It is by far one of the most favoured models in the NLP community because of its robustness and flexibility, with one disadvantage: it is very compute-intensive at inference time, which means that using it in production at scale can become costly.

Conclusion

So now that we have discussed the most common ways to deal with text similarity tasks, you can explore these algorithms and find more details about them. There are many other algorithms and metrics too; based on the scenario or problem statement, you can choose any model to extract the word embedding vectors of the pre-processed texts and check the similarity score. I have worked with the GloVe model and with different BERT-based transformer models developed by the HuggingFace community.

In my opinion, transformers are the best option, as they handle out-of-vocabulary words easily. In a fast-growing world like ours, millions of new words are being coined, especially in the IT world, and it is practically impossible to keep retraining static embedding models such as Word2Vec or GloVe to cover them.
