Bag-of-words and k-mers

Bag Of Words

The bag of words (BOW) is a strategy usually adopted when we need to convert a document of strings, e.g., words in a book, into a vector of pure numbers that can be used for any sort of mathematical evaluation.

At first, this method was used in text and image classification, but it has recently been introduced into bioinformatics, and can be successfully applied to DNA/AA sequences’ repertoires.

With the BOW approach, we can redefine a highly complex document, such as a picture or a book, into a smaller set of low-level features, called codewords (also a codebook, dictionary or vocabulary).

The quality and origin of features is arbitrary, i.e. if we want to analyse a book or a series of books, we can choose as features all the words present inside the books, or the letters, or the combination of letters. As for the origin, the features can be all words present in the same books or all the words in the English dictionary, etc. As a result, the length of a codebook is defined by the number of features chosen.

  1. Tom likes to go to the cinema on Sunday.
  2. Martin likes to go to the cinema on Tuesday.
  3. George prefers to visit science museums.
  4. Carl prefers to visit art museums.

First, we need to choose what type of feature we want to use. Because we are using four short sentences, we can use all the words present in the whole document.

Carl 0001

Codeword for each sentence/item: In this table is represented each sentence as vector. In the first column there is the codeword with all single words present in the documents. In the following columns are listed the frequency of that word in each sentence. As result each sentence is now turned into a vector of numbers, in which any position of the vector represents a word and the numerical value its frequency in the sentence.

At this point, we can use these numerical vectors for any type of mathematical analysis.

For example, we can consider each vector as a point in an l-dimensional space, where l is the length of the codebook.

We can measure the distance between any points using the Euclidian distance. In the table below we can see that items 1-2 and items 3-4 are closer to each other than the other combinations.


Euclidean distance between items: In this table is represented the Euclidian distance between each vector originated by the table above. With this table we can see that vectors which originated from sentences with more words in common have a shorter distance.

Application to the protein sequence repertoires

The first problem we encounter when we apply the BOW model to the protein repertoires is the choosing of the codeword features. If we use all elements of the protein repertoires in the same way as we did for the example in the previous section, we would have an incredibly long codeword. As long as the protein sequence.

This would lead to a codeword of an extremely high number of dimensions, posing several problems not just for the computational time required for its calculation, but also for the over-fitting issue that machine learning methods will incur.

The solution adopted here is to use all contiguous sub-strings of amino acids present in the repertoires. This approach is called k-mers.


By the term k-mer is indicated all possible sub-strings of length k that are contained in a string. In our case we will refer to all possible sub-sequences of amino acids (of length k) present in the AA sequence. Namely, single amino acids for k=1, duplets for k=2 and so on. k-mers are usually used for sequence analysis, motif identifications and genome assembly algorithms.

The number of all possible k-mers present will form the codebook size that we will use. This size grows exponentially with the formula n^k, where n is the given possibilities and k is the length of the sub-string, i.e.: A sub-string of two amino acids would have all the combinations of 20 amino acids, such as: AA, AC, AD etc., for a total of 400 possible combinations.

Once we have chosen the size of our string, we divide our sequences into all continuous subsets of that size. The number of k-mers for a given sequence of length L is: L-k+1. In other words, for a string size of 2 and a sequence of 10 amino acids we obtain 9 features, as from the first position of the sequence, we move our string one position at a time, until we reach the penultimate amino acid.


The codewords will all be triplet combinations of 20 amino acids, for a total of 8,000 features:

AAA (1), AAC (2), AAD (3), … , YYV (7998), YYW (7999), YYY (8,000)

The sequences will then be parsed as in Figure below:

Continuous sub-strings of k-mer in a protein sequence: In this plot I represented all the continuous sub-strings of k-mer of length 3, referred as triplets, present in a protein sequence. On the top, is present an example of a protein sequence sequence: CASSLSGSFFGPG, we proceed from the left-hand side of the sequence and we select the first sub-string of three amino acids (triplet), then, we slide one position to the right, selecting all sub string until the end of the sequence. All continuous sub string of triplets we select are listed below the sequence: CAS, ASS, etc. The total number of sub-strings is given by L-k+1, where L is the length of the string and k is the length of the sub-string, in this case the result is 11.

We repeat this process for all the sequences, then we just need to count the frequency of each triplet, the result being seen below:

Count of all triplets in a protein sequence repertoire: In this plot I represent a protein repertoire as a histogram with the frequency (y-axis) of each triplet (x-axis). We can see that some triplets are overwhelmingly overrepresented while others are almost if not totally absent.
0 0 vote
Article Rating
Notify of
Inline Feedbacks
View all comments