Introduction to Data Structures for Bioinformatics and Computational Biology
It is no secret that the amount of biological data being generated each day is increasing exponentially. In fact, there is evidence that in terms of genetic sequencing data is following its’ own version of Moore’s Law. There is no bottleneck in our ability to produce more data, but there is in processing it. This is where data structures and algorithms come in.
First off, what is a data structure? Wikipedia says that “in computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification”. To put it a little more simply, a data structure is a representation of some thing that we want to manipulate.
Nearly all information stored in a computer is stored in some kind of data structure. Some common data structures that you may have heard of are linked lists, stacks and queues. There are a LOT of other data structures, but we will only focus on a few for now. For the purposes of these articles, we will assume that a good data structure will do the following for us:
1. Enable effective computational manipulation.
2. Produce correct results in finite amount of time.
Why do we want data structures to do these two things? Because we want to be able to take the data that we have and transform it in some way. Whether this be accessing it and visualizing it, or turning it into another form that will be easier for us to manipulate. We also don’t want to be waiting forever for our computers to run, so data structures that allow us to produce results in a finite amount of time is paramount.
Intervals
The first data structure that we will take a look at is called an interval. An interval is exactly what it sounds like. It is a sequence of integers. Intervals are typically written in the format: [ j..k ]. There are some key things to note here.
- Square brackets “[ ]” are used to denote that this is an interval.
- The letters j and k represent the numbers that the interval spans. i.e., this interval begins at j, ends at k, and includes both of those values.
- The two dots “..” indicate that there are some numbers in between the j and the k.
Different mathematical equations can be applied to determine what numbers are in the interval. For example, if we say that j is equal to 1 and k is 10, we could simply start with j and add 1 to produce the next number, add one again to produce the third number, etc. all the way up to 10. Or we could have some massive quadratic equation and end up with a k value of 1,894,182. It’s really up to you.
To go one step further, we can say that there is some number i in our interval of [ j..k ]. This is the same as saying that i is an integer such that, j ≤ i ≤ k. Why is this true? Because we said that our interval starts at j, and ends at k. In order for i to be a part of the interval, it MUST be in between j and k. Therefore i must be greater than or equal to j, and less than or equal to k.
Tuples
The next data type we are going to take a look at is a very important one called a tuple.
A tuple is an ordered sequence of n elements, some of which may be repeated.
Why is a tuple so important? Because it is one of the best data structures to represent a DNA, RNA, or protein sequence. We’ll come back to why this is, but for now, here’s a picture of a random DNA sequence with its’ base pairs shown. Can you think of why a tuple is great for representing this?
A tuple is typically denoted like the following: <x1, x2, …, xn>. Again, there are a couple of things to note here.
- The < and > symbols are what shows that this is a tuple. If we simply wrote “<>”, that would mean that we have an empty tuple.
- x1, x2, … xn means that this tuple starts at x1 and continues until some element xn. (Note: xn is read as “x sub n”, and represents some element x at position n).
Two other helpful tips are what |T| and T[i] mean. |T| is the same as saying “the size of tuple T”, or the number of elements inside of tuple T. T[i] means “the element at position i of tuple T”.
We can combine what we know from the interval data structure and write something like T[i, j] which is the same as <T[i], T[i + 1], …, T[j]>. Or, the Tuple T that ranges from values i to j. Note that in this case we are saying that this tuple strictly holds integers, but in reality a tuple can hold any kind of element. One note on tuples: unlike something like a list, a tuple is immutable meaning that once it has been created, it cannot be changed. The elements are fixed in order. But above we said that tuples were great to use when working with genomic data! How great are they if they are fixed? Well, there are a handful of commonly used functions in programming with tuples that we will go over now.
- Head. Writing “Head(T)” means “Return T[0]”, or return the first element of tuple T.
- Tail. Writing “Tail(T)” means “Return T[|T|]”, or return the last element of tuple T.
- Push. Writing “Push(x, T)” means take element x and “push” it into T. Or in other words, take element x and return a new tuple with x as the first element of tuple T. Because we cannot change T, we are instead creating a new tuple with x, and then copying all values of T to it afterward.
- Pop. Writing “Pop(T)” means remove the first element of tuple T. Or in other words, return a new tuple that ranges from the second element of T to the last (T[1…|T|). Note: remember that indexing in programming starts at 0.
- Inject. Writing “Inject(x, T)” means to add x to the end of tuple T. Again this involves creating a new tuple, and adding element x as the new last element.
- Eject. Writing “Eject(T)” means to return all elements in tuple T EXCEPT for the last element.
Using all of these functions together in different ways allows us to develop new data structures. These include stacks, queues, and deque.
Stack
The traditional way to think of a stack, is of a stack of plates. Stacks are tuples that have the functions head, push, and pop. When a plate is added to the top of the stack, the only way to get further down the stack is by removing plates until you get to the one that you want. The same is true for elements in a stack.
Queue
A queue is a line. At some point each and every one of us has stood in a line. Queues are tuples that use the functions head, inject, and pop. We can find out who is first in line (using head), we can add someone to the end of a line (using inject), and we can let someone into the store by removing them from the front of the line (pop).
Deque
No, that’s not a typo. A deque is a double ended queue. While stacks and queues restricted us to only inserting and removing from one location, a deque allows for insertion and removal from both ends. Deques are tuples that use the functions head, tail, push, inject, pop and eject.
How does this relate to genomic sequences?…
Think about the picture of the DNA strand above. DNA is a continuous string of base pairs that is read from the 5' end to the 3' end. This is the ONLY way that DNA works. While things like DNA repair complicate matters a bit, we are going to continue to move forward with the assumption that DNA is only read in one direction, meaning that the data structures mentioned above are the best for storing and manipulating genetic information. They are not the only ones, but they are what we will focus on in upcoming articles.