Getting Started with Python Strings for Bioinformatics (and everyone else!)
Much of today's bioinformatics work is done in the programming languages Python and R. These languages are used to analyze data ranging from single DNA sequences to massive spreadsheets of complex information. As an introduction, we are going to examine some important string functions and methods, since strings are the data type most commonly used to store genomic sequences.
To assign a string to a variable, type the variable name on the left, followed by an equals sign and then your string. If your string is a single line, you can use either single or double quotes. If your string is multi-line, you can use triple single or triple double quotes. Note: Whatever type of quotation mark you start with MUST also be the one you end with, or you will get a syntax error.
some_string_name = 'This is a string.'
some_other_string = "This is another string."
a_third_string = '''This is one way of
typing a multi-line string!'''
another_string = """This is another
way of typing a multi-line string!"""
You can also assign multiple variables at the same time by separating the variable names, and the strings assigned to them, with commas. Note: seq is short for sequence. We will be using this name as we go through the functions and methods.
seq1, seq2, seq3 = 'ATCG', 'TATA', 'GATTACA'
Another common task using genomic sequence strings is to “concatenate” or combine them together. Concatenating in Python is as easy as using a plus sign.
seq1 = 'ATCG'
seq2 = 'TATA'
seq3 = seq1 + seq2
If we did “print(seq3)”, we would be shown the new string “ATCGTATA”. One thing to remember is that unless you’re using what is concatenated immediately, be sure to assign it to a new variable or overwrite an existing one!
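As a small illustration of that last point, the sketch below shows that concatenation leaves the original strings unchanged unless the result is assigned somewhere. The variable names simply follow the example above.

```python
seq1 = 'ATCG'
seq2 = 'TATA'

# The result of + is a new string; seq1 and seq2 are left unchanged.
seq1 + seq2          # computed, then thrown away
print(seq1)          # ATCG

# Assign the result to a new variable to keep it.
seq3 = seq1 + seq2
print(seq3)          # ATCGTATA

# Or overwrite an existing variable using the += shorthand.
seq1 += seq2
print(seq1)          # ATCGTATA
```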
Some String Functions and Methods
The first method we are going to take a look at is “count()”. Count does exactly what it sounds like: it counts how many times the character or substring you specify appears in a string. One common task in genomic analysis is to count how many of each kind of nucleotide are present in a genetic sequence. Using count, we would have something like the following.
cytosine_count = seq.count('C')
If we had a sequence such as “ATCGATCGATCG”, cytosine_count would evaluate to 3.
The syntax for the count method is string.count('value').
The split method turns a given string into a list, where each element is the portion of the string between one delimiter and the next. Note: a delimiter is a character or string that separates pieces of text, such as a comma, space, or colon. If we have a string whose parts are separated by “|”, such as “thing 1 | thing 2 | thing 3”, we can use the split method to break it into a list.
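To make this concrete, here is a minimal sketch that counts all four nucleotides and derives the GC content from those counts. The names `counts` and `gc_content` are made up for illustration.

```python
seq = 'ATCGATCGATCG'

# Count each nucleotide with str.count().
counts = {base: seq.count(base) for base in 'ACGT'}
print(counts)  # {'A': 3, 'C': 3, 'G': 3, 'T': 3}

# GC content: the fraction of bases that are G or C.
gc_content = (counts['G'] + counts['C']) / len(seq)
print(gc_content)  # 0.5

# count() also accepts longer substrings; matches are non-overlapping.
motif_count = seq.count('ATCG')
overlap_count = 'AAAA'.count('AA')
print(motif_count)    # 3
print(overlap_count)  # 2, not 3
```

Because matches do not overlap, count() can undercount repeated motifs; keep that in mind when searching for patterns that can overlap themselves.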
some_string = "thing 1 | thing 2 | thing 3"
split_string = some_string.split("|")
Printing “split_string” will give us the list ['thing 1 ', ' thing 2 ', ' thing 3']. Notice that the spaces around each | are still there, because we split only on the | character itself.
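If the leftover spaces are unwanted, one common approach is to strip each piece after splitting. The sketch below uses the same pipe-separated string; the `header` value is a made-up, FASTA-style example and not taken from a real file.

```python
some_string = "thing 1 | thing 2 | thing 3"

# strip() removes leading and trailing whitespace from each piece.
clean_parts = [part.strip() for part in some_string.split("|")]
print(clean_parts)  # ['thing 1', 'thing 2', 'thing 3']

# Hypothetical pipe-delimited header, for illustration only.
header = ">db|ID0001|example_gene"
fields = header.lstrip(">").split("|")
print(fields)  # ['db', 'ID0001', 'example_gene']
```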
The slice() function, as the name suggests, returns a slice object, which you then place in square brackets after a string to extract part of it. slice() has three parameters: start, end, and step. Start and step are optional, and end is required. Start is the integer index where the slice begins; if it is omitted, slicing starts at the beginning of the string (index 0). End is the integer index where the slice stops, and the character at that index is not included. Step is an integer specifying the “step” of the slicing. For example, a step of 1 (the default) grabs every element between start and end, while a step of 2 grabs every other element.
# Given the string below
some_string = "Hi this is a string"
# Set x to a slice object with a start of index 3 and an end of index 10
x = slice(3, 10)
# Prints "this is" by grabbing the portion of the string that the slice object corresponds to
print(some_string[x])
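In everyday code, slices are usually written with the bracket shorthand string[start:end:step], which behaves the same way as a slice object. The sketch below shows both forms on a made-up DNA sequence; the variable names are only illustrative.

```python
seq = 'ATGGCCTAA'

# A slice object and the bracket shorthand select the same characters.
first_codon = seq[slice(0, 3)]
same_codon = seq[0:3]
print(first_codon, same_codon)  # ATG ATG

# A step of 2 takes every other character.
every_other = seq[0:9:2]
print(every_other)  # AGCTA

# A step of 3 takes the first base of each codon.
codon_starts = seq[::3]
print(codon_starts)  # AGT

# A step of -1 walks backwards, which reverses the string.
reversed_seq = seq[::-1]
print(reversed_seq)  # AATCCGGTA
```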
The join method takes an iterable of strings (such as a list or tuple) and joins its elements into one string, separated by the string you call join on.
some_tuple = ("thing1", "thing2", "thing3")
x = "#".join(some_tuple)
print(x)
# Prints "thing1#thing2#thing3"
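The same idea applies to sequence data. Here is a short sketch, using a made-up list of codons, that joins with and without a separator. It also shows that join() expects strings, so other values need converting first.

```python
codons = ['ATG', 'GCC', 'TAA']

# Join with an empty string to rebuild the full sequence.
sequence = ''.join(codons)
print(sequence)  # ATGGCCTAA

# Join with a space to keep the codon boundaries visible.
spaced = ' '.join(codons)
print(spaced)  # ATG GCC TAA

# join() only accepts strings, so convert numbers with str() first.
numbers = [3, 2, 4]
as_text = ','.join(str(n) for n in numbers)
print(as_text)  # 3,2,4
```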
A quick tip: combining join() with split() can be quite powerful. For example, let’s say that you have a large DNA sequence. One issue that comes up quite often is newline characters. The large sequence that you have is broken up by the '\n' character, which you cannot see but your computer can. If you try to analyze the sequence, the newline characters are treated as part of it: they are included in its length, and a motif that spans a line break will not be found. We can use the join and split methods in the following way:
seq_joined = ''.join(sequence.split('\n'))
Let’s break this down. We are creating a new variable called seq_joined. The portion ''.join() specifies that we want to join whatever is in the parentheses with nothing in between. sequence.split('\n') takes the DNA sequence that we called “sequence” and splits it at each newline character. Combined, this means that we first split the string at the newline characters, then join the resulting pieces (temporarily held in a list) with nothing between them, giving one long, unbroken sequence of DNA nucleotides.
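To see the effect, here is a minimal sketch using a made-up sequence wrapped over three lines, as it might appear in a file. The name `seq_joined_ws` is only illustrative.

```python
# A made-up sequence wrapped over three lines.
sequence = """ATCGATCG
GGCCTTAA
ATAT"""

# The two newline characters are counted as part of the string.
print(len(sequence))  # 22

seq_joined = ''.join(sequence.split('\n'))
print(seq_joined)       # ATCGATCGGGCCTTAAATAT
print(len(seq_joined))  # 20

# Alternative: split() with no argument splits on any run of whitespace,
# so it also copes with '\r\n' line endings and stray spaces.
seq_joined_ws = ''.join(sequence.split())
print(seq_joined_ws == seq_joined)  # True
```

Calling split() with no argument is often the safer choice for files that may have been saved on different operating systems.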
— — — —
These are just some of the many things that you can do with strings. Check out https://www.w3schools.com/python/python_ref_string.asp for a list of Python string methods!