Hello, Darwin’s Dogs members! It is I, our group’s staff bioinformaticist, here to answer the ever-popular question: what is bioinformatics? Also known as computational biology, bioinformatics is just what you might think from the name: the application of computer science techniques to the study of biology.
Now, what does computer science have to add to biology? Suppose we are studying a gene that we suspect is mutated in cancer cells in dogs. We decide to investigate this by sequencing samples of cancerous cells and samples of normal cells so that we can compare the mutations we find.
We gather our data and we find that all of the cancerous samples are mutated in the candidate gene, but that none of the normal samples are! This provides intuitively clear evidence that the gene is, indeed, involved in cancer.
Often, though, we don’t know which gene (or genes) might be mutated in cancer. Instead, we only know that such genes exist somewhere in the genome. To find them, rather than just looking at one gene, we need to look at every gene! Imagine having to look through mutations in 20,000 genes by hand: not my idea of a fun Friday night (and Saturday, and Sunday, and…).
This is where computers come in: they are great at doing the same thing over and over. With some computer code, we can easily look through all of these genes and find which ones are mutated in the cancer cells, and we can do it quickly as well.
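To give a flavor of what "some computer code" might look like, here is a minimal Python sketch of the comparison described above. It assumes each sample has already been reduced to a set of mutated gene names; the function name, the data layout, and the gene names are hypothetical and chosen only for illustration, and a real analysis would also need statistics to handle noisy data.

```python
from typing import List, Set


def genes_mutated_only_in_cancer(
    cancer_samples: List[Set[str]],
    normal_samples: List[Set[str]],
) -> Set[str]:
    """Return genes mutated in every cancer sample and in no normal sample.

    Each sample is assumed (for this sketch) to be a set of the names of
    the genes found to be mutated in that sample.
    """
    if not cancer_samples:
        return set()
    # Genes that show up as mutated in all of the cancer samples.
    in_all_cancer = set.intersection(*cancer_samples)
    # Genes that show up as mutated in at least one normal sample.
    in_any_normal = set().union(*normal_samples)
    return in_all_cancer - in_any_normal


# Made-up example data: two cancer samples and two normal samples.
cancer = [{"GENE_A", "GENE_B"}, {"GENE_A", "GENE_C"}]
normal = [{"GENE_C"}, set()]
candidates = genes_mutated_only_in_cancer(cancer, normal)
print(sorted(candidates))
```

The same few lines work whether each sample lists 3 genes or 20,000, which is exactly why this kind of job is better left to a computer.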
This general situation has cropped up more and more across the field of biology. As our tools improve, we are able to gather more and more data, but if you have lots of data, you need computers to make sense of it all. The field of bioinformatics grew out of this need from both sides: biologists learned computer science so that they could analyze their data, and computer scientists learned biology so that they could work on the interesting computational problems being created.
Within bioinformatics, there are many different areas of work. They include creating machine learning algorithms to find hidden patterns in complex data, developing software that improves the usability and reliability of our computational tools, and building data visualization applications that help us intuitively explore datasets, to name a few. There are also many existing bioinformatics tools for all sorts of tasks, so you don’t need to be able to write everything from scratch to do interesting work.
It may be helpful for me to provide a concrete example from my own work. My position is mainly support-oriented: group members have a range of computer knowledge, and when they encounter computational problems that they lack either the time or the expertise to handle, I work with them to solve those problems. I also perform a lot of the computational work on group-wide projects.
One of the most common tasks I work on is creating a pipeline, which is a piece of code that allows a standard sequence of programs to be executed as a single command. This means that we can process a large number of data files of a similar type easily, quickly, and consistently. It also means that when we want to process similar files again in the future, we don’t have to spend time recreating the process. Instead, we can reuse the pipeline we already wrote, saving us even more time!
One pipeline I have written takes a file containing genomic information from a particular dog and aligns it to a standardized representation of the dog genome. While each dog is genetically much more similar to any other dog than to, say, a human, there are still many differences from dog to dog, and matching to the standard reference genome makes it easier to find genetic differences among individuals that could help to explain their different appearances and behaviors. Because we have many samples from different dogs, having this pipeline saves the time of manually starting each program in the process, waiting until it finishes (which may take hours, or even days), and then passing the resulting files to the next program. It also ensures that all of the samples are run through in the same way, so that we’re sure that any differences between them reflect true differences between the dogs rather than differences between how the samples were processed.
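For the curious, the core idea of a pipeline can be sketched in a few lines of Python. This is not the actual pipeline described above: the step names and file names are hypothetical stand-ins, and each step here only tags the file name so that the example runs anywhere. In a real pipeline, each step would launch an external program (for example with Python's `subprocess.run`) and hand its output file to the next step.

```python
from typing import Callable, Dict, Iterable, List

# A step takes the path of its input file and returns the path of its output.
Step = Callable[[str], str]


def align_to_reference(path: str) -> str:
    # Hypothetical stand-in: a real step would run an alignment program here.
    return path + ".aligned"


def sort_alignments(path: str) -> str:
    # Hypothetical stand-in for a follow-up processing program.
    return path + ".sorted"


def run_pipeline(input_file: str, steps: List[Step]) -> str:
    """Run every step in order, feeding each output into the next step."""
    current = input_file
    for step in steps:
        current = step(current)
    return current


def run_on_all_samples(samples: Iterable[str], steps: List[Step]) -> Dict[str, str]:
    """Process every sample with the same steps, in the same order."""
    return {sample: run_pipeline(sample, steps) for sample in samples}


STEPS: List[Step] = [align_to_reference, sort_alignments]
results = run_on_all_samples(["dog1.fastq", "dog2.fastq"], STEPS)
print(results)
```

Because the list of steps is written down once and reused for every sample, nobody has to babysit each program, and every dog's data goes through identical processing.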
Pipelines are just one of the many bioinformatics systems helping to drive biology (and Darwin’s Dogs!) forward. The field of bioinformatics is new and exciting, and I hope this post has helped provide a sense of what it involves!