Searching for homology through similarity of sequences

Translated from "Donner un sens au génome", La Recherche, n° 332, June 2000


To return to sequencing: if we use the classic metaphor in which the DNA bases are seen as letters, then once the text (the sequence) has been obtained, the first difficulty is to identify the words (the genes) which make it up. Next comes the question of their meaning, that is, the function of the genes.

A biologist’s first reflex, when a new sequence is available, is to compare it, together with its potential translations into protein sequences, with those already held in banks and databases, looking for similar rather than identical sequences. Apart from sequencing errors, any differences represent mutations which have accumulated in the course of evolution. If there is enough similarity, the two fragments are considered to result from divergent evolution from one ancestral fragment, and they are said to be homologues. If the fragment includes a gene, homology suggests that the proteins the two genes code for have a similar function, but it does not prove this, as will be seen later. The search for similarity has led to a wealth of technical and methodological developments, both to shorten computing time when a sequence is compared with all the sequences that are already known, and to take prior knowledge about evolutionary mechanisms into account when designing algorithms.

There are limits to what this strategy can achieve. A similarity search may fail simply because no homologous sequence has yet been identified. When the yeast genome was sequenced, almost half its genes were completely unknown, and they did not resemble anything found in the banks. Such genes are known as "orphans". Moreover, relying exclusively on the information in the databanks means that if this information is incorrect, as is all too often the case, the errors are propagated, resulting in what some researchers call a "house of cards". So it is essential to have access to direct gene identification methods which do not rely on homology. Such identification is much easier when the genome in question comes from a prokaryote (such as a bacterium) than when it comes from a eukaryote (an organism whose cells have a nucleus, such as yeast, plants or animals).
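
To make the idea of a similarity search concrete, here is a minimal sketch in Python. It scores a query against each sequence in a small bank using a local alignment in the style of Smith and Waterman, then ranks the bank from most to least similar. The article does not name any particular algorithm, so this choice is an assumption made for illustration. The scoring values (match, mismatch, gap), the function names and the toy sequences are likewise hypothetical. Real searches use substitution matrices that reflect evolutionary knowledge, and heuristics to avoid comparing the query exhaustively with every known sequence.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Illustrative scoring only: +2 per identical letter, -1 per
    substitution, -2 per inserted or deleted letter.
    """
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_bank(query, bank):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, local_alignment_score(query, seq)) for name, seq in bank.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy bank; real databanks hold millions of sequences.
bank = {
    "seq_close": "ATGGCGTACGTTAGC",     # differs from the query by one letter
    "seq_distant": "TTTTCCCCGGGGAAAA",  # unrelated to the query
}
query = "ATGGCCTACGTTAGC"

for name, score in rank_bank(query, bank):
    print(name, score)
```

A high score only indicates similarity. Deciding whether that similarity is strong enough to infer homology requires a statistical threshold, which this sketch does not attempt to provide.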