I haven’t posted here much, as my supervisor is a bit of a workaholic and it’s been tough to find time to do everything that needs doing while still fitting in healthy non-work activities. However, I have permission from my supervisor to share something from my work: a set of three simple scripts I wrote. All of these scripts run in the UNIX shell.
They’re probably not the most efficient way to do things but, considering I only started learning to program last year, I’m rather proud of them. (I’m not counting some basic HTML or fiddling around with the OHRRPGCE plotscripting language, which I used to do a few years ago.) You can find them here on GitHub.
For these to work I already had the correctly formatted protein sequences and gene groupings. The first script, “aligningproteinfamilies”, went through the list of groups and used T-Coffee to align the proteins in each group. The result came out in Clustal format, which I found tricky to work with, so I wrote “reformatclustal” to pull out only the bits that were important to me: the actual sequences and the consensus scoring. In the new format it was easier to look for a perfect alignment, find its position in the line and then extract the amino acid sequence corresponding to that position with “maxconslenresultv2”.
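For anyone curious what the last two steps look like in practice, here is a rough sketch of the idea. It is written in Python rather than shell and is not the actual scripts: the function names `parse_clustal` and `longest_conserved` and the toy alignment are made up for illustration. It assumes a standard Clustal layout (a header line, then blocks of name and sequence lines followed by a consensus line) and treats a “perfect alignment” as the longest unbroken run of `*` in the consensus.

```python
import re

def parse_clustal(text):
    """Join the blocks of a Clustal alignment into full-length strings.

    Returns ({name: aligned_sequence}, consensus). Assumes the first line
    is the header and each block ends with a consensus (or blank) line.
    """
    seqs, consensus = {}, ""
    start = width = 0  # column where residues begin, and current block width
    for line in text.splitlines()[1:]:  # skip the header line
        if not line or line[:1].isspace():
            # First blank/indented line after a block is its consensus line.
            if width:
                consensus += line[start:start + width].ljust(width)
                width = 0
        else:
            name, residues = line.split()[:2]
            start = line.index(residues, len(name))
            width = len(residues)
            seqs[name] = seqs.get(name, "") + residues
    if width:  # file ended without a consensus line for the last block
        consensus += " " * width
    return seqs, consensus

def longest_conserved(seqs, consensus):
    """Return (start, residues) for the longest run of '*' columns, or None."""
    run = max(re.finditer(r"\*+", consensus),
              key=lambda m: len(m.group()), default=None)
    if run is None:
        return None
    # Every sequence is identical in '*' columns, so any one of them will do.
    reference = next(iter(seqs.values()))
    return run.start(), reference[run.start():run.end()]

# Invented toy alignment, only to show the shape of the input.
SAMPLE = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seqA      MKTAYIAKQR",
    "seqB      MKTGYIAKQR",
    "          ***.******",
    "",
    "seqA      QISFVK",
    "seqB      QISFVR",
    "          *****:",
])

if __name__ == "__main__":
    seqs, consensus = parse_clustal(SAMPLE)
    print(longest_conserved(seqs, consensus))  # (4, 'YIAKQRQISFV')
```

The start position is counted from zero, and the conserved stretch in the toy data spans the break between the two blocks, which is why the blocks are joined before searching.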