Bioinformatics Tools


Tuesday, December 27, 2011

Circular dichroism code to help in data analysis

I was looking for code to rearrange the thermal melt data I get from CD (circular dichroism). I could not find any code that converts .jsw files to CSV in batch, and JASCO's Spectrum Analysis software does not help with that either; let me know if there is a batch conversion option for .jsw files. For now you have to convert each .jsw file to CSV individually and collect the CSVs in one folder. What you can do after converting is pull the data from all those files into a single file, which helps with data analysis. The code given below copies the data from all the files into one file, from 350 nm to 200 nm, with each file name used as the header for its mdeg and tension (HV) columns.

Steps:

1. Install Python if you do not already have it (http://www.python.org/getit/).

2. Copy all the CSV files into one folder, keeping their names.

3. Write the names of the CSV files in a text file and save it as file_name.txt in the same folder as your data and code.

    a. You can do this from the MS-DOS prompt or the Windows command line. Navigate to the directory whose contents you want to list (if you're new to the command line, familiarize yourself with the cd and dir commands), then type this command: dir /b > file_name.txt

    b. Open the newly created file_name.txt in that folder and check the file names. Remove any entry that is not one of your CSV files, including file_name.txt itself, so that only the CSV file names are listed in the text file.

4. Copy the code given below into Notepad and save it as a .py file (it is Python code) in the same folder.

5. Right-click the Python file, open it in Python IDLE and run it (press F5).

6. You will get a result file named final_file.txt. It is a tab-separated text file with your mdeg and HV data sorted from 350 nm to 200 nm; open it with Excel. You can change the code to suit your needs. For example, if you are taking data from 260 nm to 200 nm, change x=range(151) to x=range(61) and outfile.write(str(350-j)) to outfile.write(str(260-j)).

7. Hope that helps. If it works, thank Rhishikesh Bargaje, who wrote the code for me; if you run into a problem, write back and I will try to help.
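
Step 3 can also be done from Python instead of the command line. The following is only a minimal sketch, assuming the CSV files sit in one folder; the helper name write_file_list is made up for illustration. It lists only .csv files, so file_name.txt never ends up inside its own listing.

```python
import os

def write_file_list(folder='.', list_name='file_name.txt'):
    """Write the names of all .csv files in folder to list_name, one per line."""
    # Keep only CSV files so the listing and the script itself are left out.
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith('.csv'))
    with open(os.path.join(folder, list_name), 'w') as out:
        out.write('\n'.join(names))
    return names
```

Calling write_file_list() from the data folder should produce a file_name.txt that needs no manual clean-up. The names are sorted alphabetically, which may differ from the order you would type by hand.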



Code:



# Read the list of CSV file names, skipping blank lines (e.g. a trailing newline)
infile = open('file_name.txt','r')
s = [name.strip() for name in infile.read().split('\n') if name.strip()]
infile.close()

outfile = open('final_file.txt','w')

# Header row: one mdeg column and one HV column per input file
outfile.write('Wavelength')
for k in s:
    label = k.replace('.csv','').replace(' ','_')
    outfile.write('\t' + label + '_mdeg' + '\t' + label + '_HV')
outfile.write('\n')

# Read the XYDATA block of every file once, instead of once per wavelength
blocks = []
for i in s:
    infile = open(i,'r')
    t = infile.read().split('XYDATA\n')
    infile.close()
    blocks.append(t[1].split('\n\n')[0].split('\n'))

# One output row per wavelength, 350 nm down to 200 nm (151 points)
x = range(151)
for j in x:
    outfile.write(str(350-j))
    for rows in blocks:
        fields = rows[j].split(',')
        outfile.write('\t' + fields[1] + '\t' + fields[2])
    outfile.write('\n')
outfile.close()


##end of the code##
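
The script relies on each exported CSV containing a line that reads XYDATA, followed by comma-separated rows of wavelength, mdeg and HV, with a blank line ending the block. That layout is inferred from the script itself rather than from JASCO documentation, so treat the sketch below as an assumption to check against your own files. The function name parse_xydata and the sample text are made up for illustration.

```python
def parse_xydata(text):
    """Return (wavelength, mdeg, hv) string tuples from the XYDATA block."""
    # Everything after the XYDATA line, up to the first blank line, is data.
    block = text.split('XYDATA\n')[1].split('\n\n')[0]
    rows = []
    for line in block.split('\n'):
        fields = line.split(',')
        if len(fields) >= 3:  # skip empty or incomplete lines
            rows.append((fields[0], fields[1], fields[2]))
    return rows

# Hypothetical miniature file, not real instrument output.
sample = 'TITLE,demo\nXYDATA\n350,1.5,200\n349,1.7,201\n\nComment,end\n'
print(parse_xydata(sample))
```

Looking rows up by the wavelength in the first field, instead of by position as the script does, might also remove the need to edit range(151) when the scan range changes.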

Alternatively, if you are acquainted with R (download it from http://cran.r-project.org/ if you haven't), you can use the following script to get the same result for a thermal melt with a temperature range from 10 degrees to 70 degrees. Edit the code to customize it for your use if needed. You do not need the file_name.txt listing for this R code, but it may not work properly if there are other files in the data folder. Thank Shrikant if you find it useful.

Code:

 ##Start of the code##

CSV_Files=list.files(path=".",pattern="\\.csv",full.names=FALSE);
ResultantMatrix=matrix(nrow=151);
ResultantMatrix[,1]=c(350:200);
for(i in 1:length(CSV_Files))
{
    Current_File=read.table(CSV_Files[[i]],header=FALSE,blank.lines.skip=FALSE);
    tempM=matrix(nrow=151,ncol=2);
    k=1;
    for(j in 21:171)
    {
        temp=strsplit(as.character(Current_File[j,1]),split=",");
        tempM[k,1]=temp[[1]][2];
        tempM[k,2]=temp[[1]][3];
        k=k+1;
       
    }
    t=as.numeric(gsub(".*(\\d+.+?)\\.csv","\\1",CSV_Files[[i]]))+9;
    colnames(tempM)=c(t,t);
    ResultantMatrix=cbind(ResultantMatrix,tempM);
   
}

write.csv(ResultantMatrix,file="Result.csv");

##End of the code##

Sunday, December 4, 2011

Protein-Protein Docking Servers

I was looking for protein-protein docking servers to use in my study; here is a list of online servers that are popular and commonly used. Other software also gives good results for protein-protein docking. I have not listed it here because I am still compiling that list, and I will add it as soon as I am done. Have fun.

ClusPro: (http://nrc.bu.edu/cluster) represents the first fully automated, web-based program for the computational docking of protein structures. Users may upload the coordinate files of two protein structures through ClusPro's web interface, or enter the PDB codes of the respective structures, which ClusPro will then download from the PDB server (http://www.rcsb.org/pdb/). The docking algorithms evaluate billions of putative complexes, retaining a preset number with favorable surface complementarities. A filtering method is then applied to this set of structures, selecting those with good electrostatic and desolvation free energies for further clustering. The program output is a short list of putative complexes ranked according to their clustering properties, which is automatically sent back to the user via email.

RosettaDock: The RosettaDock protein-protein docking server predicts the structure of protein complexes given the structures of the individual components and an approximate binding orientation. The server uses the Rosetta 2.1 protein structure modeling suite. The RosettaDock server (http://rosettadock.graylab.jhu.edu) identifies low-energy conformations of a protein–protein interaction near a given starting configuration by optimizing rigid-body orientation and side-chain conformations. The server requires two protein structures as inputs and a starting location for the search. RosettaDock generates 1000 independent structures, and the server returns pictures, coordinate files and detailed scoring information for the 10 top-scoring models. A plot of the total energy of each of the 1000 models created shows the presence or absence of an energetic binding funnel. RosettaDock has been validated on the docking benchmark set and through the Critical Assessment of PRedicted Interactions blind prediction challenge.

ZDOCK, RDOCK: ZDOCK uses a fast Fourier transform to search all possible binding modes for the proteins, evaluating them based on shape complementarity, desolvation energy, and electrostatics. The top 2000 predictions from ZDOCK are then given to RDOCK, where they are minimized by CHARMM to improve the energies and eliminate clashes, and then the electrostatic and desolvation energies are recomputed by RDOCK (in a more detailed fashion than the calculations performed by ZDOCK). The developers tested these programs with a benchmark of 49 non-redundant unbound test cases, identifying a near-native structure (within 2.5 angstrom of the experimental structure) as the top prediction for 37% of the test cases, and within the top 4 predictions for 49% of the test cases. The superior performance of ZDOCK and RDOCK has also been demonstrated in a community-wide protein docking blind test, CAPRI. All software, as well as the benchmark, is freely available to academic users.
 
GPU.proton.DOCK: (Genuine Protein Ultrafast proton equilibria consistent DOCKing) is a state of the art service for in silico prediction of protein–protein interactions via rigorous and ultrafast docking code. It is unique in providing a stringent account of the self-consistency of electrostatic interactions and of the mutual effects of the docking partners on proton equilibria. GPU.proton.DOCK is the first server offering such a crucial supplement to protein docking algorithms, a step toward more reliable and more accurate docking results. The code (especially the Fast Fourier Transform bottleneck and electrostatic fields computation) is parallelized to run on a GPU supercomputer. The high performance will be of use for large-scale structural bioinformatics and systems biology projects, thus bridging the physics of the interactions with the analysis of molecular networks. Its authors propose workflows for exploring in silico charge mutagenesis effects. Special emphasis is given to an intuitive, user-friendly interface. The input consists of the atomic coordinate files in PDB format. The advanced user is provided with a special input section for adding non-polypeptide charges, extra ionogenic groups with intrinsic pKa values, or fixed ions. The output consists of docked complexes in PDB format as well as interactive visualization in a molecular viewer. The GPU.proton.DOCK server can be accessed at http://gpudock.orgchm.bas.bg/.

GRAMM-X: Protein docking software GRAMM-X and its web interface (http://vakser.bioinformatics.ku.edu/resources/gramm/grammx) extend the original GRAMM Fast Fourier Transformation methodology by employing smoothed potentials, refinement stage, and knowledge-based scoring. The web server frees users from complex installation of database-dependent parallel software and maintaining large hardware resources needed for protein docking simulations. Docking problems submitted to GRAMM-X server are processed by a 320 processor Linux cluster. The server was extensively tested by benchmarking, several months of public use, and participation in the CAPRI server track.

HexServer: HexServer (http://hexserver.loria.fr/) is the first fast Fourier transform (FFT)-based protein docking server to be powered by graphics processors. Using two graphics processors simultaneously, a typical 6D docking run takes 15 s, which is up to two orders of magnitude faster than conventional FFT-based docking approaches using comparable resolution and scoring functions. The server requires two protein structures in PDB format to be uploaded, and it produces a ranked list of up to 1000 docking predictions. Knowledge of one or both protein binding sites may be used to focus and shorten the calculation when such information is available. The first 20 predictions may be accessed individually, and a single file of all predicted orientations may be downloaded as a compressed multi-model PDB file. The server is publicly available and does not require any registration or identification by the user.

3D-Garden: a system for modelling protein–protein complexes based on conformational refinement of ensembles generated with the marching cubes algorithm. 3D-Garden is an integrated software suite for performing protein-protein and protein-polynucleotide docking. For any pair of biomolecular structures specified by the user, 3D-Garden's primary function is to generate an ensemble of putative complexed structures and rank them. The highest-ranking candidates constitute predictions for the structure of the complex. 3D-Garden cannot be used to decide whether or not a particular pair of biomolecules interacts. Complexes of protein and nucleic acid chains can also be specified as individual interactors for docking purposes.

Wednesday, November 23, 2011

Folder list to text file, text file to folders

How to make folders with names from a text file?

You could do this:
1. Make sure all your entries are in column A of your spreadsheet.
2. Edit/copy column A
3. Click Start / Run / notepad c:\folders.txt {OK}
4. Click Edit / paste. You now have a text file with all the folder names inside.
5. Click Start / run / cmd {OK}
6. Type this test command:
for /F "tokens=*" %* in (c:\folders.txt) do @echo md "D:\My Folders\%*"
{Enter}

If you're happy with the result, make it happen by typing this command:
for /F "tokens=*" %* in (c:\folders.txt) do @md "D:\My Folders\%*"
{Enter}
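
The same job can be sketched in Python for anyone who prefers a script to the command line. This is a minimal sketch, not a replacement for the commands above: the function name make_folders is made up for illustration, and it assumes one folder name per line in the text file.

```python
import os

def make_folders(list_path, parent):
    """Create one folder under parent for each non-empty line of list_path."""
    created = []
    with open(list_path) as handle:
        for line in handle:
            name = line.strip()
            if not name:
                continue  # ignore blank lines in the list
            target = os.path.join(parent, name)
            if not os.path.isdir(target):
                os.makedirs(target)  # also creates parent if it is missing
            created.append(target)
    return created
```

With the paths used above, the call would be make_folders(r'c:\folders.txt', r'D:\My Folders'). Printing the returned list gives a quick check of which folders were handled.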



How do I print a listing of files in a directory?

    Get to the MS-DOS prompt or the Windows command line.
    Navigate to the directory you wish to print the contents of. If you're new to the command line, familiarize yourself with the cd command and the dir command.
    Once in the directory you wish to print the contents of, type one of the below commands.

    dir > print.txt

    The above command will take the list of all the files and all of the information about the files, including size, modified date, etc., and send that output to the print.txt file in the current directory.

    dir /b > print.txt

    This command would print only the file names and not the file information of the files in the current directory.

    dir /s /b > print.txt

    This command would print the full path of every file and folder in the current directory and in all of its subdirectories.

    After doing any of the above steps the print.txt file is created. Open this file in any text editor (e.g. Notepad) and print the file. You can also do this from the command prompt by typing notepad print.txt.
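
If Python is installed, a similar listing can be produced without the command line. The sketch below is an approximation of dir /b and dir /s /b, not an exact equivalent: unlike dir, it lists files only and leaves folder names out. The function name list_files is made up for illustration.

```python
import os

def list_files(folder='.', recursive=False):
    """Return file names in folder, or full paths of all files below it."""
    if not recursive:
        # Roughly like dir /b, but files only
        return sorted(n for n in os.listdir(folder)
                      if os.path.isfile(os.path.join(folder, n)))
    paths = []
    # Roughly like dir /s /b: walk every subfolder and keep full paths
    for root, _dirs, files in os.walk(folder):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)
```

Joining the returned list with newlines and writing it to print.txt gives a file that can be opened and printed in the same way as described above.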

Saturday, November 5, 2011

In-silico characterization of proteins

BLAST: In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available depending on the type of query sequence. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller at the NIH and was published in the Journal of Molecular Biology in 1990.
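
To give a feel for what "local alignment" means, here is a small sketch of Smith-Waterman scoring, the exact dynamic-programming method that BLAST approximates with a much faster heuristic. It is not how BLAST itself is implemented, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score('TTACGTTT', 'GGACGGG'))
```

The two example sequences share only the stretch ACG, and the score reflects that shared region alone; the unrelated flanking letters do not pull it down, which is the point of a local rather than global comparison.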

CDD search: Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

PFAM: The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, Pfam also provides a supplement generated using the ADDA database in order to give more comprehensive coverage of known proteins. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

TMHMM: TMHMM predicts transmembrane helices in proteins using a hidden Markov model. A variety of tools are available to predict the topology of transmembrane proteins. To date no independent evaluation of the performance of these tools has been published. A better understanding of the strengths and weaknesses of the different tools would guide both the biologist and the bioinformatician to make better predictions of membrane protein topology.

SignalP: SignalP 4.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. 

STRING: STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: genomic context, high-throughput experiments, coexpression, and previous knowledge. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 5,214,234 proteins from 1133 organisms.

PROTPARAM: ProtParam is a tool which allows the computation of various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL or for a user-entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
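
Two of these parameters are simple enough to sketch by hand. The code below is an illustration, not ProtParam's own code: it assumes GRAVY is the mean Kyte-Doolittle hydropathy of the residues, and that the aliphatic index uses the commonly cited coefficients (2.9 for Val, 3.9 for Ile and Leu) applied to mole percentages. The function names are made up for illustration, and the sequence is expected to contain only the 20 standard amino acids.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / float(len(seq))

def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = seq.upper()
    n = float(len(seq))
    def pct(aa):
        return 100.0 * seq.count(aa) / n
    return pct('A') + 2.9 * pct('V') + 3.9 * (pct('I') + pct('L'))

print(gravy('MKVLAA'), aliphatic_index('MKVLAA'))
```

A positive GRAVY suggests a more hydrophobic protein and a negative one a more hydrophilic protein. For real work the values reported by ProtParam itself should be preferred.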

PROSITE: Search your query sequence for protein motifs, rapidly compare your query protein sequence against all patterns stored in the PROSITE pattern database and determine what the function of an uncharacterised protein is. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.
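
PROSITE patterns are close to regular expressions, so a simple pattern can be tried out locally. The sketch below handles only the basic syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) and is not the PROSITE scanning tool itself. The function name prosite_to_regex is made up for illustration, and the example uses the widely quoted N-glycosylation pattern N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip('.')          # patterns end with a period
    regex = regex.replace('-', '')       # '-' only separates positions
    regex = regex.replace('x', '.')      # x means any residue
    # Forbidden residues first, so the braces added next are not touched.
    regex = regex.replace('{', '[^').replace('}', ']')
    regex = regex.replace('(', '{').replace(')', '}')  # repeat counts
    regex = regex.replace('<', '^').replace('>', '$')  # sequence ends
    return regex

motif = prosite_to_regex('N-{P}-[ST]-{P}.')
hits = [m.start() for m in re.finditer(motif, 'AANGSAA')]
print(motif, hits)
```

Note that re.finditer reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead or a manual scan.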

InterPro: InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

GlobPlot Webservice:

Prediction of disorder:

  • DisEMBL - DisEMBL is our neural network based predictor.
  • DISOPRED - Predictor from David Jones' lab.

Function prediction in non-globular protein space:

  • ELM - The Eukaryotic Linear Motif Resource.
  • NetworKIN - Systematic Discovery of In Vivo Phosphorylation Networks.

Thesis on disorder and linear motifs

Function prediction in globular protein space:

  • SMART - SMART/Pfam domains

Domain boundaries:

  • DomCut - A domain boundary detector
  • DomPred - Domain predictor from David Jones' lab.

Synthetic Biology

Synthetic Biology Project @ SLRI - Applying GlobPlot.


Resources

Subcellular localization predictors:
Subcellular localization databases: