CLUSTALW HELP

How to input

  
  • What is calculated?
  • Select sequence type
  • Do full multiple alignment (ALIGN options)
  • Use Quicktree (QUICKTREE options)
  • Calculate NJ tree (TREE options)
  • Bootstrap a NJ tree (BOOTSTRAP options)

  • Examples of Sequence Data Format


    How to get the result

      
  • WWW/E-mail

  • What is calculated?
      
    ALIGN  TREE  BOOTSTRAP  What is calculated
    ON     OFF   OFF        full multiple alignment only
    OFF    ON    OFF        NJ tree only
    OFF    OFF   ON         bootstrap of a NJ tree only
    ON     ON    OFF        multiple alignment, then NJ tree
    ON     ON    ON         multiple alignment, then NJ tree, then bootstrap
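
The combinations in the table above can be written down as a small lookup. This is only an illustrative sketch: the names DOCUMENTED_MODES and describe_mode are hypothetical, and treating unlisted combinations as errors is an assumption, since the help text does not say how the service handles them.

```python
# Documented ALIGN/TREE/BOOTSTRAP combinations, keyed by (align, tree, bootstrap).
DOCUMENTED_MODES = {
    (True, False, False): "multiple alignment only",
    (False, True, False): "NJ tree only",
    (False, False, True): "bootstrap of a NJ tree only",
    (True, True, False): "multiple alignment, then NJ tree",
    (True, True, True): "multiple alignment, then NJ tree, then bootstrap",
}


def describe_mode(align: bool, tree: bool, bootstrap: bool) -> str:
    """Return what is calculated for one documented ON/OFF combination."""
    key = (align, tree, bootstrap)
    if key not in DOCUMENTED_MODES:
        # Assumption: combinations missing from the table are rejected here.
        raise ValueError(f"combination {key} is not listed in the help table")
    return DOCUMENTED_MODES[key]


print(describe_mode(True, True, False))
```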

    detailed options

  • Select sequence type
    AUTOMATIC  The sequence type is detected automatically. All options use their default values.
    DNA        DNA analysis
    PROTEIN    PROTEIN analysis

  • Do full multiple alignment
       ALIGN "ON" and the individual ALIGN options take effect only when DNA or PROTEIN is selected for the TYPE option. Each option is explained below (default values are shown in green).
     
       Default value of each option

    Option        Default
    ALIGN
    OUTPUT        clustal
    OUTORDER      aligned
    MATRIX        blosum
    SEQNO_RANGE   OFF
    GAPOPEN       DNA: 15.0, PROTEIN: 10.0
    GAPEXT        DNA: 6.66, PROTEIN: 0.2
    GAPDIST       8
    MAXDIV        40
    ENDGAPS       OFF
    NOPGAPS       OFF
    NOHGAPS       OFF
    DOTSINOUTPUT  OFF
    QUICKTREE     OFF
    PWMATRIX      blosum
    PWGAPOPEN     DNA: 15.0, PROTEIN: 10.0
    PWGAPEXT      DNA: 6.66, PROTEIN: 0.1
    TREE
    DISTANCE      Kimura
    TOSSGAPS      ON
    OUTPUTTREE    phylip
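
Several defaults in the table above depend on the sequence type. The sketch below shows one way to resolve them for a given type. The DEFAULTS dictionary copies only part of the table, and the function name resolve_defaults and the overrides argument are hypothetical, not part of the service.

```python
# Subset of the defaults table; type-dependent values are nested dictionaries.
DEFAULTS = {
    "OUTPUT": "clustal",
    "OUTORDER": "aligned",
    "MATRIX": "blosum",
    "GAPOPEN": {"DNA": 15.0, "PROTEIN": 10.0},
    "GAPEXT": {"DNA": 6.66, "PROTEIN": 0.2},
    "GAPDIST": 8,
    "MAXDIV": 40,
    "PWGAPOPEN": {"DNA": 15.0, "PROTEIN": 10.0},
    "PWGAPEXT": {"DNA": 6.66, "PROTEIN": 0.1},
}


def resolve_defaults(seq_type, overrides=None):
    """Return option values for 'DNA' or 'PROTEIN', applying user overrides last."""
    if seq_type not in ("DNA", "PROTEIN"):
        raise ValueError("seq_type must be 'DNA' or 'PROTEIN'")
    resolved = {}
    for name, value in DEFAULTS.items():
        # Pick the type-specific value when the default differs by type.
        resolved[name] = value[seq_type] if isinstance(value, dict) else value
    resolved.update(overrides or {})
    return resolved


print(resolve_defaults("PROTEIN", {"GAPOPEN": 12.0}))
```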

      ALIGN
         
    Specify to execute the alignment. When ON is specified, alignment is executed.
     
      SHOW ALIGNMENT SCORE
         
    Check the box if you want to see the alignment score. By default the box is unchecked, and the following message appears in the output:
    "== Aligned score is not displayed =="
     
      OUTPUT
         
    Specify the multiple alignment output format. Default is clustal.
    clustal  standard
    fasta    FASTA format
    gcg      GCG format
    gde      GDE format
    phylip   PHYLIP format
    pir      PIR format
     
      OUTORDER
         
    Specify the order in which the sequences are printed in the alignment. Default is aligned.
    aligned  print in aligned order
    input    print in input order
     
      MATRIX
         
    Specify the scoring matrix, which describes the similarity between all possible pairs of amino acids. The default value is blosum. This option is effective only when amino acid sequences are aligned. (When the input data are recognized as nucleotide sequences, your selection is ignored.)
    blosum  These matrices appear to be the best available for carrying out database similarity (homology) searches.
    pam     These have been extremely widely used since the late '70s. They are also called Dayhoff matrices.
    id      This matrix gives a score of 1.0 to two identical amino acids and a score of zero otherwise.
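
The id matrix is simple enough to write out directly. The sketch below scores two already aligned protein sequences with it. The function names are hypothetical, and skipping gap positions is an assumption made for the example (in CLUSTALW, gaps are handled by the separate gap penalties).

```python
def id_score(a: str, b: str) -> float:
    """Identity matrix: 1.0 for two identical residues, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def pair_score(seq1: str, seq2: str) -> float:
    """Sum identity scores over aligned columns, skipping columns with a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        id_score(x, y)
        for x, y in zip(seq1, seq2)
        if x != "-" and y != "-"
    )


print(pair_score("MKV-L", "MRVAL"))
```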
     
      SEQNO_RANGE
         
    Specify the region over which the multiple alignment is executed. Default is OFF (the range is not applied).
    *1. This option takes effect only when both "START" and "LENGTH" are specified.
    *2. Do not attach "/" to the comment line.
     
          SHOW SEQNO_RANGE SCORE
            
    Check the box to view the SEQNO_RANGE option score. By default the box is unchecked, and the following message is displayed.
    "== SEQNO_RANGE score is not displayed =="
     
          START
            
    Specify the start position of the alignment. Range: 1 or more
    *This option takes effect only when SEQNO_RANGE is "ON" and "LENGTH" is also specified.
     
          LENGTH
            
    Specify the length of the region to align, counted from the start position (START).
    *This option takes effect only when SEQNO_RANGE is "ON" and "START" is also specified.
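
As an illustration of how START and LENGTH select a region, the sketch below slices a sequence using a 1-based start position. The function name select_range is hypothetical, and how the service treats a range that runs past the end of a sequence is not stated in this help, so the sketch simply returns what is available.

```python
def select_range(sequence: str, start: int, length: int) -> str:
    """Return `length` residues of `sequence` beginning at 1-based `start`."""
    if start < 1:
        raise ValueError("START must be 1 or more")
    if length < 1:
        raise ValueError("LENGTH must be 1 or more")
    begin = start - 1  # convert the 1-based position to a 0-based index
    return sequence[begin:begin + length]


print(select_range("ACGTACGT", 3, 4))
```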
     
      GAPOPEN
         
    Gap opening penalty.
    DNA:15.0 Range:0.0-100.0.
    PROTEIN:10.0 Range:0.0-100.0.
     
      GAPEXT
         
    Gap extension penalty.
    DNA:6.66 Range:0.0-10.0.
    PROTEIN:0.2 Range:0.0-10.0.
     
      GAPDIST
         
    Gap separation distance. Default:8 Range:0-100.
     
      MAXDIV
         
    If the identity of a sequence against all other sequences is lower than the specified value, that sequence is added later in the multiple alignment. Default:40 Range:0-100
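
The MAXDIV rule can be sketched as follows. This is a simplified illustration, not the CLUSTALW implementation: it assumes the sequences are already aligned to equal length and measures identity over columns without gaps, whereas CLUSTALW derives identities from its own pairwise alignments. The function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def delayed_sequences(aligned: dict, maxdiv: float = 40.0) -> list:
    """Names of sequences whose identity to every other sequence is below maxdiv."""
    delayed = []
    for name, seq in aligned.items():
        others = [
            percent_identity(seq, other)
            for other_name, other in aligned.items()
            if other_name != name
        ]
        if others and all(value < maxdiv for value in others):
            delayed.append(name)
    return delayed


example = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "TTTTTTTTTT"}
print(delayed_sequences(example))
```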
     
      ENDGAPS
         
    Select end gap separation: ON (ignore) or OFF (do not ignore). Default is OFF (do not ignore).
     
      NOPGAPS
         
    Select Pascarella gaps: ON (ignore) or OFF (do not ignore). Default is OFF (do not ignore).
     
      NOHGAPS
         
    Select hydrophilic gaps: ON (ignore) or OFF (do not ignore). Default is OFF (do not ignore).
     
      DOTSINOUTPUT (DDBJ original option)
         
    The DOTSINOUTPUT option is available for alignment only; it cannot be used together with the "TREE" and/or "BOOTSTRAP" options.
    When ON, dots are used in the output. Default is OFF.
         
    example :
    CLUSTAL W (1.81) multiple sequence alignment
    
    A1-1_A101      GGCCGACCCTTCGGCCCGGGGGCC
    A1-2_A102      ......T.................
    A1-3_A103      NNNN..T.T...............
    A1-4_A104      NNNN.G..................
    A2             NNNN..T................-
    AX             NNNN......A.............
    A3-1           NNNN................A...
    cis-AB         NNNN..T...........C.....
    O-1_O101       ....-...................
    O-2_O201       TATT-G....A.A..T...A....
    O-3            NNNN.G.G.........A......
    O-4_O102       NNNN-....C..............
    O-5_O103       NNNN-G..................
    O-6_O202       NNNN-.....A.A..T...A....
    O-7_O203       NNNN-G....A.A.TT...A....
    B-1_B101       .....G.G...T.A..A.C..A..
    B-2_B102       NNNN.G.G...T.A..A.C.....
    B-3_B103       NNNN.G.G.....A..A.C..A..
    B(A)           NNNN.G.G........A.C..A..
    B3-1           NNNN.G.G...T.A..A.C..AT.
                   ************************
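
Judging from the example above, the first sequence is printed in full and each later sequence shows a dot wherever it matches the first one. The sketch below reproduces that display rule under this assumption. The function name to_dots is hypothetical, and keeping gap characters as they are is inferred from the example rather than stated in the help.

```python
def to_dots(reference: str, other: str) -> str:
    """Replace residues of `other` that match `reference` with dots."""
    if len(reference) != len(other):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        # Gaps are printed unchanged; only true matches become dots.
        "." if o == r and o != "-" else o
        for r, o in zip(reference, other)
    )


reference = "GGCCGACC"
for row in ("GGCCGATC", "GGCC-ACC"):
    print(to_dots(reference, row))
```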
    
     
      
  • QUICKTREE
          Select whether to use Fast Pairwise Alignment. Default is OFF (not used). The effective options differ depending on whether QUICKTREE is "ON" or "OFF". The options and their default values are as follows.
         
    QUICKTREE "OFF"
      PWMATRIX   blosum
      PWGAPOPEN  10
      PWGAPEXT   0.1
    QUICKTREE "ON"
      KTUPLE     1
      WINDOW     5
      SCORE      percent
      TOPDIAGS   5
      PAIRGAP    3
     
          PWMATRIX
             Specify the scoring matrix used when the SLOW algorithm is chosen for making an alignment. This parameter is used only for the pairwise alignments that are needed to estimate distances between pairs of amino acid sequences and to construct a guide tree. Therefore this option is effective only when amino acid sequences are aligned, and is ignored when the input data are recognized as nucleotide sequences. The default value is blosum.
            
    blosum  These matrices appear to be the best available for carrying out database similarity (homology) searches.
    pam     These have been extremely widely used since the late '70s. They are also called Dayhoff matrices.
    id      This matrix gives a score of 1.0 to two identical amino acids and a score of zero otherwise.
     
          PWGAPOPEN
             Specify gap opening penalty. Default (DNA):15.0 Range:0.0-100.0. Default (PROTEIN):10.0 Range:0.0-100.0.
     
          PWGAPEXT
             Specify gap extension penalty. Default (DNA):6.66 Range:0.0-10.0. Default (PROTEIN):0.1 Range:0.0-10.0.
     
          KTUPLE
             Specify the word size. Default (DNA):2 Range:1-4. Default (PROTEIN):1 Range:1-2.
     
          WINDOW
             Specify the window size around the best diagonals. Default (DNA):4 Range:1-50. Default (PROTEIN):5 Range:1-50.
     
          SCORE
             Specify PERCENT or ABSOLUTE. Default:percent
     
          TOPDIAGS
             Specify the number of best diagonals. Default (DNA):4 Range:1-50. Default (PROTEIN):5 Range:1-50.
     
          PAIRGAP
             Specify the gap penalty. Default (DNA):5 Range:1-500. Default (PROTEIN):3 Range:1-500.
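
To give a feel for what KTUPLE and TOPDIAGS control, the sketch below finds matching words of length KTUPLE between two sequences and ranks the diagonals by number of matches. It is a simplified illustration of word-based (k-tuple) matching, not the CLUSTALW fast alignment itself: WINDOW, PAIRGAP and SCORE are not modelled, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(a: str, b: str, ktuple: int) -> Counter:
    """Count k-tuple matches per diagonal (diagonal = position in a - position in b)."""
    index = defaultdict(list)
    for j in range(len(b) - ktuple + 1):
        index[b[j:j + ktuple]].append(j)
    hits = Counter()
    for i in range(len(a) - ktuple + 1):
        for j in index.get(a[i:i + ktuple], ()):
            hits[i - j] += 1
    return hits


def top_diagonals(a: str, b: str, ktuple: int = 2, topdiags: int = 4) -> list:
    """Return up to `topdiags` diagonals with the most k-tuple matches."""
    return [diag for diag, _ in diagonal_hits(a, b, ktuple).most_common(topdiags)]


print(top_diagonals("ACGTTGCA", "GGACGTTG"))
```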


  • Calculate NJ tree
       The following options take effect when TREE is ON.
     
      TREE
         
    Select whether the phylogenetic tree is calculated by the NJ method, and specify the output format options. Default is ON.
         
    DISTANCE    Kimura
    TOSSGAPS    ON
    OUTPUTTREE  phylip
     
      DISTANCE (DDBJ original option)
         
    Specify the distance correction method. Default value is Kimura. Only Kimura and p-distance can be specified for PROTEIN. The pink color boxes are DDBJ's original option formats.
     
    Method of phylogenetic tree

    Methods for constructing a phylogenetic tree from nucleotide or amino acid sequences may largely be classified into distance-matrix methods and character-state methods. In a distance-matrix method, a distance matrix consisting of the evolutionary distances (numbers of nucleotide or amino acid substitutions) between all possible pairs of the sequences analyzed is generated, and the phylogenetic tree that best fits the matrix is chosen. In a character-state method, on the other hand, the sequences are compared directly, and the phylogenetic tree that best fits the assumed pattern of nucleotide or amino acid substitution is chosen.

    In CLUSTALW, the phylogenetic tree is constructed by the neighbor-joining (NJ) method, which is a distance-matrix method. When nucleotide sequences are analyzed, the p distance method, Kimura method, Tamura method, Tajima-Nei method, Gojobori-Ishii-Nei method, Tamura-Nei method, and so on, are available for estimating the number of nucleotide substitutions between sequences. These methods differ in the pattern (model) of nucleotide substitution assumed for estimating the evolutionary distance.

    Generally, the bases T (U) and C are pyrimidines, and A and G are purines, and the physicochemical properties are similar within each group. In fact, the rates of nucleotide substitution between T and C and between A and G (transitions) are empirically known to be greater than those of the other types of substitutions (transversions). In addition, since the equilibrium frequencies of T, C, A, and G usually differ in a genome, the rate of nucleotide substitution appears to depend on the frequency of the base that replaces the original base. Other mechanisms are also thought to make the rate of each nucleotide substitution (T -> C, A -> G, etc.) different.

    These arguments suggest that assuming complex patterns of nucleotide substitution allows for accurate estimation of the numbers of nucleotide substitutions. However, the more complex models contain a greater number of parameters to be estimated, and the variances (standard errors) of the estimates become larger as the number of parameters increases. Since the parameter values are estimated from the sequence data analyzed, the accuracy of the estimates depends on the number of sequences, sequence length, and sequence divergence, etc. Therefore, the pattern of nucleotide substitution suitable for the analysis of sequences depends on the sequence data analyzed, and some methods are available for finding the fittest model for given sequence data.

    In CLUSTALW, the default method used for estimating the number of nucleotide substitutions is the Kimura method, because it is one of the most widely used methods. However, if the model that best fits the sequence data analyzed differs from the Kimura model, incorrect results may be obtained. In such cases, it may be useful to try other models in the analysis.

    Similarly, the p distance method and Kimura method are available for estimating the number of amino acid substitutions between sequences in CLUSTALW. (Here the Kimura method for estimating the number of amino acid substitutions is totally different from the Kimura method for estimating the number of nucleotide substitutions.) The default method is the Kimura method, but the p distance method may also be useful for some data.
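
For amino acid sequences, the Kimura correction is commonly given as d = -ln(1 - p - 0.2 p^2), where p is the proportion of differing sites. The sketch below assumes that this is the correction meant here. The formula is not stated in this help, the function name is hypothetical, and any special handling of very divergent sequences is not modelled.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Kimura's empirical correction for the proportion p of differing residues."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The logarithm is undefined once the sequences are too divergent.
        raise ValueError("distance is undefined for this proportion of differences")
    return -math.log(inner)


for p in (0.0, 0.1, 0.5):
    print(p, round(kimura_protein_distance(p), 4))
```

The corrected distance is always at least as large as p itself, which reflects multiple substitutions at the same site.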

     
    Method: p-distance
    Model:  None
    Note:   Proportion of difference

    Method: Kimura (Kimura-2-parameter)
    Model:
             T    C    A    G
        T    -    α    β    β
        C    α    -    β    β
        A    β    β    -    α
        G    β    β    α    -
    Note:   Distance estimated by assuming that the rates of transition and transversion are different.

    Method: Jukes-Cantor
    Model:
             T    C    A    G
        T    -    α    α    α
        C    α    -    α    α
        A    α    α    -    α
        G    α    α    α    -
    Note:   Distance estimated by assuming that all types of substitutions occur at the same rate.

    Method: Tamura
    Model:
             T            C            A            G
        T    -            κπ_GC        1-π_GC       π_GC
        C    κ(1-π_GC)    -            1-π_GC       π_GC
        A    1-π_GC       π_GC         -            κπ_GC
        G    1-π_GC       π_GC         κ(1-π_GC)    -
    Note:   Distance estimated by assuming that the rates of transition and transversion are different, and taking into account the equilibrium frequencies of GC.

    Method: Tajima-Nei
    Model:
             T       C       A       G
        T    -       απ_C    απ_A    απ_G
        C    απ_T    -       απ_A    απ_G
        A    απ_T    απ_C    -       απ_G
        G    απ_T    απ_C    απ_A    -
    Note:   Distance estimated by taking into account the equilibrium frequencies of T, C, A, and G.

    Method: Gojobori-Ishii-Nei
    Model:
             T    C    A    G
        T    -    β    γ    β
        C    α    -    α    δ
        A    ε    β    -    β
        G    α    ζ    α    -
    Note:   Distance estimated by assuming that the rates are different not only for substitutions between GC and TA, but also for others.

    Method: Tamura-Nei
    Model:
             T        C        A        G
        T    -        α2π_C    βπ_A     βπ_G
        C    α2π_T    -        βπ_A     βπ_G
        A    βπ_T     βπ_C     -        α1π_G
        G    βπ_T     βπ_C     α1π_A    -
    Note:   Distance estimated by assuming not only that the rates of transition and transversion are different but also that the rates between TC and AG are different, and taking into account the equilibrium frequencies of T, C, A, and G.

          α, α1, α2, β, γ, δ, ε, ζ, κ: rates of substitution
          π_T, π_C, π_A, π_G, π_GC: equilibrium frequencies
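
For the three simplest nucleotide methods, the distance can be computed directly from the proportions of differing sites. The sketch below uses the standard textbook formulas: p-distance p, Jukes-Cantor d = -(3/4) ln(1 - 4p/3), and Kimura 2-parameter d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. These formulas are not given in this help, so whether the service computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_sites(a: str, b: str):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    sites = transitions = transversions = 0
    for x, y in zip(a.upper().replace("U", "T"), b.upper().replace("U", "T")):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def p_distance(a: str, b: str) -> float:
    sites, ts, tv = classify_sites(a, b)
    return (ts + tv) / sites


def jukes_cantor(a: str, b: str) -> float:
    return -0.75 * math.log(1.0 - 4.0 * p_distance(a, b) / 3.0)


def kimura_2p(a: str, b: str) -> float:
    sites, ts, tv = classify_sites(a, b)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


s1, s2 = "AAAAAAAAAA", "GAAAAAAAAC"
print(p_distance(s1, s2), jukes_cantor(s1, s2), kimura_2p(s1, s2))
```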
     
      TOSSGAPS
         
    Specify whether positions with gaps are ignored. Default is ON.
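
Ignoring positions with gaps can be pictured as removing every alignment column in which at least one sequence has a gap before distances are calculated. The sketch below illustrates that idea. The function name toss_gaps is hypothetical, and "-" is assumed to be the only gap character.

```python
def toss_gaps(alignment: list) -> list:
    """Drop every column that contains a gap in at least one sequence."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have the same length")
    kept_columns = [column for column in zip(*alignment) if "-" not in column]
    # Rebuild each sequence from the columns that were kept.
    return ["".join(residues) for residues in zip(*kept_columns)] or [""] * len(alignment)


print(toss_gaps(["AC-T", "A-GT", "ACGT"]))
```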
     
      OUTPUTTREE
         
    Specify the output format (options are phylip, nj and dist). Default is phylip.
    phylip  Phylip format tree output.
    nj      CLUSTAL format tree output.
    dist    Phylip distance matrix output.

  • Bootstrap a NJ tree
      BOOTSTRAP
         
    Select whether bootstrap analysis of the NJ tree is executed, and specify the output format options. Default is OFF. The following options take effect when BOOTSTRAP is ON.
     
      DISTANCE (DDBJ original option)
         
    Specify the distance correction method. Default value is Kimura. Only Kimura and p-distance can be specified for PROTEIN. The pink color boxes are DDBJ's original option formats.
     
    The available methods and their substitution models are the same as those listed for DISTANCE under "Calculate NJ tree".
     
      TOSSGAPS
         
    Specify whether positions with gaps are ignored. Default is ON.
     
      OUTPUTTREE
         
    Specify the output format (options are phylip and nj). Default is phylip.
    phylip Phylip format tree output.
    nj CLUSTAL format tree output.
     
      COUNT
         
    Specify number of bootstraps. Default:1000 Range:1-10000
     
      SEED
         
    Specify the seed number for the bootstrap. Default:111 Range:1-1000.
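
COUNT and SEED can be understood through the usual bootstrap procedure, in which alignment columns are resampled with replacement and a tree is rebuilt from each resampled alignment. The sketch below shows only the resampling step. The function name is hypothetical, and Python's random generator differs from the one used by CLUSTALW, so the same seed will not reproduce the service's replicates.

```python
import random


def bootstrap_replicates(alignment: list, count: int = 1000, seed: int = 111):
    """Yield `count` alignments built by sampling columns with replacement."""
    rng = random.Random(seed)  # a fixed seed makes the replicates repeatable
    n_columns = len(alignment[0])
    for _ in range(count):
        columns = [rng.randrange(n_columns) for _ in range(n_columns)]
        yield ["".join(seq[c] for c in columns) for seq in alignment]


for replicate in bootstrap_replicates(["ACGT", "ACGA"], count=3, seed=111):
    print(replicate)
```

Each replicate has the same number of columns as the input, and running the function twice with the same seed gives the same replicates.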

    Examples of Sequence Data Format

       With either result option (WWW or E-mail), when your query is too large (a large number of sequences, or very long sequences), the result might not be displayed normally on the web screen. In such a case, please reduce the size of the query sent at one time by decreasing the number of sequences or shortening the sequence lengths.

    FASTA format: CLUSTALW uses the first 30 characters of each sequence title as the sequence name, which must be unique.
    >title1
    CGGTGA.....................................
    GAGTAATGGAATG..............................
    >title2
    CTTGATT....................................
    GAGTAATGGAATG..............................
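
Because only the first 30 characters of each title are used as the sequence name, two long titles can collide even if they differ later on. The sketch below checks a FASTA text for such collisions before submission. The function names are hypothetical, and taking the 30 characters from the text immediately after ">" is an assumption about how the titles are truncated.

```python
from collections import Counter


def fasta_names(text: str, width: int = 30) -> list:
    """Return the sequence names: the first `width` characters of each title line."""
    return [
        line[1:].strip()[:width]
        for line in text.splitlines()
        if line.startswith(">")
    ]


def duplicated_names(text: str, width: int = 30) -> list:
    """Return names that would appear more than once after truncation."""
    counts = Counter(fasta_names(text, width))
    return sorted(name for name, n in counts.items() if n > 1)


sample = ">title1\nCGGTGA\n>title2\nCTTGATT\n"
print(fasta_names(sample), duplicated_names(sample))
```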
    
     
      
    clustal format OUTPUT OPTION:Default
    CLUSTAL W(1.83) multiple sequence alignment --> necessary
                                          > leave more than one blank line
                                          >
    title-a -CAAAGTCATATTTCA...................
    title-b -CAAAGTCATATTCCA...................
    title-c -CAAAGTTATAT----...................
             ****** ****
    
    title-a -GTCCTCTGCGTTCCT...................
    title-b -TGGCTCTGGGTTCCG...................
    
     
      
    GCG format OUTPUT OPTION:GCG
    PileUp                              --> necessary
                                          > leave more than one blank line
                                          >
      MSF: 464 Type: N Check: 2031 ..
    
    Name: title-a oo Len: 464 Check: 8529 Weight: 1.00
    Name: title-b oo Len: 464 Check: 3342 Weight: 1.00
    Name: title-c oo Len: 464 Check: 2325 Weight: 1.00
    
    //
    
    
    
    title-a . CAAAGTCAT ATTTTA.................
    title-b . CAAAGTCAT ATTTTA.................
    title-c . CAAAGTCAT ATTTTA.................
    
     
      
    Other Formats (GDE, PIR) are also acceptable if outputs of clustalw are used exactly as they are. Do not attach "/" to the comment line.

    Result

       You can choose how to receive the CLUSTALW output. If you select "WWW", the output is shown on your screen. If you select "E-mail", the result is sent to your e-mail address.

    How to see the result screen

      
  • CLUSTALW analysis result
  • TreeView operation screen

       CLUSTALW analysis result
        
         
    (1)Your query sequence is automatically recognized by ClustalW as DNA or amino acid (PROTEIN).
     
         
         
    (2)An alignment file can be downloaded by right clicking (Windows) or double clicking (Macintosh) the file name "query.aln" (circled in red in the above screen figure). The file is transferred as an application/x-align MIME type. If an appropriate application is installed and configured as a helper application, the file can be read automatically.
    The result of a multiple alignment can also be viewed with CINEMA (a colored multiple alignment editor) or JalView (a multiple alignment editor written in Java).
     
         
         
    (3)An alignment guide tree file can be downloaded by right clicking (Windows) or double clicking (Macintosh) the file name "query.dnd" on the screen. The file is transferred as a biotree/newick MIME type. If TreeView is installed and configured as a helper application, you can view the phylogenetic tree automatically.
     
         
         
    (4)The NJ analysis result file can be downloaded by right clicking (Windows) or double clicking (Macintosh) the file name "query.ph" (circled in red on the screen). The file is transferred as an application/x-treeview MIME type. If TreeView is installed as a helper application, you can view the phylogenetic tree.
     
         
         
    (5)The bootstrap analysis result file for the NJ tree can be downloaded by right clicking (Windows) or double clicking (Macintosh) "query.phb". The file is transferred as an application/x-treeview MIME type. You can view the phylogenetic tree by opening the file with TreeView.
     
       TreeView operation screen
          This is the screen of the phylogenetic tree viewer "TreeView". TreeView (Windows version or Macintosh version) must already be installed on your computer. For the usage and manual of TreeView, please refer to the TreeView site.
         
         
    (6)If you start the TreeView application, select "Open it with the treev32.exe".
     
         
         
    (7)The above screen shows the result displayed when the TreeView application is run.
     


    Last updated: Jan. 06, 2012