TRF¶
Tandem Repeats Finder [1] (link)
version 4.07b
Program parameters¶
$ /exports/software/trf/trf-4.07b/trf
Tandem Repeats Finder, Version 4.07b
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Please use: /exports/software/trf/trf-4.07b/trf File Match Mismatch Delta PM PI Minscore MaxPeriod [options]
Where: (all weights, penalties, and scores are positive)
File = sequences input file
Match = matching weight
Mismatch = mismatching penalty
Delta = indel penalty
PM = match probability (whole number)
PI = indel probability (whole number)
Minscore = minimum alignment score to report
MaxPeriod = maximum period size to report
[options] = one or more of the following
-m masked sequence file
-f flanking sequence
-d data file
-h suppress html output
-r no redundancy elimination
-ngs more compact .dat output on multisequence files, returns 0 on success. You may pipe input in with this option using - for file name. Short 50 flanks are appended to .dat output. See more information on TRF Unix Help web page.
Recommended usage (plus html output suppression)
trf yoursequence.txt 2 7 7 80 10 50 500 -f -d -m -h
Program outputs¶
Outputs are written to input directory, using input filename suffixed by parameter values and file type
infile.2.7.7.80.10.50.500.mask¶
Hard-masked fasta format sequence file (option -m
):
>seq1
GTTGTTTTTCTGAGACCGAAAAGGCTTGTTTTTATGTAGCAAGTTCAAGAACTGGTTTCG
GTATCGCAGGGTATGTATGGTCTGAGACGTATAACTGTCACGANNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNTATGATGAATGGGTAATTTTTAAAGTATCTCACATAACTTTTAGATG
TAGTAAGACCATGTTTGCGAATCGAATCTCTTACTGGTTGAACGCCAAAGGACCTTCCAT
...
>seq2
...
infile.2.7.7.80.10.50.500.dat¶
Text file with space delimited blocks (option -d
):
Tandem Repeats Finder Program written by:
Gary Benson
Program in Bioinformatics
Boston University
Version 4.07b
Sequence: seq1
Parameters: 2 7 7 80 10 50 500
3404 3433 14 2.1 14 93 0 51 30 16 30 23 1.96 AGCAGCACTTGAGT AGCAGTACTTGAGTAGCAGCACTTGAGTAG
6344 6383 18 2.2 18 95 0 71 45 2 10 42 1.51 ATTATTATAAGGAATTAC ATTATTATAAGGAATTATATTATTATAAGGAATTACATTA
...
6237875 6237915 16 2.6 16 96 3 75 0 4 48 46 1.23 TGTGTGTGTGTGCGTG TGTGTGTGTGTGCGTGTGTGTGTGTGTGCGTGTGTTGTGTG
Sequence: seq2
Parameters: 2 7 7 80 10 50 500
15030 15056 14 1.9 14 100 0 54 66 0 0 33 0.92 TTAAATAAATAAAT TTAAATAAATAAATTTAAATAAATAAA
...
Each block contains:
cols 1+2: Indices of the repeat relative to the start of the sequence
col 3: Period size of the repeat
col 4: Number of copies aligned with the consensus pattern
col 5: Size of consensus pattern (may differ slightly from the period size)
col 6: Percent of matches between adjacent copies overall
col 7: Percent of indels between adjacent copies overall
col 8: Alignment score
cols 9-12: Percent composition for each of the four nucleotides
col 13: Entropy measure based on percent composition
col 14: Consensus sequence
col 15: Repeat sequence
Grid engine wrapper¶
trf.sh:
#!/bin/bash
#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
# define $DATADIR and $SEQFILE before/when calling this script, e.g.:
# export DATADIR=`pwd`
# qsub -v SEQFILE=seq.fa ...
WORKDIR=/scratch/$USER/trf/$JOB_ID
mkdir -p $WORKDIR
cp $DATADIR/$SEQFILE $WORKDIR
cd $WORKDIR
nice -n 10 /exports/software/trf/trf-4.07b/trf $SEQFILE 2 7 7 80 10 50 500 -f -d -m -h
rsync -a --remove-source-files ./${SEQFILE}* $DATADIR
qsub command¶
export DATADIR=`pwd`
qsub -v SEQFILE=seq.fa /path/to/trf.sh
parallel command¶
Split large sequence file into blocks with pyfasta, process each block
using parallel to submit to gridengine with one slot per job,
then combine outputs into single .mask
and .dat
files.
pip install pyfasta
pip install numpy
export DATADIR=`pwd`
export INFILE=seq # OMIT EXTENSION!
export SCRIPTDIR=~
pyfasta split -n 32 $INFILE
parallel -j 32 qsub -q nice.q -v SEQFILE={} -sync y $SCRIPTDIR/trf.sh ::: $INFILE.??.fa
cat $INFILE.??.fa.*.mask > $INFILE.fa.mask
cat $INFILE.??.fa.*.dat > $INFILE.fa.dat
rm *.log
rm $INFILE.??.fa.*.mask
rm $INFILE.??.fa.*.dat
rm $INFILE.??.fa