#识别字符串中所有的重复片段(重复模式),字符串由A,C,G,T组成,字符串最长为10000,随机产生。重复的片段最小是10个符串。输出:
位置,大小,和片段。如下:
#String:
#TAAAAACTCGGGGT AAAAACTCGGGGAAAA
#Repeat:
#Repeat: AAAAACTCGGGG, Size: 12, Start Positions: 2, 15
#解释:如这就就有两个重复(空格格开了):T AAAAACTCGGGG T AAAAACTCGGGG AAAA
#这两个重复位置分别在字符串的2和15位,大小为12
#识别所有的反向重复,如TAACCG =>; GCCAAT,也就是前面的反过来就是后面的字符串。和上面一样,字符串最长为10000,随机产生,最小反
向重复片段为10,输出位置,大小,和片段,如下:
#String:
#CAAAAACGAGGGGTTTGGGGAGCAAAAA
#Inverted Repeat:
#Inverted Repeat: AAAAACGAGGGG, Size: 12, Start Positions: 17, 2
#解释:如上面,C AAAAACGAGGGG TTT GGGGAGCAAAAA
#AAAAACGAGGGG和GGGGAGCAAAAA分别是反向重复,分别在2和17位上,大小为12。
#1) 查找最长重复匹配串,先到先得
#2) 匹配串不能重叠
#3) 每个匹配串必须包括全部 ACGT 四个字符。(这是才是有效的基因片断),所以 AAAAAAAAAAAAAAAAAAAAAAAAAAAA 不算重复匹配串
#测试 >= 10 的就可以了.
#为什么重复的是AAAAACTCGGGG
#而不是TAAAAACTCGGGG
#T怎么没算进去?
#把>10的且有匹配的都输出来,所以对于这种序列来说,就是从10开始以1长步长增加,所以输出的重复序列彼此重叠。就拿我的数据AAAATTTT
CCCCGGGGAAAATTTTCCCCGGGG来说,实际重复的应该只是一条序列,那就是AAAATTTTCCCCGGGG,因为这条是最大的,且没有和他重复的序列。
#我当初为了大家看起来方便自己改的例.
#关于位置的重复, 有时和你的取舍有关:
#AAACCCGGGTTTATTAAACCCGGGTGGGCCCGGGTTTA
#有两个符合条件的:
#AAACCCGGGT Length=10 Position=1;16
#CCCGGGTTTA Length=10 Position=4;29
#你选第一个,先到先得?
#匹配可重复吗?一个字段已经在前面被匹配过了,还可作为新字段的一部分被匹配?答案:不能!
#!/bin/awk -f
#
# A script can be used to check any repeat pieces of nucleotide sequences.
#
# Repeat Match Usage:: $0 datafile
# Reverted Repeat Match Usage:: $0 -v r=1 datafile
#
# function is_overlap : check if a string (position=p, length=l) is overlap with
# matched strings which are stored in array record (position=i, length=record[i])
function is_overlap(p, l) {
e = p + l - 1
for (i in record) {
a = i + record[i] - 1
if (( i >= p && i <= e ) || ( a >= p && a <= e ) || ( p >= i && p <= a ) || ( e >= i && e <= a ))
return 1
}
return 0
}
# flag=0 find the longest matched string and set STR_MAX
# flag=1 find all other matched strings
function find_string(STR_MIN, STR_MAX, flag) {
for (i in record)
delete record[i]
if ( flag == 1 ) {
if ( r == 1 )
print "---------------Reverted Repeat Match Line# "NR" -----------------\n"
else
print "------------------Repeat Match Line# "NR" --------------------\n"
}
for ( Str_Len=STR_MAX; Str_Len >= STR_MIN; Str_Len -- ) {
for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
if ( is_overlap(Position,Str_Len) == 1 )
continue
count=0
pos=Position
offset=Position + Str_Len - 1
left=substr($0,Position,Str_Len)
if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
continue
right=substr($0, Position + Str_Len)
# Reverse string left. The start position in reverse string is L -Position - Str_Len + 2
if ( r == 1 ) {
old_left=left
left=substr(REVERSE_STRING, L - Position - Str_Len + 2, Str_Len)
}
while ( Str_Len <= length(right) ) {
i=index(right,left)
if ( i > 0 ) {
if ( flag == 0 )
return 1
j=offset + i
if ( is_overlap(j,Str_Len) == 0 ) {
count ++
record[Position]=Str_Len
record[j]=Str_Len
pos=pos","j
}
right=substr(right, i + Str_Len)
offset+=(i + Str_Len - 1)
}
else
break
}
if (count > 0 && flag == 1) {
match_number[Str_Len] ++
if (r == 1)
print "Reverted Repeat: " old_left",", "Size: "Str_Len",", "Start Positions: "pos
else
print "Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
}
}
}
if ( flag == 0 )
return 0
}
{
L=length($0)
if ( r == 1 ) {
REVERSE_STRING=""
# split($0, aaa, "") : Use null string "" to split every characters into array aaa
split($0, aaa, "")
for (i=L; i >= 1; i -- ) {
REVERSE_STRING=REVERSE_STRING""aaa[i]
delete aaa[i]
}
}
STR_MIN=10
STR_MAX=0
# Find max matched string and set STR_MAX by binary search algorithm
low=STR_MIN - 1
high=int(L / 2) + 2
while ( low < high ) {
if ( (high - low) == 1 ) {
if ( low >= STR_MIN )
STR_MAX=low
break
}
mid=int((high + low) / 2)
status=find_string(mid, mid, 0)
if (status)
low=mid
else
high=mid
}
if ( STR_MAX >= STR_MIN )
find_string(STR_MIN, STR_MAX, 1)
}
=====================================================
以上来自论坛牛人lightspeed写的,我还没认真分析过,这个关键要看懂问题。呵呵,先贴着留以后分析之。
原帖
Lightspeed肯定是做开发的,可惜这牛人不来论坛了,o(╯□╰)o。只能仰望之!