2010-12-03 17:23:50

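Requirement: a senior labmate has 10 files to process. One file is named new* and the other nine are named last*. Every record in new* has to be compared against the records in each of the nine last* files. Whenever two records overlap, the match is saved to a separate file. The saved information consists of several fields plus a similarity score, which has to be computed.
The tricky part is deciding whether two intervals overlap:
Interval A = (LA, HA), interval B = (LB, HB)
A straightforward way to express the logic is: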
if(LA > HB){
    NO Overlap;
}elsif(HA < LB){
    NO Overlap;
}else{
    Overlapped;
    if(LA < LB){
        if(HA < HB){
            Len_Overlap = HA - LB;
        }else{
            Len_Overlap = HB - LB;
        }
    }else{
        if(HA < HB){
            Len_Overlap = HA - LA;
        }else{
            Len_Overlap = HB - LA;
        }
    }
}
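
The four length cases above collapse into a single expression: once the intervals are known to overlap, the overlap length is min(HA, HB) - max(LA, LB). Here is a minimal Perl sketch of that idea, assuming LA <= HA and LB <= HB. The name overlap_len and the -1 return value for "no overlap" are hypothetical choices for illustration and do not appear in the original script.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: returns the overlap length of (LA, HA) and (LB, HB),
# or -1 when the intervals do not overlap at all.
# Touching intervals (e.g. HA == LB) count as overlapping with length 0,
# which matches the comparison operators used in the pseudocode.
sub overlap_len {
    my ($la, $ha, $lb, $hb) = @_;
    return -1 if $la > $hb || $ha < $lb;    # the two "NO Overlap" cases
    return min($ha, $hb) - max($la, $lb);   # covers all four length cases
}

print overlap_len(10, 20, 15, 30), "\n";    # partial overlap
print overlap_len(10, 40, 15, 30), "\n";    # B lies inside A
print overlap_len(10, 20, 25, 30), "\n";    # disjoint
```

The three calls cover a partial overlap, a nested interval and a disjoint pair, so each branch of the pseudocode can be checked by hand against the helper.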

The source code is as follows:

#!/usr/bin/perl
#Name: Transcript Similarity Calc
#Author: Chaoyong Xie
my $newfile = $ARGV[0];
my $lastfile = $ARGV[1];
if(defined($newfile) && defined($lastfile)){
  open(my $new_fh, '<', $newfile) or die "Cannot read $newfile:$!";
  open(my $last_fh, '<', $lastfile) or die "Cannot read $lastfile:$!";
  # Read both files into memory, one record per array element
  my @records_new = <$new_fh>; chomp @records_new;
  my @records_last = <$last_fh>; chomp @records_last;
  #print "Both files Loaded!\n";
  close($new_fh) or die "Cannot close $newfile:$!";
  close($last_fh) or die "Cannot close $lastfile:$!";
  my @fields_new = ();
  my @fields_last = ();
  foreach my $record_new (@records_new){
    @fields_new = split(/\t/,$record_new);
    foreach my $record_last (@records_last){
      @fields_last = split(/\t/,$record_last);
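      # Only compare records whose fields 0 and 5 are identical.
      # Field 1 is the record start, field 9 the block count,
      # field 10 the comma-separated block lengths and
      # field 11 the comma-separated block offsets relative to field 1.
      if($fields_new[0] eq  $fields_last[0] && $fields_new[5] eq $fields_last[5]){
        my @offset_new = split(/,/,$fields_new[11]);
        my @length_new = split(/,/,$fields_new[10]);
        my @offset_last = split(/,/,$fields_last[11]);
        my @length_last = split(/,/,$fields_last[10]);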
        my $len_overlap = 0;
        my $len_sum = 0;
        # Compare every block of the new record with every block of the last record
        for(my $i = 0; $i < $fields_new[9]; $i ++){
          for(my $j = 0; $j < $fields_last[9]; $j ++){
            if($fields_last[1] + $offset_last[$j] > $fields_new[1] + $offset_new[$i] + $length_new[$i]){
               next;
            }else{
               if($fields_last[1] + $offset_last[$j] + $length_last[$j] < $fields_new[1] + $offset_new[$i]){
                  next;
               }else{
                  if($fields_last[1] + $offset_last[$j] < $fields_new[1] + $offset_new[$i] ){
                     if($fields_last[1] + $offset_last[$j] + $length_last[$j] < $fields_new[1] + $offset_new[$i] + $length_new[$i]){
                         $len_overlap += $fields_last[1] + $offset_last[$j] + $length_last[$j] - $fields_new[1] -$offset_new[$i];
                     }else{
                         $len_overlap += $length_new[$i];
                     }
                  }else{
                     if($fields_last[1] + $offset_last[$j] + $length_last[$j] < $fields_new[1] + $offset_new[$i] + $length_new[$i]){
                         $len_overlap += $length_last[$j];
                     }else{
                         $len_overlap += $fields_new[1] + $offset_new[$i] + $length_new[$i] - $fields_last[1] - $offset_last[$j];
                     }
                  }
               }
            }
          }
        }

        if($len_overlap > 0){
           $len_sum = 0;
           for(my $t = 0; $t < $fields_new[9]; $t++){
              $len_sum += $length_new[$t];
           }
           # Similarity = overlapped length / total block length of the new record
           my $similarity = $len_overlap / $len_sum;
           print "$fields_new[0]\t$fields_new[1]\t$fields_new[2]\t$fields_new[3]\t",substr($lastfile,-13,9),"\t$fields_last[3]\t$similarity\n";
        }
      }else{
         next;
      }

    }

  }
}else{
  die "There must be 2 parameters!\n";
}
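
The nested comparisons in the script can be condensed with the min/max form of the overlap test. The following is a minimal sketch under stated assumptions, not the author's code: the function name block_similarity and its argument layout are hypothetical. It takes each record's start position, comma-separated block lengths and comma-separated block offsets (fields 1, 10 and 11 in the script), and it loops over the parsed arrays instead of reading the block count from field 9.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: fraction of the new record's total block length
# that is covered by blocks of the last record.
sub block_similarity {
    my ($start_new, $sizes_new, $offsets_new,
        $start_last, $sizes_last, $offsets_last) = @_;
    my @len_new  = split /,/, $sizes_new;
    my @off_new  = split /,/, $offsets_new;
    my @len_last = split /,/, $sizes_last;
    my @off_last = split /,/, $offsets_last;

    my $len_overlap = 0;
    for my $i (0 .. $#len_new) {
        my $la = $start_new + $off_new[$i];      # block start in the new record
        my $ha = $la + $len_new[$i];             # block end in the new record
        for my $j (0 .. $#len_last) {
            my $lb = $start_last + $off_last[$j];
            my $hb = $lb + $len_last[$j];
            my $len = min($ha, $hb) - max($la, $lb);
            $len_overlap += $len if $len > 0;    # disjoint pairs give a negative value
        }
    }

    my $len_sum = 0;
    $len_sum += $_ for @len_new;
    return $len_sum > 0 ? $len_overlap / $len_sum : 0;
}

# New record: blocks 100..110 and 150..170; last record: blocks 105..115 and 145..175
print block_similarity(100, "10,20", "0,50", 105, "10,30", "0,40"), "\n";
```

In the example call, the first pair of blocks shares 5 positions and the second pair shares 20, so 25 of the new record's 30 block positions are covered. Returning 0 when the new record has no block length is a design choice of this sketch, made to avoid a division by zero.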
