1. Export all the clicked URLs for the Kewa Dental (科瓦齿科) ad page from the database:
select t.parenturl
from tstatisticscount_new t
where t.sourceurl = 'http://***.***.71.228/ad/fy/kewacike.html'
and t.parenturl is not null
and to_char(t.daddtime, 'YYYY-MM-DD') >= '2011-12-23'
and to_char(t.daddtime, 'YYYY-MM-DD') < '2011-12-31';
Export the result as CSV.
2. Convert the CSV to TXT (exported as CSV on purpose, then converted to TXT):
Use "Save As" in Excel locally.
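As an alternative to the Excel round-trip, the conversion can be done on the server itself. A sketch, assuming the exported file is named kewa.csv (a hypothetical name, not from the original workflow) and carries Windows CRLF line endings:

```shell
# Strip carriage returns so awk/sort/uniq see clean Unix lines.
# kewa.csv is an assumed filename; the single-column export makes
# the CSV effectively plain text already.
tr -d '\r' < kewa.csv > yachi.txt
```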
3. Upload it to the server with rz:
[yangkai@admin ~]$ rz
// select the file to upload
4. Process the data:
[yangkai@admin ~]$ head yachi.txt
PARENTURL
[yangkai@admin ~]$ awk -F'[//,/]' '{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr |head
1969 http: weibo.com
944 http: passport.sohu.com
557 http: blog.sina.com.cn
181 http: v.ifeng.com
107 http: news.163.com
105 http:
87 http: v.qq.com
75 http: news.ifeng.com
66 http: pic.yule.sohu.com
64 http: news.sina.com.cn
[yangkai@admin ~]$
[yangkai@admin ~]$ awk -F'[//,/]' 'BEGIN{OFS="/"}{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr |head
1969 http://weibo.com
944 http://passport.sohu.com
557 http://blog.sina.com.cn
181 http://v.ifeng.com
107 http://news.163.com
105 http://
87 http://v.qq.com
75 http://news.ifeng.com
66 http://pic.yule.sohu.com
64 http://news.sina.com.cn
[yangkai@admin ~]$ awk -F'[//,/]' 'BEGIN{OFS="/"}{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr >yachiurl.txt
Note:
With / as part of the field separator, the // in each URL produces an empty field between the two slashes. In the first command that empty field shows up as an extra space between http: and the hostname; with OFS="/" it reassembles into // again.
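The empty-field behavior can be seen in isolation. A minimal demo, assuming GNU awk and a made-up sample URL:

```shell
# FS '[/,]' splits "http://weibo.com/u/123" into "http:", "", "weibo.com", ...
# With the default OFS (a space), the empty $2 becomes a double space:
echo 'http://weibo.com/u/123' | awk -F'[/,]' '{print $1, $2, $3}'
# -> http:  weibo.com
# With OFS="/", the empty field puts the "//" back:
echo 'http://weibo.com/u/123' | awk -F'[/,]' 'BEGIN{OFS="/"}{print $1, $2, $3}'
# -> http://weibo.com
```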
5. Next, extract the first-level domains:
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt | wc -l
166
[yangkai@admin ~]$ wc -l yachiurl.txt
166 yachiurl.txt
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt |sort | uniq -c >yijiurl.txt
[yangkai@admin ~]$ cat yijiurl.txt
1 .
2 126.com
48 163.com
16 com.cn
12 ifeng.com
1 live.com
39 qq.com
1 reweibo.com
23 sohu.com
21 weibo.com
1 Weibo.com
1 yahoo.com
[yangkai@admin ~]$
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt |sort | uniq -c | sort -nr
48 163.com
39 qq.com
23 sohu.com
21 weibo.com
16 com.cn
12 ifeng.com
2 126.com
1 yahoo.com
1 Weibo.com
1 reweibo.com
1 live.com
1 . # this is the PARENTURL column-header line from the original Oracle export.
[yangkai@admin ~]$
# The result above is wrong, because some domains end in .com.cn, so:
[yangkai@admin ~]$ awk -F'[.,/]' '{print NF}' yachiurl.txt
# output omitted
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<7){print $(NF-1),$(NF)}else if(NF=7){print $(NF-2),$(NF-1),$(NF)}}' yachiurl.txt | sort |uniq -c | sort -nr
45 163.com
36 qq.com
21 weibo.com
19 sohu.com
12 ifeng.com
9 sina.com.cn
4 com.cn
3 t.qq.com
3 news.sohu.com
3 blog.163.com
2 news.sina.com
2 126.com
1 yahoo.com
1 Weibo.com
1 reweibo.com
1 live.com
1 it.sohu.com
1 finance.sina.com
[yangkai@admin ~]$
Still not right, because there are URLs like the ones matched below. Identifying the pattern accurately matters, and it gets tedious when the files are large. (Note also that NF=7 in the command above is an assignment, not the comparison NF==7.)
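The NF=7 pitfall can be demonstrated directly. A small sketch, assuming GNU awk:

```shell
# NF=7 ASSIGNS 7 to NF: the record is truncated to 7 fields and the
# condition is always true (7 is nonzero), so every line matches.
echo 'a.b.c.d.e.f.g.h' | awk -F'.' '{if (NF=7) print NF, $0}'
# -> 7 a b c d e f g   ($0 is rebuilt with the default OFS, a space)
# NF==7 is the comparison that was presumably intended:
echo 'a.b.c.d.e.f.g.h' | awk -F'.' '{if (NF==7) print "match"}'
# (prints nothing: the line has 8 fields)
```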
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF==6&&$0~/cn/){print}}' yachiurl.txt
557 http:..blog.sina.com.cn
64 http:..news.sina.com.cn
58 http:..finance.sina.com.cn
2 http:..owecn.blog.163.com
2 http:..mail.cn.yahoo.com
1 http:..mail.sina.com.cn
[yangkai@admin ~]$
# Or, better, use the following approach:
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{if($0~/.cn/){print $(NF-2),$(NF-1),$(NF)}else {print $(NF-1),$(NF)}}' yachiurl.txt |sort |uniq -c |sort -nr
47 163.com
39 qq.com
23 sohu.com
21 weibo.com
16 sina.com.cn
12 ifeng.com
2 126.com
1 Weibo.com
1 reweibo.com
1 live.com
1 cn.yahoo.com
1 blog.163.com
[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{if($0~/.cn/){print $(NF-2),$(NF-1),$(NF)}else {print $(NF-1),$(NF)}}' yachiurl.txt |sort |uniq -c |sort -nr |sed 's/\.\./\/\//g'
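The whole aggregation could also run in one pass over the raw URL list. A sketch, assuming yachi.txt holds one URL per line and that .com.cn is the only multi-label suffix present in this data set:

```shell
# Take the host part ($3 when splitting on "/"), then keep the last two
# dot-separated labels, or the last three for names ending in .com.cn.
awk -F'/' 'NF>=3 {print $3}' yachi.txt |
awk -F'.' 'NF>=2 {
    if ($0 ~ /\.com\.cn$/ && NF >= 3)
        print $(NF-2)"."$(NF-1)"."$NF
    else
        print $(NF-1)"."$NF
}' | sort | uniq -c | sort -nr
```

This skips the intermediate yachiurl.txt and drops the PARENTURL header automatically, since that line has fewer than three /-separated fields.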
Note: [yangkai@admin ~]$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Discard all but one of successive identical lines from INPUT (or
standard input), writing to OUTPUT (or standard output).
Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines
-D, --all-repeated[=delimit-method] print all duplicate lines
delimit-method={none(default),prepend,separate}
Delimiting is done with blank lines.
-f, --skip-fields=N avoid comparing the first N fields
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit
6. Conclusion
Key points of this article: use of the field separator (FS) and output field separator (OFS), use of if statements, and finding the right pattern in the data.