awk: URL statistics for 科瓦齿科 (kewacike) - linux_kaige - ChinaUnix blog

Category: Python/Ruby

2012-01-10 10:59:16

1. Export from the database every URL recorded for clicks on the 科瓦齿科 (kewacike) page:

select t.parenturl

from tstatisticscount_new t

where t.sourceurl = 'http://***.***.71.228/ad/fy/kewacike.html'

and t.parenturl is not null and to_char(t.daddtime, 'YYYY-MM-DD') >= '2011-12-23' and to_char(t.daddtime, 'YYYY-MM-DD') < '2011-12-31';

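Export the result as CSV.

2. Convert the CSV to txt (I deliberately exported CSV first and then converted it to txt):

Save it as txt from Excel on the local machine.

3. Upload it to the server with rz:

[yangkai@admin ~]$ rz

// choose the file

4. Process it: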

[yangkai@admin ~]$ head yachi.txt

PARENTURL

[yangkai@admin ~]$ awk -F'[//,/]' '{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr |head

1969 http: weibo.com

944 http: passport.sohu.com

557 http: blog.sina.com.cn

181 http: v.ifeng.com

107 http: news.163.com

105 http:

87 http: v.qq.com

75 http: news.ifeng.com

66 http: pic.yule.sohu.com

64 http: news.sina.com.cn

[yangkai@admin ~]$

[yangkai@admin ~]$ awk -F'[//,/]' 'BEGIN{OFS="/"}{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr |head

1969 http://weibo.com

944 http://passport.sohu.com

557 http://blog.sina.com.cn

181 http://v.ifeng.com

107 http://news.163.com

105 http://

[yangkai@admin ~]$ awk -F'[//,/]' 'BEGIN{OFS="/"}{$1=$1;print $1,$2,$3}' yachi.txt | sort | uniq -c | sort -nr >yachiurl.txt

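Note:

With / as the FS, the two slashes of // enclose an empty field, so with the default OFS (a space) an extra space is printed between them.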
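The extra space comes from an empty field between the two slashes, which is easy to see on a single line. The following is a minimal sketch rather than one of the original commands: the sample URL and the helper name show_fields are made up for illustration, and the separator is written as [/,], which matches the same characters as [//,/].

```shell
# show_fields: split one line on "/" or "," and print each field in brackets.
# (hypothetical helper, for illustration only)
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[/,]' '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

# The "//" after "http:" produces an empty second field.
show_fields 'http://weibo.com/u/123'
# prints: [http:][][weibo.com][u][123]

# With OFS="/" the empty field puts the two slashes back together.
printf '%s\n' 'http://weibo.com/u/123' |
  awk -F'[/,]' 'BEGIN { OFS = "/" } { print $1, $2, $3 }'
# prints: http://weibo.com
```

Printing $1, $2, $3 with OFS="/" therefore rebuilds the scheme and host and drops the path.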

5. Next, find the primary (first-level) domains:

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt | wc -l

166

[yangkai@admin ~]$ wc -l yachiurl.txt

166 yachiurl.txt

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt |sort | uniq -c >yijiurl.txt

[yangkai@admin ~]$ cat yijiurl.txt

1 .

2 126.com

48 163.com

16 com.cn

12 ifeng.com

1 live.com

39 qq.com

1 reweibo.com

23 sohu.com

21 weibo.com

1 Weibo.com

1 yahoo.com

[yangkai@admin ~]$

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<5){print $(NF-1),$(NF)}else if(NF>4){print $(NF-1),$(NF)}}' yachiurl.txt |sort | uniq -c | sort -nr

48 163.com

39 qq.com

23 sohu.com

21 weibo.com

16 com.cn

12 ifeng.com

2 126.com

1 yahoo.com

1 Weibo.com

1 reweibo.com

1 live.com

1 . # this is the column-name (header) row from my original Oracle export.

[yangkai@admin ~]$
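The stray "1 ." line comes from the PARENTURL header row. One way to keep it out of the counts, sketched here on made-up sample rows with a hypothetical helper name, is to skip the first input line with NR > 1 when extracting the hosts.

```shell
# hosts_without_header: read URL lines on stdin, skip the header row,
# and print scheme://host for every remaining line. (hypothetical helper)
hosts_without_header() {
  awk -F'[/,]' 'BEGIN { OFS = "/" } NR > 1 { print $1, $2, $3 }'
}

# Sample input standing in for yachi.txt (made-up rows, header first).
printf '%s\n' 'PARENTURL' 'http://weibo.com/a' 'http://weibo.com/b' 'http://v.qq.com/c' |
  hosts_without_header | sort | uniq -c | sort -nr
```

This assumes the header is always the first line of the file; if the export has no header, NR > 1 would drop a real row.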

# The result above is wrong because some domains end in .com.cn, so:

[yangkai@admin ~]$ awk -F'[.,/]' '{print NF}' yachiurl.txt

# output omitted

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF<7){print $(NF-1),$(NF)}else if(NF=7){print $(NF-2),$(NF-1),$(NF)}}' yachiurl.txt | sort |uniq -c | sort -nr

45 163.com

36 qq.com

21 weibo.com

19 sohu.com

12 ifeng.com

9 sina.com.cn

4 com.cn

3 t.qq.com

3 news.sohu.com

3 blog.163.com

2 news.sina.com

2 126.com

1 yahoo.com

1 Weibo.com

1 reweibo.com

1 live.com

1 it.sohu.com

1 finance.sina.com

[yangkai@admin ~]$

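Still not right. For one thing, NF=7 in the command above is an assignment rather than the comparison NF==7, so the condition is always true and longer records are truncated to seven fields (hence entries such as news.sina.com). For another, there are URLs like the ones below, so finding the right pattern matters, and that gets tedious when there are many files.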

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{$1=$1;if(NF==6&&$0~/cn/){print}}' yachiurl.txt

557 http:..blog.sina.com.cn

64 http:..news.sina.com.cn

58 http:..finance.sina.com.cn

2 http:..owecn.blog.163.com

2 http:..mail.cn.yahoo.com

1 http:..mail.sina.com.cn

[yangkai@admin ~]$
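One detail of the earlier command with if(NF<7)...else if(NF=7) is worth isolating: NF=7 inside if() is an assignment, not a comparison. The sketch below uses a made-up host that splits into eight fields to show the difference; both helper names are hypothetical.

```shell
# With "=": NF is set to 7, the condition is always true, and the record
# is truncated to its first seven fields before printing.
last3_assign() {
  printf '%s\n' "$1" |
    awk -F'[.,/]' 'BEGIN { OFS = "." } { if (NF = 7) print $(NF-2), $(NF-1), $NF }'
}

# With ">=": NF is only tested, so the real last three fields are printed.
last3_compare() {
  printf '%s\n' "$1" |
    awk -F'[.,/]' 'BEGIN { OFS = "." } { if (NF >= 7) print $(NF-2), $(NF-1), $NF }'
}

last3_assign  'http://a.b.news.sina.com.cn'   # prints: news.sina.com
last3_compare 'http://a.b.news.sina.com.cn'   # prints: sina.com.cn
```

The truncated form is consistent with entries such as news.sina.com and finance.sina.com in that command's output.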

# Alternatively, the following approach works better:

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{if($0~/.cn/){print $(NF-2),$(NF-1),$(NF)}else {print $(NF-1),$(NF)}}' yachiurl.txt |sort |uniq -c |sort -nr

47 163.com

39 qq.com

23 sohu.com

21 weibo.com

16 sina.com.cn

12 ifeng.com

2 126.com

1 Weibo.com

1 reweibo.com

1 live.com

1 cn.yahoo.com

1 blog.163.com

[yangkai@admin ~]$ awk -F'[.,/]' 'BEGIN{OFS="."}{if($0~/.cn/){print $(NF-2),$(NF-1),$(NF)}else {print $(NF-1),$(NF)}}' yachiurl.txt |sort |uniq -c |sort -nr |sed 's/\.\./\/\//g'
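The pattern /.cn/ matches any character followed by cn anywhere in the line, which is why owecn.blog.163.com and mail.cn.yahoo.com show up as blog.163.com and cn.yahoo.com above. A stricter variant, sketched here as an assumption rather than as the method used in this post, tests only the last field. The helper name and the sample hosts are made up.

```shell
# base_domain: print the last three labels when the host ends in .cn
# (e.g. sina.com.cn), otherwise the last two. (hypothetical helper)
base_domain() {
  awk -F'[.,/]' 'BEGIN { OFS = "." } {
    if ($NF == "cn") print $(NF-2), $(NF-1), $NF
    else             print $(NF-1), $NF
  }'
}

printf '%s\n' 'http://blog.sina.com.cn' 'http://owecn.blog.163.com' 'http://mail.cn.yahoo.com' |
  base_domain
# prints:
# sina.com.cn
# 163.com
# yahoo.com
```

This still assumes every .cn host has the form name.com.cn; a host registered directly under .cn would need its own rule.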

Note: [yangkai@admin ~]$ uniq --help

Usage: uniq [OPTION]... [INPUT [OUTPUT]]

Discard all but one of successive identical lines from INPUT (or

standard input), writing to OUTPUT (or standard output).

Mandatory arguments to long options are mandatory for short options too.

-c, --count prefix lines by the number of occurrences

-d, --repeated only print duplicate lines

-D, --all-repeated[=delimit-method] print all duplicate lines

delimit-method={none(default),prepend,separate}

Delimiting is done with blank lines.

-f, --skip-fields=N avoid comparing the first N fields

-i, --ignore-case ignore differences in case when comparing

-s, --skip-chars=N avoid comparing the first N characters

-u, --unique only print unique lines

-w, --check-chars=N compare no more than N characters in lines

--help display this help and exit

--version output version information and exit
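As the help text says, uniq only merges successive identical lines, which is why every pipeline above runs sort first. The sketch below, on made-up input, also folds case with tolower so that Weibo.com and weibo.com are counted together; that normalization is an assumption added here, not something the original commands do, and the helper name is hypothetical.

```shell
# count_domains: lowercase each line, then count identical lines,
# most frequent first. (hypothetical helper)
count_domains() {
  awk '{ print tolower($0) }' | sort | uniq -c | sort -nr
}

# Without sort, uniq -c sees three separate runs (weibo.com, qq.com, weibo.com)
# and reports each run on its own line.
printf '%s\n' 'weibo.com' 'qq.com' 'weibo.com' | uniq -c

# With lowercasing and sort, all weibo.com spellings collapse into one count.
printf '%s\n' 'weibo.com' 'qq.com' 'Weibo.com' 'weibo.com' | count_domains
```

uniq -i would also ignore case when comparing, but it keeps the spelling of whichever line comes first in each run, so lowercasing up front gives a more predictable label.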

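6. Closing remarks

Key points of this post: using the field separator (FS) and the output field separator (OFS), using if, and finding the pattern in the data.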
