How to Extract Content from HTML Files with htmlq - 梦共里醉 - ChinaUnix Blog

How to Extract Content from HTML Files with htmlq

Category: LINUX

2022-12-08 02:06:15

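htmlq lets you run sed- or grep-like operations on HTML data: you can use it to search, slice, and filter HTML. Let's look at how to install and use this handy tool on Linux or Unix to process HTML data.

What is htmlq?

htmlq is like jq, but for HTML. It uses CSS selectors to extract parts of an HTML file. In CSS, selectors target the HTML elements on a web page that we want to style. With this tool we can, for example, easily extract image URLs or other links.

Installing htmlq

First install cargo on the system, then use cargo to install htmlq: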
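As an illustration of the image-URL case, here is a minimal sketch that runs against a local file instead of a live site. The sample markup and the variable name sample_img are made up for this example; the --attribute option is the one listed in the htmlq help output, and the htmlq call is skipped if the tool is not installed yet.

```shell
# Build a tiny sample page; the markup is hypothetical.
sample_img="$(mktemp)"
cat > "$sample_img" <<'EOF'
<html>
  <body>
    <img src="/images/logo.png" alt="Logo">
    <img src="https://example.com/photo.jpg" alt="Photo">
    <p>No image here.</p>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # The CSS selector "img" picks every <img> element;
    # --attribute src prints only the src attribute of each match.
    htmlq --attribute src img < "$sample_img"
else
    echo "htmlq is not installed yet; see the installation steps"
fi
```

The same pattern should apply to other URL-bearing elements by swapping the selector and the attribute name.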

[root@localhost ~]# yum -y install cargo
[root@localhost ~]# cargo install htmlq

Setting the executable path

Make sure $HOME/.cargo/bin is added to the PATH variable with export so that the installed binary can be run:

[root@localhost ~]# echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> ~/.bash_profile 
[root@localhost ~]# . ~/.bash_profile
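To confirm that the PATH change took effect, a quick check along these lines may help. It is only a sketch: it assumes cargo's default install directory, $HOME/.cargo/bin, and the variable names cargo_bin and path_status are illustrative.

```shell
# Assumed default cargo install directory; adjust if yours differs.
cargo_bin="$HOME/.cargo/bin"

# Check whether that directory is already listed in PATH.
case ":$PATH:" in
    *":$cargo_bin:"*) path_status="present" ;;
    *)                path_status="missing" ;;
esac
echo "$cargo_bin in PATH: $path_status"

# Report where the shell finds htmlq, if it can find it at all.
if command -v htmlq >/dev/null 2>&1; then
    echo "htmlq found at: $(command -v htmlq)"
else
    echo "htmlq not found; re-check the cargo install step and PATH"
fi
```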

How do you use htmlq to extract content from an HTML file?

Typical usage combines curl with htmlq:

curl -s url | htmlq '#css-selector'
curl -s url2 | htmlq '.css-selector'
curl -s url | htmlq --pretty '#content' | more
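The same id and class selectors can be tried offline by feeding htmlq a local file on stdin. The following is a minimal sketch; the sample page, its ids and classes, and the variable name sample_page are all invented for illustration.

```shell
# Create a small sample page to query; ids and classes are hypothetical.
sample_page="$(mktemp)"
cat > "$sample_page" <<'EOF'
<html>
  <body>
    <div id="content">
      <h1 class="title">Hello htmlq</h1>
      <p class="intro">First paragraph.</p>
    </div>
    <div id="footer"><p>Footer text.</p></div>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # '#content' selects the element whose id is "content".
    htmlq '#content' < "$sample_page"
    # '.intro' selects every element carrying the class "intro".
    htmlq '.intro' < "$sample_page"
else
    echo "htmlq is not installed; skipping the selector examples"
fi
```

Working from a saved file like this makes it easier to refine a selector before pointing curl at a real site.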

Let's find all the links on a page. For example:

[root@localhost ~]# curl -s url | htmlq --attribute href a
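A page often links to the same target more than once, so it can be handy to pipe the extracted links through sort. This is a sketch against a made-up local file (the variable name sample_links and all URLs are hypothetical); it relies only on the --attribute option from the help output plus the standard sort utility.

```shell
# Sample page with a repeated link; all URLs are placeholders.
sample_links="$(mktemp)"
cat > "$sample_links" <<'EOF'
<ul>
  <li><a href="https://example.com/docs">Docs</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="https://example.com/docs">Docs again</a></li>
</ul>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Print the href of every <a> element, then drop duplicate lines.
    htmlq --attribute href a < "$sample_links" | sort -u
else
    echo "htmlq is not installed; skipping the link example"
fi
```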

Pretty-print the HTML:

[root@localhost ~]# curl --silent url | htmlq --pretty '#posts'
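When only the text matters, the --text flag from the help output can be used in place of --pretty. The sketch below runs both against a made-up local file; the #posts markup and the variable name sample_posts are assumptions for illustration.

```shell
# Sample page with a "posts" container; the markup is hypothetical.
sample_posts="$(mktemp)"
cat > "$sample_posts" <<'EOF'
<div id="posts"><article><h2>First post</h2><p>Body one.</p></article><article><h2>Second post</h2><p>Body two.</p></article></div>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Re-indent the matched element for easier reading.
    htmlq --pretty '#posts' < "$sample_posts"
    # Print only the text nodes inside each post heading.
    htmlq --text '#posts h2' < "$sample_posts"
else
    echo "htmlq is not installed; skipping the pretty-print example"
fi
```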


Help

Use the following command to view the help page:

[root@localhost ~]# htmlq --help
htmlq 0.3.0
Michael Maclean 
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] [selector]...

FLAGS:
    -B, --detect-base          Try to detect the base URL from the <base> tag in the document. If not found, default to
                               the value of --base, if supplied
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute <attribute>    Only return this attribute (if present) from selected elements
    -b, --base <base>              Use this URL as the base for links
    -f, --filename <FILE>          The input file. Defaults to stdin
    -o, --output <FILE>            The output file. Defaults to stdout

ARGS:
    <selector>...    The CSS expression to select [default: html]
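Several of these options can be combined so that no pipe or redirection is needed. The following is a sketch using only flags shown in the help output above (--filename, --output, --text, --ignore-whitespace); the file contents and the variable names page_in and text_out are hypothetical.

```shell
# Input page and output file; names and contents are made up.
page_in="$(mktemp)"
text_out="$(mktemp)"
cat > "$page_in" <<'EOF'
<html>
  <body>
    <p>First paragraph.</p>
    <p>   </p>
    <p>Second paragraph.</p>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Read from a file instead of stdin, write to a file instead of stdout,
    # keep only text nodes, and skip those that are purely whitespace.
    htmlq --filename "$page_in" --output "$text_out" \
          --text --ignore-whitespace p
    cat "$text_out"
else
    echo "htmlq is not installed; skipping the file-based example"
fi
```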


Summary

htmlq lets you run sed- or grep-like operations on HTML data. You can use it to search, slice, and filter HTML.
