Category: Python/Ruby

2016-04-07 22:01:39

Beautiful Soup 4 Documentation

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Installing Beautiful Soup

Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in older versions.

Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle. First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. You can pass "html", "xml" or "html5" to specify which type of markup you want to parse, or you can name the parser library to use: "lxml", "html5lib" or "html.parser".
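As a minimal sketch of the constructor, the snippet below parses the same markup from a string and from a file-like object. The sample markup and variable names are made up for illustration, and "html.parser" is chosen only because it needs no extra install.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Caf&eacute;</p></body></html>"

# Parse from a string, naming the parser library explicitly.
soup = BeautifulSoup(html, "html.parser")

# Parse from an open file-like object (anything with a .read() method).
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

title_text = soup.title.string        # 'Demo'
paragraph_text = soup.p.get_text()    # 'Café': the entity became a Unicode character
```

Naming the parser explicitly keeps the result the same across machines that have different parser libraries installed.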

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: BeautifulSoup, Tag, NavigableString and Comment.

Tag

A Tag object corresponds to an XML or HTML tag in the original document. It has many attributes and methods; the two most important features are its name and its attributes.

Name: Every tag has a name, accessible as .name. If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup.

Attributes: A tag may have any number of attributes. You can access a tag's attributes by treating the tag like a dictionary, e.g. tag['id']. You can also access that dictionary directly as .attrs. You can add, remove (with del) and modify a tag's attributes, again by treating the tag as a dictionary.

Multi-valued attributes: Beautiful Soup presents the value(s) of a multi-valued attribute, such as class, as a list. If an attribute looks like it has more than one value (separated by whitespace) but is not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone. When parsing XML, there are no multi-valued attributes.
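The dictionary-style access described above might look like the following sketch. The markup and the attribute values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document for illustration.
soup = BeautifulSoup('<b id="boldest" class="big bold">text</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag["id"] = "quote-1"        # modify an existing attribute
tag["lang"] = "en"           # add a new attribute
del tag["lang"]              # remove it again

classes = tag["class"]       # ['big', 'bold']: class is multi-valued, so it is a list
all_attrs = tag.attrs        # the underlying attribute dictionary
markup = str(tag)            # the generated markup reflects the new name and id
```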

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text. You can convert a NavigableString to a normal Python Unicode string with unicode() (str() in Python 3), and you should do so if you want to use the string outside of Beautiful Soup. Otherwise, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
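A short sketch of that conversion, assuming Python 3 (where str() plays the role of unicode()); the markup is illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string    # a NavigableString that still points into the parse tree
plain = str(nav)       # a plain Python string with no reference to the tree
```

Here `nav.parent` is still the `<b>` tag, while `plain` is an ordinary `str` that can outlive the soup.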

BeautifulSoup

The BeautifulSoup object itself represents the document as a whole. Since it doesn't correspond to an actual HTML or XML tag, it has been given a special .name "[document]".

Comments and other special strings

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string.
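To make the last two object kinds concrete, here is a small sketch showing the special name of the BeautifulSoup object and a Comment pulled out of a tag. The markup is made up for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b><!--Hey, buddy--></b>", "html.parser")

document_name = soup.name     # '[document]'
comment = soup.b.string       # the comment is the only child of <b>

is_comment = isinstance(comment, Comment)          # True
is_string = isinstance(comment, NavigableString)   # True: Comment is a subclass
```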

Navigating the tree

Children

BeautifulSoup and Tag objects may contain NavigableStrings and other Tags; these elements are their "children". You access a child tag as if it were a member of the object. For example, if you want the <head> tag in the BeautifulSoup object "soup", just say "soup.head". You can use this trick again and again to zoom in on a certain part of the parse tree.

You can use the .contents and .children attributes to get the direct children of a BeautifulSoup or Tag object. The difference is that .contents is a list while .children is an iterator. The .descendants attribute lets you iterate over all the children recursively: the direct children, the children of the direct children, and so on.

Through the .strings and .stripped_strings generators, you can iterate over the NavigableString objects inside a BeautifulSoup or Tag object and its descendants. Alternatively, you can extract all the text and join it together by calling .get_text(). If a tag has a single NavigableString child, .string returns it; if the tag contains more than one thing, .string is None.
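The attributes in this section can be compared side by side on one small tag. The sketch below uses an invented document; the counts in the comments follow from that markup.

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>Once <b>upon</b> a time</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

p = soup.body.p                             # zoom in by attribute access

n_children = len(p.contents)                # 3: 'Once ', <b>, ' a time'
n_descendants = len(list(p.descendants))    # 4: the same, plus 'upon' inside <b>
pieces = list(p.stripped_strings)           # ['Once', 'upon', 'a time']
text = p.get_text()                         # 'Once upon a time'

title = soup.title.string                   # the single string inside <title>
no_single_string = p.string                 # None, because <p> has several children
```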

Parent

You can iterate over all of an element's parents with the .parents generator, or get only the closest one with .parent. Note that the .parent of a BeautifulSoup object is defined as None.

Sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree, and traverse them all with .next_siblings and .previous_siblings.

Back and Forth

The .next_element attribute points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different: often it is the element's first child instead. The .previous_element attribute and the .next_elements and .previous_elements generators work the same way.
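The difference between moving up, moving sideways and moving in parse order is easier to see on a concrete tree. This is a sketch on a made-up fragment with two sibling links.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = "<div><a id='one'>first</a><a id='two'>second</a></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="one")

parent_name = first.parent.name                  # 'div'
ancestors = [p.name for p in first.parents]      # ['div', '[document]']
sibling_id = first.next_sibling["id"]            # 'two': same level of the tree
next_parsed = first.next_element                 # 'first': the string inside the tag
top_parent = soup.parent                         # None
```

Note that `.next_sibling` skips over the link's own text, while `.next_element` lands on it.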

Searching the tree

The two most popular methods for searching the parse tree are .find() and .find_all(). The other methods take almost exactly the same arguments.

Kinds of filters

Filters are central to the search methods. You can use them to filter on a tag's name, on its attributes, on the text of a string, or on some combination of these.

A string: Pass a string to a search method and Beautiful Soup performs a match against that exact string. If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

A regular expression: If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method.

A list: If you pass in a list, Beautiful Soup will allow a string match against any item in that list.

True: The value True matches everything it can.

A function: You can also filter by defining a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise. Note that if you pass in a function to filter on a specific attribute, the argument passed into the function will be the attribute value, not the whole tag.
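Each kind of filter can be tried against the same document. The sketch below uses an invented page and a hypothetical helper function, has_class_but_no_id; results are listed in document order.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document for illustration.
html = ("<html><head><title>T</title></head><body>"
        "<p class='x'><b>bold</b></p>"
        "<a href='http://example.com/'>link</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

by_string = [t.name for t in soup.find_all("b")]               # ['b']
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # ['body', 'b']
by_list = [t.name for t in soup.find_all(["a", "b"])]          # ['b', 'a']
everything = [t.name for t in soup.find_all(True)]             # every tag, no strings


def has_class_but_no_id(tag):
    """Filter function: receives a whole tag."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.name for t in soup.find_all(has_class_but_no_id)]   # ['p']

# A function used for one attribute receives the attribute value (or None).
by_attr_function = soup.find_all(
    href=lambda value: value is not None and value.startswith("http"))
```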

Searching methods

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters.

name: only consider tags with certain names.

keyword arguments: filter on one or more of a tag's attributes.

attrs: Some attributes, like the "data-*" attributes in HTML 5, have names that can't be used as names of keyword arguments. You can use these attributes in searches by putting them into a dictionary and passing it into find_all() as the attrs argument.

string: search for strings instead of tags. Watch out: when string is used on its own, the function returns a list of NavigableString objects instead of Tags. Combined with arguments that select tags, it returns the tags whose .string matches.

The limit argument specifies the maximum number of results to return. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. This argument is only supported by find_all() and find().
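The arguments of find_all() can be sketched on a small fragment as follows. The markup and the data-foo attribute are invented; the string keyword assumes Beautiful Soup 4.4.0 or later (earlier releases called it text).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = '<div data-foo="value"><p>Elsie</p><p>Lacie</p><p>Tillie</p></div>'
soup = BeautifulSoup(html, "html.parser")

# data-foo cannot be a Python keyword argument, so it goes through attrs.
by_data = soup.find_all(attrs={"data-foo": "value"})

strings = soup.find_all(string="Elsie")                   # NavigableStrings, not tags
tags_by_string = soup.find_all("p", string="Elsie")       # tags whose .string matches
first_two = soup.find_all("p", limit=2)                   # stop after two results

top_level = soup.find_all("p", recursive=False)           # []: <p> is not a direct child of soup
direct = soup.div.find_all("p", recursive=False)          # all three <p> tags
```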

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it's the same as calling find_all() on that object.

Searching methods summary:

find() find_all()
find_parent() find_parents()
find_next_sibling() find_next_siblings()
find_previous_sibling() find_previous_siblings()
find_next() find_all_next()
find_previous() find_all_previous()
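The directional variants in the summary mirror the navigation attributes. A brief sketch on a made-up list, starting from the middle item:

```python
from bs4 import BeautifulSoup

# Hypothetical list for illustration.
html = "<ul><li id='a'>A</li><li id='b'>B</li><li id='c'>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_name = b.find_parent("ul").name                         # 'ul'
next_id = b.find_next_sibling("li")["id"]                      # 'c'
previous_ids = [li["id"] for li in b.find_previous_siblings("li")]   # ['a']
after_id = b.find_next("li")["id"]                             # 'c', found in parse order
before_ids = [li["id"] for li in b.find_all_previous("li")]    # ['a']

# Calling the object itself is the shortcut for find_all().
all_items = soup("li")
```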

Searching by CSS class

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_. Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes. You can also search for the exact string value of the class attribute, but a variant of that string, such as the same classes in a different order, will not match. To match tags that carry two or more classes regardless of order, use a CSS selector.

Beautiful Soup supports the most commonly used CSS selectors. Just pass a string into the .select() method of a Tag or of the BeautifulSoup object itself. The string follows CSS selector syntax.
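A sketch of class-based searching on two invented paragraphs. It assumes .select() is usable in the installed version (recent releases rely on the soupsieve package, which pip normally installs alongside beautifulsoup4).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
html = '<p class="body strikeout">x</p><p class="body">y</p>'
soup = BeautifulSoup(html, "html.parser")

strikeout = soup.find_all("p", class_="strikeout")       # matches any one of the classes
body = soup.find_all("p", class_="body")                 # both paragraphs
reordered = soup.find_all("p", class_="strikeout body")  # []: not the attribute's string value

# A CSS selector matches both classes regardless of their order.
selected = soup.select("p.strikeout.body")
```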

Modifying the tree

You can change Tag objects directly by assigning values to .name, .string and their attributes (treating the tag like a dictionary). Other member variables, such as .contents and .parent, should be modified with caution; assigning to them directly may not work.

Direct assignment is not a convenient way to restructure content. For that, it is better to call methods. For example, you can add strings, NavigableStrings, Comments and Tags (generated by BeautifulSoup.new_tag()) to a BeautifulSoup or Tag object through .append() or .insert(). Other modification methods are:

.insert_before() inserts a tag or string (specified by the argument) immediately before the invoking element.
.insert_after() inserts a tag or string (specified by the argument) immediately after the invoking element.
Tag.clear() removes the contents of a tag.
.extract() removes a tag or string from the parse tree and returns it.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
.replace_with() replaces the invoking element with the specified one and returns the original one.
.wrap() wraps an element in the tag you specify.
Tag.unwrap() removes the invoking tag and returns it, leaving the content unchanged.
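Several of these operations can be chained on one small fragment. The following is a sketch only; the markup, the class value and the URL are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
soup = BeautifulSoup("<p><b>bold</b> text <i>it</i></p>", "html.parser")

# Direct assignment: string, name and attributes.
soup.b.string = "strong"
soup.b.name = "strong"
soup.strong["class"] = "hl"

# Build a new tag and append it to <p>.
link = soup.new_tag("a", href="http://example.com/")
link.string = "link"
soup.p.append(link)

removed = soup.i.extract()    # take <i> out of the tree and keep it
soup.strong.unwrap()          # drop the <strong> tag but keep its text

result = str(soup)
# result == '<p>strong text <a href="http://example.com/">link</a></p>'
```

The extracted `<i>` tag is still a usable object (`removed`), whereas decompose() would have destroyed it.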

Output

If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they will be converted to Unicode characters. The .prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line. By default, bare ampersands and angle brackets are escaped on output: they get turned into "&amp;", "&lt;" and "&gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML. You can change this behavior by providing a value for the formatter argument to .prettify(), .encode() or .decode(). Four possible values are recognized: "minimal" (the default), "html", None and a function of your own.

One last caveat: if you create a CData object, the text inside that object is always presented exactly as it appears, with no formatting.
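The effect of the formatter values described in this section can be compared on one string. A minimal sketch with an invented paragraph that contains both an ampersand and an accented character:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration.
soup = BeautifulSoup("<p>AT&amp;T Sacr&eacute; bleu</p>", "html.parser")

minimal = soup.p.decode(formatter="minimal")   # '<p>AT&amp;T Sacré bleu</p>'
as_html = soup.p.decode(formatter="html")      # '<p>AT&amp;T Sacr&eacute; bleu</p>'
raw = soup.p.decode(formatter=None)            # '<p>AT&T Sacré bleu</p>'

pretty = soup.prettify()                       # one tag or string per line
```

With formatter=None nothing is escaped, so the output may no longer be valid HTML.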

Encodings

When you load a document into Beautiful Soup, a sub-library called UnicodeDammit will be used to detect the document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as the from_encoding argument. If you don't know what the correct encoding is, but you know that UnicodeDammit is guessing wrong, you can pass the wrong guesses in as a list through the exclude_encodings argument.

In rare cases, usually when a UTF-8 document contains text written in a completely different encoding, the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER". If UnicodeDammit needs to do this, it will set the .contains_replacement_characters attribute of the UnicodeDammit or BeautifulSoup object to True. This lets you know that the Unicode representation is not an exact representation of the original: some data was lost.

When you write out a document from Beautiful Soup, you can choose an alternative encoding by passing it into .prettify() or .encode(). Any character that can't be represented in your chosen encoding will be converted into numeric XML entity references.

You can use UnicodeDammit without using BeautifulSoup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode.
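The encoding features above can be sketched as follows. The byte string is made up, and the list passed to UnicodeDammit is simply a set of encodings to try first.

```python
from bs4 import BeautifulSoup, UnicodeDammit

# Hypothetical UTF-8 encoded document for illustration.
data = "<p>Sacr\u00e9 bleu!</p>".encode("utf-8")

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(data, "html.parser", from_encoding="utf-8")
detected = soup.original_encoding

# Choose a different encoding on output.
latin1_out = soup.p.encode("latin-1")   # b'<p>Sacr\xe9 bleu!</p>'
ascii_out = soup.p.encode("ascii")      # b'<p>Sacr&#233; bleu!</p>'

# UnicodeDammit on its own: bytes in, Unicode out.
dammit = UnicodeDammit(data, ["utf-8"])
text = dammit.unicode_markup
```

The ASCII output shows the numeric entity reference that replaces a character the target encoding cannot represent.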

Parsing only part of a document

The SoupStrainer class allows you to choose which parts of an incoming document are parsed. You just create a SoupStrainer and pass it in to the BeautifulSoup constructor as the parse_only argument. Note that this feature won't work if you're using the html5lib parser.

The SoupStrainer class takes the same arguments as a typical search method: name, attrs, string and **kwargs.

The SoupStrainer can also be passed as an argument to searching methods.
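A minimal sketch of partial parsing, using the constructor's parse_only keyword and the built-in html.parser. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical fragment for illustration.
html = '<p>intro</p><a href="/one">1</a><b>x</b><a href="/two">2</a>'

# Only <a> tags (and their contents) make it into the tree.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]   # ['/one', '/two']
paragraph = soup.find("p")                        # None: <p> was never parsed
```

Skipping the unwanted parts at parse time can save memory on large documents compared with parsing everything and searching afterwards.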
