Original blog post:
https://www.cnblogs.com/dengyg200891/p/6060010.html
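# -*- coding:utf-8 -*-
# Python 3 (urllib.request is a Python 3 module)
# XiaoDeng
# http://tieba.baidu.com/p/2460150866
# Tag operations

from bs4 import BeautifulSoup
import urllib.request
import re

# To work on a live URL instead, the page can be fetched like this:
# html_doc = "http://tieba.baidu.com/p/2460150866"
# req = urllib.request.Request(html_doc)
# webpage = urllib.request.urlopen(req)
# html = webpage.read()

# Sample document. The markup follows the usual Beautiful Soup "three sisters"
# example; the tag attributes shown here are assumed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'html.parser')  # document object

# soup.a finds only the first <a> tag
# print(soup.a)

for k in soup.find_all('a'):
    print(k)
    print(k['class'])  # class attribute of the <a> tag
    print(k['id'])     # id attribute of the <a> tag
    print(k['href'])   # href attribute of the <a> tag
    print(k.string)    # string content of the <a> tag
    # tag.get('class') gives the same result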
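When I used k['href'] this way to read the links from a web page, the interpreter raised an error:
KeyError: 'href'
Subscript access raises KeyError as soon as it reaches an <a> tag that has no href attribute. I changed it to:
k.get('href')
This ran successfully and extracted the links from href; for a tag without the attribute, get() returns None instead of raising.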
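
The difference between the two access styles can be reproduced with a small sketch. The markup in SAMPLE_HTML and the helper names collect_hrefs and collect_hrefs_strict are made up for illustration and are not from the original post. The second <a> tag deliberately has no href, which is the kind of tag that likely triggered the error on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second <a> tag has no href attribute.
SAMPLE_HTML = """
<p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a name="top" class="anchor">Top</a>
</p>
"""


def collect_hrefs(html):
    """Return the href of every <a> tag, or None where the attribute is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]


def collect_hrefs_strict(html):
    """Return hrefs only from <a> tags that actually carry an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    anchor = soup.find_all("a")[1]
    try:
        anchor["href"]  # subscript access raises when the attribute is missing
    except KeyError as exc:
        print("KeyError:", exc)
    print(collect_hrefs(SAMPLE_HTML))
    print(collect_hrefs_strict(SAMPLE_HTML))
```

If the None entries are not wanted, filtering with find_all('a', href=True) keeps subscript access safe, because only tags that define href are returned.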