Python Tutorial: Idiom Lookup Tool - Data Acquisition


We fetch the content we need from this website. There is no need to deal with its many sections; browsing by the letter index is enough.

Open the page for each letter, fetch its data, and loop over its page numbers. Note that the pages contain quite a lot of duplicate entries, so remember to deduplicate.

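1. Fetching a page

The usual routine. Because we will query the page with XPath, the function simply returns the HTML as a string. The data contains a large number of traditional Chinese characters, so the response encoding is set to gbk (GBK, unlike the narrower GB2312, covers them).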

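import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; the original snippet uses `headers` without defining it

def get_html(url):
    r = requests.get(url, headers=headers)
    r.encoding = 'gbk'  # decode the response body as GBK
    return r.text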
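
To see why the encoding choice matters, here is a small offline sketch. It does not contact the site; the sample string is made up for illustration. The same GBK bytes decode cleanly with the right codec, turn into unreadable text with a wrong one, and cannot even be represented in GB2312.

```python
# Hypothetical sample text standing in for a response body from the site.
sample = "成語查詢"
raw = sample.encode("gbk")            # bytes as a GBK-encoded server would send them

decoded_ok = raw.decode("gbk")        # the right codec restores the text
decoded_bad = raw.decode("latin-1")   # a wrong codec silently yields garbage

# GB2312 has no code points for these traditional characters.
try:
    sample.encode("gb2312")
    fits_gb2312 = True
except UnicodeEncodeError:
    fits_gb2312 = False

print(decoded_ok == sample, decoded_bad == sample, fits_gb2312)
```

Setting r.encoding before reading r.text plays the same role as the explicit decode("gbk") call in this sketch.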

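2. Fetching the data on the current page

The idioms and their interpretations are stored in a list on the page, so we can simply iterate over that list (current page only). Duplicate entries need to be cleaned up. Here we use the anonymous function lambda z: dict([(x, y) for y, x in z.items()]) to swap the dictionary's keys and values twice, so that entries sharing the same interpretation collapse into a single entry.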

from lxml import etree

def get_curr(url):
    html = etree.HTML(get_html(url))
    lis = html.xpath('//li[@class="licontent"]')
    context = {}
    for li in lis:
        # Skip list items that lack either the idiom or its interpretation
        idioms = li.xpath('./span[@class="hz"]/a/text()')
        interpretations = li.xpath('./span[@class="js"]/text()')
        if idioms and interpretations:
            context[idioms[0]] = interpretations[0]
    # Swap keys and values twice: entries with an identical interpretation collapse into one
    func = lambda z: dict([(x, y) for y, x in z.items()])
    idiom_dict = func(func(context))
    return idiom_dict
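
The effect of the double inversion is easier to see on a tiny dictionary. The sketch below uses made-up entries (not data from the site) and a helper named invert, which is only an illustrative stand-in for the lambda above. When two idioms share an interpretation, the first inversion keeps only the later one, so the earlier entry disappears after inverting back.

```python
def invert(d):
    """Swap keys and values; later items overwrite earlier ones with the same value."""
    return {value: key for key, value in d.items()}

# Hypothetical page data: the first two idioms share one interpretation.
page = {
    "一心一意": "wholeheartedly",
    "全心全意": "wholeheartedly",
    "畫蛇添足": "to overdo something",
}

deduped = invert(invert(page))
print(deduped)
```

Note that this removes entries with duplicate interpretations, not duplicate idioms: a dictionary already holds each idiom key only once.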

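3. Looping over the pages

The bottom of each page has pagination links: total pages, current page, last page, next page, and so on. If a letter has only one page, however, nothing is shown at all, and once you reach the final page the pagination links disappear as well (odd, isn't it?). All we need here is the total number of pages and the current letter index. The commented-out write_data and print calls are there for inspecting the data of each letter index. The final run writes everything into one single file, so if you want to see the idioms for each letter separately, uncomment those two lines.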

import re

def run(url, context):
    html = etree.HTML(get_html(url))
    # Defaults for a letter that has a single page (no pagination links)
    letter = url.split('/')[-1][0]
    total = 1
    last_page = html.xpath('//a[contains(text(), "末頁")]/@href')  # "末頁" is the "last page" link
    if last_page:
        text = last_page[0]
        # re.search returns None when nothing matches, so check the match before calling group()
        letter_match = re.search(r'\w', text)
        total_match = re.search(r'\d+', text)
        if letter_match:
            letter = letter_match.group(0)
        if total_match:
            total = int(total_match.group(0))
    for num in range(1, total + 1):
        page_context = get_curr('http://chengyu.kxue.com/pinyin/' + letter + '_' + str(num) + '.html')
        context.update(page_context)
        print("Added {}, {} pages in total".format(letter + '_' + str(num), total))
    #write_data('grandSon/' + url.split('/')[-1][0] + '.json', context)
    #print("Finished writing {}".format(url.split('/')[-1][0]))
    return context
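
The pagination logic can be tried without any network access. The following is a minimal sketch, assuming the "last page" href ends in something like a_12.html; that shape, the stricter regular expression, and the helper names parse_last_page and page_urls are illustrative assumptions, not taken from the site or the code above.

```python
import re

BASE = "http://chengyu.kxue.com/pinyin/"

def parse_last_page(href, fallback_letter):
    """Return (letter, total_pages) from a 'last page' href, or a one-page fallback."""
    match = re.search(r"([a-z])_(\d+)\.html", href or "")
    if match is None:
        # No pagination link found: treat the letter as having a single page.
        return fallback_letter, 1
    return match.group(1), int(match.group(2))

def page_urls(letter, total):
    """Build the URL of every page for one letter, numbered from 1."""
    return [f"{BASE}{letter}_{num}.html" for num in range(1, total + 1)]

letter, total = parse_last_page("/pinyin/b_3.html", "b")
print(page_urls(letter, total))
print(parse_last_page(None, "x"))
```

Matching the letter and the page number together avoids picking up an unrelated character from elsewhere in the href.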

4. Writing the data

Convert the dictionary straight to JSON and write it to a file. The output format can be adjusted as needed.

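import json

def write_data(file, context):
    with open(file, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese text readable; indent=2 pretty-prints the output
        f.write(json.dumps(context, indent=2, ensure_ascii=False))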
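
As a quick illustration of the formatting options, the sketch below writes a one-entry dictionary (made-up sample data) to a temporary file and reads it back. It also shows what ensure_ascii=False changes: by default json.dumps escapes every non-ASCII character.

```python
import json
import os
import tempfile

# Hypothetical sample entry, not scraped data.
sample = {"畫蛇添足": "to ruin something by adding what is superfluous"}

escaped = json.dumps(sample)                        # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False)   # keeps the characters as they are

# Round trip through a UTF-8 file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "idiom_sample.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, indent=2, ensure_ascii=False))

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(escaped)
print(readable)
```

Both forms load back to the same dictionary; the readable form is simply easier to inspect by eye.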

5. Iterating over all letters

Go to the site's home page, collect the links for all letters, and call the functions above on each link.

url = "http://chengyu.kxue.com/"
html = etree.HTML(get_html(url))
file = 'idiom.json'
context = {}
urls = html.xpath('//div[@class="content letter"]/li/a/@href')
for letter_url in urls:  # renamed so the loop does not overwrite `url`
    context.update(run("http://chengyu.kxue.com" + letter_url, {}))
write_data(file, context)
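
The merge step can be exercised offline as well. This is a sketch under stated assumptions: the hrefs are invented, fake_run is a stub standing in for run, and absolute_links and collect are hypothetical helper names. It uses urljoin from the standard library instead of string concatenation, which handles hrefs with or without a leading slash.

```python
from urllib.parse import urljoin

BASE_URL = "http://chengyu.kxue.com/"

def absolute_links(hrefs, base=BASE_URL):
    """Resolve relative hrefs against the site root."""
    return [urljoin(base, href) for href in hrefs]

def collect(letter_urls, fetch_letter):
    """Merge the dictionaries returned for each letter into one."""
    merged = {}
    for letter_url in letter_urls:
        merged.update(fetch_letter(letter_url))
    return merged

def fake_run(url):
    """Stub standing in for run(): returns one made-up entry per letter."""
    letter = url.split("/")[-1][0]
    return {"idiom-" + letter: "meaning-" + letter}

# Hypothetical hrefs; the real ones come from the home page XPath query.
links = absolute_links(["/pinyin/a.html", "pinyin/b.html"])
result = collect(links, fake_run)
print(links)
print(result)
```

Passing the fetch function as a parameter makes it possible to test the merge without scraping anything.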

If anything is unclear, feel free to leave a comment. More hands-on Python tutorials will keep coming!

