from bs4 import BeautifulSoup
html = """
<html>
<head><title>Example Page</title></head>
<body>
<h1>Heading</h1>
<p class="paragraf">This is a simple paragraph.</p>
<a href="https://example.com">Link</a>
</body>
</html>
"""
# Parse the HTML
soup = BeautifulSoup(html, "lxml")
Searching DOM elements
Searching by element:
print(soup.title)    # <title>Example Page</title>
print(soup.h1.text)  # Heading
Searching by class:
paragraf = soup.find("p", class_="paragraf")
print(paragraf.text)  # This is a simple paragraph.
Getting links:
link = soup.find("a")
print(link['href']) # https://example.com
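If an attribute may be missing, Tag.get() returns None instead of raising a KeyError:
print(link.get("href"))  # https://example.com
print(link.get("rel"))   # None - the attribute is absent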
Getting multiple elements
from bs4 import BeautifulSoup
html = """
<ul>
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")
elements = soup.find_all("li")
for element in elements:
    print(element.text)
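The same elements can be matched with a CSS selector via select(), which BeautifulSoup supports through its bundled soupsieve dependency:
elements = soup.select("ul li")  # CSS equivalent of find_all("li")
for element in elements:
    print(element.text)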
2 Working with Scrapy
Installing Scrapy
pip install scrapy
Creating a new project
scrapy startproject myproject
Scrapy's basic structure
spiders: the folder containing the files that implement the crawling logic.
items.py: defines the structure used to store the scraped data.
pipelines.py: used to clean or persist the scraped data.
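For reference, the generated project has roughly this layout:
myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        items.py          # item definitions
        middlewares.py    # spider and downloader middleware
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules live here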
Creating a simple spider
Create the spider file:
cd myproject
scrapy genspider example example.com
Write the spider:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract and print the page title
        title = response.xpath("//title/text()").get()
        print(f"Title: {title}")
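Run the spider from inside the project directory:
scrapy crawl example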
Searching with XPath and CSS selectors
With XPath:
response.xpath("//h1/text()").get()  # get the h1 text
With a CSS selector:
response.css("h1::text").get()  # get the h1 text
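Attributes are extracted the same way; both lines below return the href of the first link:
response.xpath("//a/@href").get()
response.css("a::attr(href)").get()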
Collecting and saving data
Write the output to a JSON or CSV file:
scrapy crawl example -o results.json
Processing data through a pipeline: edit the pipelines.py file:
class MyProjectPipeline:
    def process_item(self, item, spider):
        # Clean up the scraped value
        item['title'] = item['title'].strip()
        return item
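The pipeline runs only after it is enabled in settings.py; the integer (0-1000) sets the order in which pipelines are applied, lower first:
ITEM_PIPELINES = {
    "myproject.pipelines.MyProjectPipeline": 300,
}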
Sometimes data is loaded via AJAX. Scrapy can fetch this data by sending a request directly to the endpoint:
import scrapy

class AjaxSpider(scrapy.Spider):
    name = "ajax_example"
    start_urls = ["https://example.com/ajax"]

    def parse(self, response):
        # Parse the JSON response
        data = response.json()
        for item in data['results']:
            yield {
                'name': item['name'],
                'price': item['price']
            }
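        # Hypothetical pagination sketch: "next" is an assumed field in the
        # API response (not shown above) holding the URL of the next page.
        next_url = data.get("next")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)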
Scraping dynamic content (working together with Selenium)
Scrapy can be used together with Selenium to scrape dynamically loaded pages:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By

class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def start_requests(self):
        urls = ["https://example.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Re-open the URL in the browser so JavaScript runs, then read the live DOM
        self.driver.get(response.url)
        elements = self.driver.find_elements(By.TAG_NAME, "h1")
        for element in elements:
            print(element.text)

    def closed(self, reason):
        # Quit the browser when the spider finishes
        self.driver.quit()
Processing and filtering data
To filter data in Scrapy, use an ItemLoader:
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class MySpider(scrapy.Spider):
    name = "filtered_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        loader = ItemLoader(item={}, response=response)
        # Processors run on each extracted value before it is stored
        loader.add_xpath("title", "//title/text()", TakeFirst())
        loader.add_xpath("price", "//span[@class='price']/text()", MapCompose(str.strip))
        yield loader.load_item()
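Processors can also be declared once on the item's fields instead of at every add_xpath call. A minimal sketch, assuming a hypothetical ProductItem with title and price fields:
import scrapy
from itemloaders.processors import TakeFirst, MapCompose

class ProductItem(scrapy.Item):
    # ItemLoader picks these field-level processors up automatically
    title = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )
The loader would then be created as ItemLoader(item=ProductItem(), response=response), and the add_xpath calls no longer need explicit processors.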
Changing the proxy and User-Agent
Configure a proxy or the User-Agent to avoid getting blocked:
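A minimal sketch of both options: the USER_AGENT setting is read by Scrapy's built-in UserAgentMiddleware, and a proxy placed in the request meta is picked up by the built-in HttpProxyMiddleware (the proxy address below is a placeholder):
import scrapy

class BlockAwareSpider(scrapy.Spider):
    name = "block_aware"
    # Per-spider settings override settings.py
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            # Placeholder proxy; HttpProxyMiddleware reads the "proxy" key
            meta={"proxy": "http://127.0.0.1:8080"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)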