Project: Scanning Web Pages and Collecting Data
Main Project Goals
Send requests to a website with configurable user agents.
Check how the server responds to different User-Agent headers.
Collect data for analysis (for example, links and files).
Download files from the site.
Project Code
The code below is the complete program for the project; a step-by-step walkthrough follows it.
import requests
from bs4 import BeautifulSoup
import random
import itertools
from urllib.parse import urljoin
# 1. List of user agents
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
]
# 2. Set up user_agents for round-robin rotation
user_agents_cycle = itertools.cycle(user_agents)
# 3. Pick a random User-Agent
def get_random_user_agent():
    """
    Pick a random User-Agent from the list of user agents.
    """
    return random.choice(user_agents)
# 4. Download the page and return its HTML content
def get_html(url, user_agent):
    """
    Download the HTML content of a URL, sending the given User-Agent with the request.
    """
    headers = {"User-Agent": user_agent}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Error! Status code: {response.status_code}")
        return None
# 5. Parse the page and collect all links
def get_all_links(html):
    """
    Extract all links (href values) from an HTML page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', href=True)]
    return links
# 6. Download files from the site
def download_files(url, file_extension="pdf"):
    """
    Download files with the given extension from the site.
    """
    html = get_html(url, get_random_user_agent())
    if html is None:
        return
    links = get_all_links(html)
    for link in links:
        if link.endswith(file_extension):
            # urljoin resolves relative links against the page URL correctly
            file_url = urljoin(url, link)
            file_name = file_url.split("/")[-1]
            try:
                print(f"Downloading: {file_name}")
                file_data = requests.get(file_url, timeout=30)
                with open(file_name, "wb") as file:
                    file.write(file_data.content)
                print(f"Downloaded: {file_name}")
            except requests.RequestException as e:
                print(f"Error while downloading: {e}")
# 7. Run the main functions as a test
def main():
    """
    Execute the project's main functions.
    """
    url = "https://example.com"
    # 1. Load the page with a chosen User-Agent
    user_agent = get_random_user_agent()
    print(f"User-Agent: {user_agent}")
    html = get_html(url, user_agent)
    # 2. Print all links found on the page
    if html:
        links = get_all_links(html)
        print("Links found:", links)
    # 3. Download PDF files from the site
    download_files(url, file_extension="pdf")

# Run the project
if __name__ == "__main__":
    main()
Code Walkthrough
1 User-Agent List
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1"
]
This list contains User-Agent strings for different browsers and devices. Depending on which one is sent, the server receives different information about the client.
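To verify what the server actually sees, one of these strings can be sent to an echo endpoint. A minimal sketch, assuming network access to the public httpbin.org service, which simply reports back the User-Agent header it received:

import requests

# httpbin.org/user-agent echoes back the User-Agent header it received
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
resp = requests.get("https://httpbin.org/user-agent",
                    headers={"User-Agent": ua}, timeout=10)
print(resp.json())  # e.g. {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; ...'}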
2 Rotating User-Agents
user_agents_cycle = itertools.cycle(user_agents)
itertools.cycle makes it possible to use the User-Agents in round-robin order. The project code does not consume the cycle yet; it is prepared so that later each request can use the next User-Agent in turn, as the sketch below shows.
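A minimal sketch of how the cycle would be consumed; the page URLs here are placeholders:

# Each next() call returns the following User-Agent, wrapping around at the end
for page in ["https://example.com/a", "https://example.com/b",
             "https://example.com/c", "https://example.com/d"]:
    ua = next(user_agents_cycle)
    print(page, "->", ua[:40], "...")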
3 Random User-Agent Selection
get_random_user_agent() – picks one entry from the user_agents list with random.choice.
4 Page Download Function
headers = {"User-Agent": user_agent} – the request headers are populated with the given User-Agent.
response.text – returns the page's HTML text, provided status_code is 200.
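A short usage sketch tying the two together (example.com stands in for a real target):

html = get_html("https://example.com", get_random_user_agent())
if html:
    print(html[:200])  # preview the first 200 characters of the page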
5 Link Extraction Function
def get_all_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', href=True)]
    return links
soup.find_all('a', href=True) – finds every a tag on the page that has an href attribute, so its href value can be extracted.
links – the list that collects those href values.
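Note that href values are often relative ('/about', 'guide.pdf'). A small sketch of resolving them against the page URL with urljoin; the base URL and hrefs here are made-up examples:

from urllib.parse import urljoin

base = "https://example.com/docs/"
for href in ["guide.pdf", "/about", "https://other.site/x"]:
    print(urljoin(base, href))
# https://example.com/docs/guide.pdf
# https://example.com/about
# https://other.site/x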
6 File Download Function
def download_files(url, file_extension="pdf"):
    html = get_html(url, get_random_user_agent())
    if html is None:
        return
    links = get_all_links(html)
    for link in links:
        if link.endswith(file_extension):
            file_url = urljoin(url, link)
            file_name = file_url.split("/")[-1]
            try:
                print(f"Downloading: {file_name}")
                file_data = requests.get(file_url, timeout=30)
                with open(file_name, "wb") as file:
                    file.write(file_data.content)
                print(f"Downloaded: {file_name}")
            except requests.RequestException as e:
                print(f"Error while downloading: {e}")
This function downloads every file whose link ends with the given file_extension:
file_url – the absolute URL of the file, built with urljoin so that relative links are resolved correctly.
file.write(file_data.content) – writes the file's bytes to disk in binary mode.
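For large files, reading the whole response into memory via .content can be wasteful. A hedged alternative sketch using requests' streaming mode; the file URL is a made-up example:

import requests

file_url = "https://example.com/files/report.pdf"  # hypothetical file URL
file_name = file_url.split("/")[-1]

# stream=True downloads the body in chunks instead of all at once
with requests.get(file_url, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    with open(file_name, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
print(f"Downloaded: {file_name}")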