Scraping a product price from a website

0

Hi, let me say up front that this is my first time reading specific data off a web page. A short program is supposed to fetch the current price from a shop's page. I managed to get the text "twoja cena" ("your price"), which sits in a div with a class, but fetching the text from a div without a class raises an error:

Traceback (most recent call last):
  File "/home/slavo/Dokumenty/Python/SI/test3.py", line 15, in <module>
    footer2 = link.find('span', class_='product-prices__price product-prices__price--big').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

My code:

from bs4 import BeautifulSoup
from requests import get

url = 'https://www.tim.pl/wyszukiwanie/wyniki/?q=1131-136AA-KN659&p=1'
page = get(url)

bs = BeautifulSoup(page.content, 'lxml')

for link in bs.find_all('div', class_ = 'product-prices__label'):
    footer = link.find('span', class_ = 'product-prices__label-text').get_text()
    print(footer)

for link in bs.find_all('div'):
    footer2 = link.find('span', class_='product-prices__price product-prices__price--big').get_text()
    print(footer2)

Can anyone help me solve this?

2

link.find returns None, meaning nothing was found, so there is no text to read either; maybe change the search criteria?
Inspect the page in Chrome DevTools.
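
A minimal sketch of guarding against that None, assuming BeautifulSoup with the built-in html.parser backend. The HTML snippet and the helper name price_texts are made up for illustration. Note that a loop over every div will hit many divs with no price span inside, and find() returns None for each of those even when the price exists elsewhere on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div with a label, one with a price, one with neither.
SAMPLE_HTML = """
<div class="product-prices__label"><span class="product-prices__label-text">Twoja cena</span></div>
<div><span class="product-prices__price product-prices__price--big" title="3,01 zł">3,01 zł</span></div>
<div>no price here</div>
"""

def price_texts(html):
    """Return the text of every span that carries the 'big price' class."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for div in soup.find_all("div"):
        span = div.find("span", class_="product-prices__price--big")
        if span is None:  # find() returns None when nothing matches
            continue
        texts.append(span.get_text(strip=True))
    return texts

print(price_texts(SAMPLE_HTML))  # ['3,01 zł']
```

This only removes the crash; if the span is missing from the downloaded HTML altogether, the list simply comes back empty.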

1

requests.get downloads only the raw HTML, and the price is not in it yet, just an empty div. The page is dynamic: the empty slot for the price is filled in only after the page's JavaScript runs.
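
One quick way to confirm this is to count which CSS classes actually appear on span tags in the HTML you downloaded. Below is a sketch using only the standard library; the two HTML strings are hypothetical stand-ins for the page before and after the scripts run, and span_class_counts is a name I made up.

```python
from collections import Counter
from html.parser import HTMLParser

class SpanClassCounter(HTMLParser):
    """Count how often each CSS class appears on <span> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = dict(attrs).get("class") or ""
            self.counts.update(classes.split())

def span_class_counts(html):
    """Parse an HTML string and return a Counter of span classes."""
    parser = SpanClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

# Hypothetical snippets imitating the page before and after rendering.
RAW = '<div><span class="el-loading el-loading-w400"></span></div>'
RENDERED = ('<div><span class="product-prices__price product-prices__price--big" '
            'title="3,01 zł">3,01 zł</span></div>')

PRICE_CLASS = "product-prices__price--big"
print(span_class_counts(RAW)[PRICE_CLASS])       # 0
print(span_class_counts(RENDERED)[PRICE_CLASS])  # 1
```

With the real page you would pass in the text of the response; a count of 0 for the price class means no selector can find it in that HTML, however it is written.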

48

If what the previous posters say about JS rendering is true, you need to change your approach.
Here is one tool that can scrape such content: https://requests.readthedocs.io/projects/requests-html/en/latest/

0

I'll try it, thanks everyone for the tips.

2

This requests_html is great. Normally I would use a browser driver and control the browser through Selenium. I checked, and it does work:

#!/usr/bin/python3

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://www.tim.pl/wyszukiwanie/wyniki/?q=1131-136AA-KN659&p=1')
r.html.render()
e = r.html.xpath('//*[@id="app"]/div/div[3]/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div[2]/div[1]/div/div/span[1]')

print(e[0].attrs)

The price was rendered:

{'data-v-2cb8123b': '', 'title': '3,01 zł', 'class': ('product-prices__price', 'product-prices__price--big')}
0

Unfortunately not for me; I get an error because the list is empty:

Traceback (most recent call last):
  File "/home/slavo/Dokumenty/Python/SI/test3.py", line 24, in <module>
    print(e[0].attrs)
IndexError: list index out of range
0

I tried to find every 'span':

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://www.tim.pl/wyszukiwanie/wyniki/?q=1131-136AA-KN659&p=1')

h = (r.html.find('span'))

for g in h:
    print(g)

The results still do not contain the price:

<Element 'span' class=('tim-icon',) style='height:16px;width:16px;' data-v-618a07e0='' data-v-70c1ef2c=''>
<Element 'span' class=('contact-info__text-header',) data-v-70c1ef2c=''>
<Element 'span' class=('tim-icon',) style='height:8px;width:8px;' data-v-618a07e0='' data-v-70c1ef2c=''>
<Element 'span' class=('tim-icon',) style='height:16px;width:16px;' data-v-618a07e0='' data-v-70c1ef2c=''>
<Element 'span' class=('user-icon',) data-v-61dab40b=''>
<Element 'span' class=('tim-icon', 'account-data__login-icon', 'account-data__login-icon-user') style='height:16px;width:16px;' data-v-618a07e0='' data-v-61dab40b=''>
<Element 'span' class=('fw-300', 'c-black--text') data-v-61dab40b=''>
<Element 'span' class=('tim-icon', 'account-data__login-icon', 'account-data__login-icon-chevron') style='height:7px;width:7px;' data-v-618a07e0='' data-v-61dab40b=''>
<Element 'span' class=('v-badge', 'ml-4', 'v-badge--left', 'v-badge--overlap') data-v-1f6fdf5a=''>
<Element 'span' class=('v-badge__badge', 'transparent')>
<Element 'span' class=('c-primary--text', 'c-accent', 'mr-3', 'px-1', 'fs-12', 'wishlist-icon__badge') data-v-1f6fdf5a=''>
<Element 'span' class=('ml-2', 'nowrap', 'fs-12') data-v-1f6fdf5a=''>
<Element 'span' class=('v-badge', 'ml-4', 'wishlist-icon__wrapper', 'v-badge--left', 'v-badge--overlap') data-v-1f6fdf5a=''>
<Element 'span' class=('v-badge__badge', 'transparent')>
<Element 'span' class=('c-primary--text', 'c-accent', 'mr-4', 'px-1', 'fs-12', 'wishlist-icon__badge') data-v-1f6fdf5a=''>
<Element 'span' class=('v-badge', 'ml-2', 'v-badge--left', 'v-badge--overlap') data-v-decf05b4=''>
<Element 'span' class=('tim-icon', 'cart-icon') style='height:38px;width:38px;' data-v-618a07e0='' data-v-decf05b4=''>
<Element 'span' class=('v-badge__badge', 'transparent')>
<Element 'span' class=('c-primary--text', 'c-accent', 'ml-1', 'px-1', 'fs-12', 'cart-icon__badge') data-v-decf05b4=''>
<Element 'span' data-t='minicart-price' class=('ml-2', 'nowrap', 'fs-12', 'fw-price') data-v-decf05b4=''>
<Element 'span' class=('v-badge', 'ml-3', 'cart-icon__wrapper', 'v-badge--left', 'v-badge--overlap') data-v-decf05b4=''>
<Element 'span' class=('tim-icon', 'cart-icon') style='height:37px;width:37px;' data-v-618a07e0='' data-v-decf05b4=''>
<Element 'span' class=('v-badge__badge', 'transparent')>
<Element 'span' class=('c-primary--text', 'c-accent', 'ml-1', 'px-1', 'fs-12', 'cart-icon__badge') data-v-decf05b4=''>
<Element 'span' class=('tim-icon',) style='height:28px;width:28px;' data-v-618a07e0='' data-v-0ddd0dc6=''>
<Element 'span' class=('tim-icon', 'search__clear-icon') style='height:14px;width:14px;' data-v-618a07e0='' data-v-32bf65cc=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' class=('tim-icon',) style='height:20px;width:20px;' data-v-618a07e0='' data-v-32bf65cc=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' class=('el-loading', 'el-loading-w400', 'el-loading-h15', 'va-middle') data-v-5062fcf1=''>
<Element 'span' data-v-2cbd303c=''>
<Element 'span' class=('tim-icon', 'tree-level__button', 'tree-level__button--is-active') style='height:17px;width:17px;' data-v-618a07e0='' data-v-4bfbdf3e=''>
<Element 'span' class=('tree-level__amount',) data-v-4bfbdf3e=''>
<Element 'span' class=('tim-icon', 'tree-level__button', 'tree-level__button--is-active') style='height:17px;width:17px;' data-v-618a07e0='' data-v-4bfbdf3e=''>
<Element 'span' class=('tree-level__amount',) data-v-4bfbdf3e=''>
<Element 'span' class=('tree-level__dot',) data-v-4bfbdf3e=''>
<Element 'span' class=('tree-level__amount',) data-v-4bfbdf3e=''>
<Element 'span' class=('ml-1', 'fw-300', 'text-lowercase') data-v-2eaf50de=''>
<Element 'span' class=('tim-icon', 'base-search-input__clear-search-icon') style='height:16px;width:14px;' data-v-618a07e0='' data-v-7ce33578=''>
<Element 'span' class=('tim-icon', 'base-search-input__search-icon') style='height:20px;width:20px;' data-v-618a07e0='' data-v-7ce33578=''>
<Element 'span' class=('ml-1', 'fw-300', 'text-lowercase') data-v-2eaf50de=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('list-seo-header--bold',)>
<Element 'span' class=('list-seo-header--bold',)>
<Element 'span' class=('tim-icon', 'sort-bar-list-type__icon', 'sort-bar-list-type__icon--active') style='height:18px;width:18px;' data-v-618a07e0='' data-v-61bfe9de=''>
<Element 'span' class=('tim-icon', 'tim-select__icon') style='height:10px;width:16px;' data-v-618a07e0='' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-icon', 'tim-select__icon') style='height:10px;width:16px;' data-v-618a07e0='' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' data-v-35c18d32=''>
<Element 'span' title='Simon 54 Premium Ramka pojedyncza biała DR1/11' class=('base-text-clamp', 'product-header__title') data-v-0fe603fe='' data-v-80c96f0a=''>
<Element 'span' class=('product-link-attribute__label',) data-v-2728c3e8=''>
<Element 'span' title='KONTAKT-SIMON' class=('base-text-clamp',) data-v-0fe603fe='' data-v-2728c3e8=''>
<Element 'span' class=('product-link-attribute__label',) data-v-2728c3e8=''>
<Element 'span' title='SIMON 54' class=('base-text-clamp',) data-v-0fe603fe='' data-v-2728c3e8=''>
<Element 'span' class=('product-simple-attribute__label',) data-v-20b419da=''>
<Element 'span' title='DR1/11' class=('base-text-clamp', 'product-simple-attribute__value') data-v-0fe603fe='' data-v-20b419da=''>
<Element 'span' class=('product-simple-attribute__label',) data-v-20b419da=''>
<Element 'span' title='1131-136AA-KN659' class=('base-text-clamp', 'product-simple-attribute__value') data-v-0fe603fe='' data-v-20b419da=''>
<Element 'span' class=('product-link-attribute__label',) data-v-2728c3e8=''>
<Element 'span' title='Ramki' class=('base-text-clamp',) data-v-0fe603fe='' data-v-2728c3e8=''>
<Element 'span' class=('listing-review-attribute__label',) data-v-7fd1e684=''>
<Element 'span' class=('listing-review-attribute__link-rate',) data-v-7fd1e684=''>
<Element 'span' class=('c-secondary--text', 'listing-review-attribute__stars') data-v-55941244='' data-v-7fd1e684=''>
<Element 'span' class=('tim-icon', 'icon-star', 'medium', 'icon-star--selected') style='height:12px;width:12px;' data-v-618a07e0='' data-v-55941244=''>
<Element 'span' class=('tim-icon', 'icon-star', 'medium', 'icon-star--selected') style='height:12px;width:12px;' data-v-618a07e0='' data-v-55941244=''>
<Element 'span' class=('tim-icon', 'icon-star', 'medium', 'icon-star--selected') style='height:12px;width:12px;' data-v-618a07e0='' data-v-55941244=''>
<Element 'span' class=('tim-icon', 'icon-star', 'medium', 'icon-star--selected') style='height:12px;width:12px;' data-v-618a07e0='' data-v-55941244=''>
<Element 'span' class=('tim-icon', 'icon-star', 'medium', 'icon-star--selected') style='height:12px;width:12px;' data-v-618a07e0='' data-v-55941244=''>
<Element 'span' class=('listing-review-attribute__link-reviews-number',) data-v-7fd1e684=''>
<Element 'span' class=('product-mobile-item__ref-number',) data-v-f0eef8bc=''>
<Element 'span' class=('mr-2',) data-v-31b7bf4c=''>
<Element 'span' class=('tim-icon', 'product-list-bottom-param__icon-chevron') style='height:16px;width:16px;' data-v-618a07e0='' data-v-31b7bf4c=''>
<Element 'span' class=('mr-2',) data-v-31b7bf4c=''>
<Element 'span' class=('tim-icon', 'product-list-bottom-param__icon-chevron') style='height:16px;width:16px;' data-v-618a07e0='' data-v-31b7bf4c=''>
<Element 'span' class=('mr-2',) data-v-31b7bf4c=''>
<Element 'span' class=('tim-icon', 'product-list-bottom-param__icon-chevron') style='height:16px;width:16px;' data-v-618a07e0='' data-v-31b7bf4c=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-icon',) style='height:24px;width:24px;' data-v-618a07e0='' data-v-60911526=''>
<Element 'span' class=('tim-button-cta__text',) data-v-60911526=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-icon',) style='height:24px;width:24px;' data-v-618a07e0='' data-v-60911526=''>
<Element 'span' class=('tim-button-cta__text',) data-v-60911526=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' class=('tim-icon',) style='height:27px;width:27px;' data-v-618a07e0='' data-v-1916ffca=''>
<Element 'span' class=('tim-icon', 'icon--big') style='height:20px;width:20px;' data-v-618a07e0='' data-v-14383199=''>
<Element 'span' class=('tim-icon', 'newsletter__header-icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-f2b18a9c=''>
<Element 'span' class=('newsletter__header-title',) data-v-f2b18a9c=''>
<Element 'span' class=('tim-button__content',) data-v-c84b6252=''>
<Element 'span' data-v-c84b6252=''>
<Element 'span' class=('tim-icon', 'content__icon-arrow', 'arrow--right') style='height:20px;width:12px;' data-v-618a07e0='' data-v-c84b6252=''>
<Element 'span' class=('tim-icon', 'social-media__icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-ce647d8a=''>
<Element 'span' class=('tim-icon', 'social-media__icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-ce647d8a=''>
<Element 'span' class=('tim-icon', 'social-media__icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-ce647d8a=''>
<Element 'span' class=('tim-icon', 'social-media__icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-ce647d8a=''>
<Element 'span' class=('tim-icon', 'social-media__icon') style='height:36px;width:36px;' data-v-618a07e0='' data-v-ce647d8a=''>
<Element 'span' class=('tim-icon',) style='height:36px;width:120px;' data-v-618a07e0='' data-v-da5d12aa=''>
<Element 'span' class=('mobile-contact-box-item__upper-row-text',) data-v-0d3b5296='' data-v-e8d66818=''>
<Element 'span' class=('tim-icon', 'mobile-contact-box-item__phone-icon') style='height:16px;width:16px;' data-v-618a07e0='' data-v-e8d66818=''>
<Element 'span' class=('mobile-contact-box-item__phone-text',) data-v-0d3b5296='' data-v-e8d66818=''>
<Element 'span' class=('mobile-contact-box-item__phone-separator',) data-v-0d3b5296='' data-v-e8d66818=''>
<Element 'span' class=('tim-icon', 'mobile-contact-box-item__phone-icon') style='height:16px;width:16px;' data-v-618a07e0='' data-v-e8d66818=''>
<Element 'span' class=('mobile-contact-box-item__phone-text',) data-v-0d3b5296='' data-v-e8d66818=''>
<Element 'span' class=('tim-icon', 'mobile-contact-box-item__mail-icon') style='height:16px;width:16px;' data-v-618a07e0='' data-v-e8d66818=''>
<Element 'span' class=('mobile-contact-box-item__mail-text',) data-v-0d3b5296='' data-v-e8d66818=''>
<Element 'span' class=('__cf_email__',) data-cfemail='d5a6beb9b0a595a1bcb8fba5b9'>
<Element 'span' class=('panel__label',) data-v-21453778=''>
<Element 'span' class=('panel__label',) data-v-21453778=''>
<Element 'span' class=('panel__label',) data-v-21453778=''>
<Element 'span' class=('ml-1',) data-v-21453778=''>
<Element 'span' class=('icon-arrow-right-before',) data-v-59cb5ae5=''>
<Element 'span' class=('toolkit-tabs__tab-title',) style='display:none;' data-v-b790aafc=''>
<Element 'span' class=('tim-icon', 'toolkit-tabs__close-icon') style='height:23px;width:11px;' data-v-618a07e0='' data-v-b790aafc=''>
<Element 'span' class=('toolkit-tabs__tab-title',) data-v-b790aafc=''>
<Element 'span' class=('tim-icon',) style='height:25px;width:25px;' data-v-618a07e0='' data-v-b790aafc=''>
<Element 'span' class=('toolkit-tabs__product-counter',) style='display:none;' data-v-b790aafc=''>
<Element 'span' class=('toolkit-tabs__tab-title',) data-v-b790aafc=''>
<Element 'span' class=('tim-icon',) style='height:25px;width:25px;' data-v-618a07e0='' data-v-b790aafc=''>
<Element 'span' class=('toolkit-tabs__product-counter',) style='display:none;' data-v-b790aafc=''>

And I still have no idea how to do it.

0

Strange, it works for me:

#!/usr/bin/python3

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://www.tim.pl/wyszukiwanie/wyniki/?q=1131-136AA-KN659&p=1')
r.html.render()

for i,g in enumerate(r.html.find('span')):
    print(i,g)

The price is at position 118:

...
118 <Element 'span' data-v-0a5c8a27='' title='3,01 zł' class=('product-prices__price', 'product-prices__price--big')>
...
0

Today it is indeed there, only at position 109, but with this method:

e = r.html.xpath('//*[@id="app"]/div/div[3]/div/div[2]/div/div[2]/div/div[2]/div[2]/div/div[2]/div[1]/div/div/span[1]')

print(e[0].attrs)

I cannot extract it.

(I also noticed that the price is at a different position each time; is there a way around that?)
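
On the position changing: an absolute path counts elements from the top of the page, so any extra div shifts it, while a selector based on the class does not care where the element sits. Below is a small sketch with the standard library's xml.etree.ElementTree. The two layouts are invented. ElementTree's limited XPath has no contains(), so this sketch matches the full class string exactly; that is a swapped-in simplification, not what requests_html does.

```python
import xml.etree.ElementTree as ET

# Two hypothetical layouts of the same page; the second has an extra banner div.
LAYOUT_A = """<div id="app">
  <div><span class="product-prices__price product-prices__price--big" title="3,01 zł"/></div>
</div>"""
LAYOUT_B = """<div id="app">
  <div>banner</div>
  <div><span class="product-prices__price product-prices__price--big" title="3,01 zł"/></div>
</div>"""

# Positional path: "first span of the first div under the root".
POSITIONAL = "./div[1]/span[1]"
# Attribute path: "any span with exactly this class string, at any depth".
BY_CLASS = ".//span[@class='product-prices__price product-prices__price--big']"

def titles(markup, path):
    """Return the title attribute of every element matched by path."""
    root = ET.fromstring(markup)
    return [el.get("title") for el in root.findall(path)]

print(titles(LAYOUT_A, POSITIONAL))  # ['3,01 zł']
print(titles(LAYOUT_B, POSITIONAL))  # []
print(titles(LAYOUT_B, BY_CLASS))    # ['3,01 zł']
```

The positional path breaks as soon as one div is added, while the class-based one keeps working.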

0

I found a way to extract the price; topic closed.

Solution:

#!/usr/bin/python3

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.tim.pl/wyszukiwanie/wyniki/?q=0001-00000-57481&p=1')
# Run the page's JavaScript so that the price elements get filled in.
r.html.render()

# Select by class instead of by position, so layout changes do not matter.
for item in r.html.xpath("//*[contains(@class,'product-prices__price--big')]"):
    print(item.text)
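
If the price is needed as a number rather than as text, here is a minimal sketch, assuming the text looks like the title attribute seen earlier in the thread ('3,01 zł'): a decimal comma, optional spaces as thousands separators, and a currency suffix. The helper name parse_price is made up.

```python
import re
from decimal import Decimal

def parse_price(text):
    """Convert a Polish-formatted price such as '3,01 zł' to a Decimal."""
    match = re.search(r"\d[\d\s]*(?:,\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # Drop thousands separators (spaces, including non-breaking ones) and
    # use a dot as the decimal mark so Decimal can read the number.
    digits = re.sub(r"\s", "", match.group()).replace(",", ".")
    return Decimal(digits)

print(parse_price("3,01 zł"))  # 3.01
```

Decimal is used instead of float so that prices do not pick up binary rounding errors.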

Thank you all for your help :)
