Scraping javascript website and script tags using python

Question

I am trying to scrape a javascript web page. Having read some of the posts I managed to write the following:

    from bs4 import BeautifulSoup
    import requests

    website_url = requests.get('https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm').text
    soup = BeautifulSoup(website_url, 'lxml')
    print(soup.prettify())

and retrieve the script I am interested in like this:

soup.find_all('script')[3]

which gives:

    <script type="text/javascript">
    // Initialize script parameters.
    var exportTitle ="Centralised medicinal products for human use by ATC code";
    // Initialise the dataset.
    var dataSet = [
    {"id":"A","parent":"#","text":"A - Alimentary tract and metabolism"},
    {"id":"A02","parent":"A","text":"A02 - Drugs for acid related disorders"},
    {"id":"A02B","parent":"A02","text":"A02B - Drugs for treatment of peptic ulcer"},
    {"id":"A02BC","parent":"A02B","text":"A02BC - Proton pump inhibitors"},
    {"id":"A02BC01","parent":"A02BC","text":"A02BC01 - omeprazole"},
    {"id":"ho15861","parent":"A02BC01","text":"Losec and associated names (referral)","type":"pl"},
    ...
    {"id":"h154","parent":"V09IA05","text":"NeoSpect (withdrawn)","type":"pl"},
    {"id":"V09IA09","parent":"V09IA","text":"V09IA09 - technetium (<sup>99m</sup>Tc) tilmanocept"},
    {"id":"h955","parent":"V09IA09","text":"Lymphoseek (active)","type":"pl"},
    {"id":"V09IB","parent":"V09I","text":"V09IB - Indium (<sup>111</sup>In) compounds"},
    {"id":"V09IB03","parent":"V09IB","text":"V09IB03 - indium (<sup>111</sup>In) antiovariumcarcinoma antibody"},
    {"id":"h025","parent":"V09IB03","text":"Indimacis 125 (withdrawn)","type":"pl"},
    ...
    ];
    </script>

The problem I am facing is getting the text of soup.find_all('script')[3] and recovering JSON from it. When I read its .text, the result is an empty string: ''.

So my question is: why is that? Ideally I would like to end up with:

    A02BC01 Losec and associated names (referral)
    ...
    V09IA05 NeoSpect (withdrawn)
    V09IA09 Lymphoseek
    V09IB03 Indimacis 125 (withdrawn)
    ...

Alin Stelian · Accepted Answer · 2020-09-21 12:09:52Z

First, get the script's text through .string. Then do some string processing: take everything after 'dataSet = ' and cut it at the first ';' that follows, which leaves a clean JSON array (this assumes no ';' occurs inside the data itself). Finally, parse the array and print the fields you need from each element.

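    import json

    # .string gives the script tag's single text child, i.e. its source code
    data = soup.find_all("script")[3].string
    # keep what lies between 'dataSet = ' and the first ';' after it
    dataJson = data.split('dataSet = ')[1].split(';')[0]
    jsonArray = json.loads(dataJson)
    for jsonElement in jsonArray:
        print(jsonElement['parent'], end=' ')
        print(jsonElement['text'])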
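The split on ';' stops at the first semicolon it meets, so it would truncate the array if any "text" value ever contained one. A possible alternative, sketched here with only the standard library, is to let json.JSONDecoder.raw_decode parse a single JSON value starting at the opening bracket and ignore whatever follows. The helper name extract_dataset, its var_name parameter, and the sample script are made up for illustration. The sketch also assumes the assignment is written exactly as `dataSet = [`, as in the output shown in the question.

```python
import json


def extract_dataset(script_text, var_name="dataSet"):
    """Return the JSON array assigned to `var_name` in a script's source text."""
    # Locate the assignment, then the opening bracket of the array literal.
    assign_at = script_text.index(var_name + " =")
    array_at = script_text.index("[", assign_at)
    # raw_decode parses one JSON value starting at array_at and ignores
    # whatever comes after it (the closing ';' and any later statements).
    entries, _end = json.JSONDecoder().raw_decode(script_text, array_at)
    return entries


# Hypothetical sample in the same shape as the page's script.
# Note the ';' inside the first "text" value.
sample = '''
// Initialise the dataset.
var dataSet = [
{"id":"X01","parent":"#","text":"X01 - example group; with a semicolon"},
{"id":"h000","parent":"X01","text":"Example product (active)","type":"pl"}
];
'''

for entry in extract_dataset(sample):
    print(entry["parent"], entry["text"])
```

With the real page, script_text would be the value of soup.find_all("script")[3].string from the answer.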

Thank you, this works perfectly! I just missed that I should use .string rather than .text. The latter does not work, but I cannot figure out why. (Comment by Lusian, Sep 21, 2020 at 13:49)
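The output requested in the question lists only the products, each preceded by its ATC code, whereas the loop in the answer prints every entry, classification nodes included. A minimal sketch of that filtering step follows. It assumes products are exactly the entries carrying "type": "pl", which is what the sample output in the question suggests. The helper name product_lines is hypothetical.

```python
def product_lines(entries):
    """Yield 'ATC-code product-name' strings for product entries only."""
    for entry in entries:
        # Assumption: products are the entries tagged "type": "pl";
        # classification nodes (A, A02, A02BC01, ...) carry no "type" key.
        if entry.get("type") == "pl":
            yield f'{entry["parent"]} {entry["text"]}'


# Entries copied from the script output shown in the question.
entries = [
    {"id": "A02BC", "parent": "A02B", "text": "A02BC - Proton pump inhibitors"},
    {"id": "A02BC01", "parent": "A02BC", "text": "A02BC01 - omeprazole"},
    {"id": "ho15861", "parent": "A02BC01",
     "text": "Losec and associated names (referral)", "type": "pl"},
    {"id": "h154", "parent": "V09IA05",
     "text": "NeoSpect (withdrawn)", "type": "pl"},
]

for line in product_lines(entries):
    print(line)
```

The same function can be fed the list returned by json.loads in the answer.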
