1
\$\begingroup\$

I wrote a Python script for web scraping a website. Please review my code, suggest any changes, and make me aware of my blunders/mistakes. I wrote almost the same script for several other websites too, so can you please suggest a way to combine all of those scripts into one, so that I end up with one single [merged] output file?

import requests
import json
import pandas as pd

def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = #URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')

geturl()
\$\endgroup\$
2 Comments

  • \$\begingroup\$Are you able to share any real URLs, or at least markup from the real pages?\$\endgroup\$ – Commented Mar 21, 2022 at 18:15
  • \$\begingroup\$This has nothing to do with functional programming, web scraping or BeautifulSoup.\$\endgroup\$ – Commented Mar 21, 2022 at 21:58

2 Answers

3
\$\begingroup\$
  • When using Requests to fetch JSON content, it's usually more convenient to use the response.json() method instead of manually passing response.content to json.loads, as in the snippet below.
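    A minimal sketch (the URL here is a hypothetical placeholder):

    import requests

    response = requests.get("https://example.com/api/items")  # hypothetical endpoint
    data = response.json()  # parses the JSON body in one step
    # ...equivalent to, but tidier than:
    # data = json.loads(response.content)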

  • As far as I can tell, you're using Pandas just to turn a list of dictionaries into a JSON string and write it to a file - the structure doesn't change along the way. That seems like a strange reason to pull in Pandas. It seems to me you could get the same result in a simpler way by doing something like

    with open("data.json", "w") as output_file: json.dump(main, output_file) 
  • That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger geturl function.

  • Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.

  • Right now, the last line of your script calls geturl not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an if __name__ == '__main__' block.
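    A minimal sketch of that guard, using your geturl function:

    def geturl():
        ...

    if __name__ == '__main__':
        # Runs only when the script is executed directly,
        # not when the file is imported as a module.
        geturl()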

  • Creating data as an empty dict and adding elements one by one is a fine approach. However, for dicts like this one, which are small, and have consistent structures and simple contents, I often find it neater to just make it all at once with a single dict literal. But that's a matter of taste, and your current approach works just fine.
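    For example, using the field names from your script, the per-item dict can be built in a single literal:

    data = {
        'id': id,
        'Title': item['name'],
        'Price': item['price'],
        'Detai Page': baseurl + item['slug'],
        'Image': item['thumb_image'],
    }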

  • Re-using the name of a built-in function (such as id) for another variable will work, but is generally not considered great practice - I'd suggest renaming the id variable for that reason.

Putting that all together, you might end up with something like:

import requests
import json

def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detai Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main

def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)

if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [ ... ]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
\$\endgroup\$
    -3
    \$\begingroup\$

    When you work with the requests library and JSON transformations, you should always wrap that code in a try/except block.

    r = requests.get(url)
    data = json.loads(r.content)
    print(r.status_code)

    This should be rewritten as:

    try:
        response = requests.get(url)
        if response.ok:
            data = response.json()
    except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
        print(exc)

    These are the minimal changes you need when working with APIs through the requests library and JSON transformations.

    Hope it helps

    \$\endgroup\$
    1 Comment

    • \$\begingroup\$This is... not right. Universally catching and printing exceptions is the opposite of good practice.\$\endgroup\$ – Commented Mar 22, 2022 at 13:05
