
I've written a script in Scrapy to grab different names and links from different pages of a website and write those parsed items to a CSV file. When I run my script, I get the expected results and a CSV file filled with data. However, I'm using Python 3.5, and when I use Scrapy's built-in command to write data to a CSV file, I get a CSV file with a blank line in every alternate row. Eventually, I tried the approach below to achieve flawless output (with no blank lines in between), and it now produces a CSV file without the blank-line issue. I hope I did it the right way; however, if there is anything I can or should do to make it more robust, I'd be happy to hear it.

This is my script, which gives me flawless output in a CSV file:

    import scrapy ,csv
    from scrapy.crawler import CrawlerProcess

    class GetInfoSpider(scrapy.Spider):
        name = "infrarail"
        start_urls= ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65,70)]

        def __init__(self):
            self.infile = open("output.csv","w",newline="")

        def parse(self, response):
            for q in response.css("article.contentslim"):
                name = q.css("h1::text").extract_first()
                link = q.css("p a::attr(href)").extract_first()
                yield {'Name':name,'Link':link}

                writer = csv.writer(self.infile)
                writer.writerow([name,link])

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(GetInfoSpider)
    c.start()

Btw, I used CrawlerProcess() to be able to run my spider from the Sublime Text editor.

  • Welcome to Code Review! That's quite a well-written question, especially for a new user. Well done. – Mast, Jun 30, 2018 at 8:00

3 Answers


I'd like to mention that there is a dedicated way of producing output files in Scrapy: item pipelines. So, in order to do it right, you should write your own pipeline (or modify a standard one via subclassing).

Also, you do not close the file once you're done, and you keep it open most of the time. Both problems are handled nicely by pipelines.
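For example, here is a minimal pipeline sketch (the class name and the hard-coded output file and field names are my assumptions, not code from the question): Scrapy calls the open_spider/close_spider hooks for you, so the file is opened and closed exactly once, and process_item receives every item the spider yields.

    import csv

    class CsvWriterPipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.file = open("output.csv", "w", newline="")
            self.writer = csv.writer(self.file)

        def close_spider(self, spider):
            # called once when the spider finishes
            self.file.close()

        def process_item(self, item, spider):
            # item is the dict yielded from parse()
            self.writer.writerow([item.get('Name'), item.get('Link')])
            return item

To enable it, you would list the pipeline in the ITEM_PIPELINES setting, e.g. by adding 'ITEM_PIPELINES': {'__main__.CsvWriterPipeline': 300} to the dict you already pass to CrawlerProcess.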

UPD: Well, you asked for a better way, and there it is. However, if that is not acceptable for some hard-to-explain reasons (which is understandable), here are some other approaches to make it better:

  1. Don't leave the file open. There is a method, __del__(), which is called when the spider object is destroyed; add code that closes the file before that happens.

  2. Another option is to store only the filename in a variable and open/close the file each time you write to it.

  3. Another option is to use a NoSQL database, which does not need to be opened/closed, and to get the output file from it after scraping is done.

  4. If you only have a few values to scrape, you can store them in an instance variable and then export them before the __del__() method runs (see the sketch after this list).
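A rough sketch of option 4, reusing the URLs and selectors from the question (the spider name here is made up): it collects the parsed rows in an instance attribute and writes them all out at the end. Note that I use Scrapy's closed() hook rather than __del__(), since __del__() is not guaranteed to run at a predictable time.

    import csv
    import scrapy

    class CollectingSpider(scrapy.Spider):
        name = "infrarail_collecting"  # hypothetical name
        start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65, 70)]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.rows = []  # scraped values stay in memory until the spider finishes

        def parse(self, response):
            for q in response.css("article.contentslim"):
                name = q.css("h1::text").extract_first()
                link = q.css("p a::attr(href)").extract_first()
                self.rows.append([name, link])
                yield {'Name': name, 'Link': link}

        def closed(self, reason):
            # called once when the spider is done; write everything out in one go
            with open("output.csv", "w", newline="") as f:
                csv.writer(f).writerows(self.rows)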

All of the ways above are NOT welcomed by the actual developer community and may lead to serious problems in the future. Use them carefully. Sometimes it's easier (in the long run) to read and understand how things really should be done.

Maybe this is exactly such a case?

  • I think you put those lines in the answer section wrongly, whereas they should be in a comment. If it were about item pipelines, I would not have made the title of my post use this customized keyword. Thanks. – SIM, Jun 29, 2018 at 21:01
  • Actually, I cannot comment on others' posts yet. – Jun 29, 2018 at 21:04
  • @asmitu Is there a reason you went for a customized approach? As in, why is the suggested approach not acceptable for you? – Mast, Jun 30, 2018 at 8:04
  • Check out this link to be sure as to why the __del__() method should be avoided. – SIM, Jun 30, 2018 at 14:44

You should ensure that the file is closed. In addition, you should avoid creating a new writer object on every loop iteration; using the with statement, you can open the file and create the writer once per response:

    import csv
    # import os  # only needed for the commented-out alternative below
    import scrapy

    class GetInfoSpider(scrapy.Spider):
        name = "infrarail"
        start_urls = ['http://www.infrarail.com/2018/exhibitor-profile/?e={}'.format(page) for page in range(65, 70)]
        output = "output.csv"

        def __init__(self):
            # empty output file
            open(self.output, "w").close()
            # alternative:
            # if os.path.isfile(self.output):
            #     os.remove(self.output)

        def parse(self, response):
            with open(self.output, "a", newline="") as f:
                writer = csv.writer(f)
                for q in response.css("article.contentslim"):
                    name = q.css("h1::text").extract_first()
                    link = q.css("p a::attr(href)").extract_first()
                    writer.writerow([name, link])
                    yield {'Name': name, 'Link': link}

Note that I also added some spaces after commas to improve readability, in accordance with Python's official style guide, PEP8.

It also recommends importing from only one module per line (so while from random import random, randint is fine, import scrapy, csv is not).

Also note that each item is only written to the file when the next one is being requested, since a generator pauses after the yield. That means that if you, for example, itertools.islice it, your last item won't be written to the file. Therefore I swapped those two lines.
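A small standalone illustration of that point (nothing Scrapy-specific; the function name is made up): the statement placed after the yield never runs for the last item if the consumer stops requesting items early.

    import itertools

    def rows():
        for i in range(3):
            yield i
            print("wrote row", i)  # side effect placed after the yield

    # prints "wrote row 0" and "wrote row 1" only: the generator is never resumed
    # after yielding the third item, so the last "write" is lost
    list(itertools.islice(rows(), 3))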

  • @asmitu That will still leave the file open. In addition, if you ever have more than one GetInfoSpider object, they will share the same file object, because the file is actually opened at the time of the class definition, not at instance creation. See my updated answer for two ways to use append mode and still make sure that the file is overwritten on each new run. – Graipher, Jun 30, 2018 at 11:36

You should opt for the closed() method, as I've done below. This method is called automatically once your spider is closed; it provides a shortcut to signals.connect() for the spider_closed signal.

    import csv
    import scrapy

    class InfraRailSpider(scrapy.Spider):
        name = "infrarail"
        start_urls = ['https://www.infrarail.com/2020/english/exhibitor-list/2018/']

        def __init__(self):
            self.outfile = open("output.csv", "w", newline="")
            self.writer = csv.writer(self.outfile)
            self.writer.writerow(['title'])
            print("***"*20, "opened")

        def closed(self, reason):
            # called automatically when the spider closes
            self.outfile.close()
            print("***"*20, "closed")

        def parse(self, response):
            for item in response.css('#exhibitor_list > [class^="e"]'):
                name = item.css('p.basic > b::text').get()
                self.writer.writerow([name])
                yield {'name': name}
