How to debug Scrapy scripts using PyCharm

Fatime Selimi
6 min read · Jun 30, 2018


Recently I started using a powerful scraping framework called Scrapy, and I love it so far. For this tutorial I am using a script that scrapes all job listings from Craigslist (New York). The script can be found on my GitHub profile.
However, while scraping the data I wanted, phone numbers specifically, I was getting some weird results, i.e., a lot of false positives. I was intrigued and wanted to know what was happening under the hood, so the idea of using a debugger came to mind.
The first thing I did, of course, was to google how to do that. I went through numerous links in the search results but never found anything that really satisfied my needs.

So I had to work out how to configure the debugger myself, and I wanted to share with other folks what I learned the hard but exciting way.

Tools I used:

Python 3.6.5

Scrapy 1.5

PyCharm Community Edition

Since we are going to debug a Scrapy spider, we need to know where the Scrapy executable is located on our machine. This is quite easy: run this simple command in the terminal and save the result, since we will need it later.

which scrapy
Command to find the location of Scrapy Framework
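
If you would rather stay in Python (for example on a system without the which command), the lookup can be sketched roughly as below. It relies on shutil.which from the standard library, which searches the PATH in a similar way; the helper name find_scrapy_executable is made up for this illustration.

```python
import shutil


def find_scrapy_executable(name="scrapy"):
    """Return the full path of the executable found on PATH, or None if missing."""
    # shutil.which searches the directories listed in PATH for an
    # executable file with the given name.
    return shutil.which(name)


path = find_scrapy_executable()
if path is None:
    print("scrapy was not found on PATH; is the right environment active?")
else:
    print(path)
```

Run it with the same interpreter and environment you use for the project, otherwise it may report a different Scrapy installation or none at all.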

First of all, we put a breakpoint on one or more lines of our code. Putting a breakpoint in PyCharm is pretty easy: just click in the gutter next to the line of code where you want it.

Put a breakpoint in some line of code

With the breakpoint in place, we can continue to the next step, which is configuring the debugger.

To configure the debugger, open the Run menu in the PyCharm toolbar and click Debug.

After clicking the Debug option, a small window pops up. It contains the ‘Edit Configurations…’ option, which we click in order to configure the debugger.

Clicking ‘Edit Configurations…’ opens the Run/Debug Configurations window. In the left corner of this window there are several buttons, including the plus sign (+) that we are looking for. Clicking it shows a drop-down menu with options such as ‘Compound, Jupyter Notebook, Python, Python docs, Python tests, Tox’. We choose the ‘Python’ option.

After we choose the Python option, the configuration window opens and we can finally start configuring our debugger.

First, there is a field for the name of the configuration. In this tutorial we leave the default (Unnamed).

On the ‘Configuration’ tab of this window there are four steps to go through:

1. Copy the result of running the ‘which scrapy’ command in the terminal (the same command we ran at the very beginning) and paste it into the ‘Script:’ field. In my case the result is:

/usr/local/bin/scrapy

2. Enter the following in the ‘Script parameters:’ field (the -o option writes the scraped items to the named file):

crawl <name of spider> -o any_name.csv

3. Select the same Python interpreter that you are using in your project. In my case it is Python 3.6.5.
You can see all available Python interpreters by clicking the dropdown button next to the corresponding field.

4. Choose the correct working directory, i.e., the directory where your project and the spider(s) that you want to debug are located.
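
The same configuration can also be written as a small Python script, which you then debug like any other script. This is an alternative to the four steps above, not what the screenshots in this tutorial show. The sketch assumes that Scrapy is installed in the selected interpreter and that the script is run from the project directory (the one containing scrapy.cfg). The file name run_spider.py, both helper names, and the spider name jobs are placeholders.

```python
# run_spider.py: hypothetical helper script placed in the project directory.


def build_crawl_args(spider_name, output_file="any_name.csv"):
    """Return the argv list equivalent to: scrapy crawl <spider_name> -o <output_file>."""
    return ["scrapy", "crawl", spider_name, "-o", output_file]


def run_crawl(spider_name, output_file="any_name.csv"):
    """Start the crawl in the current process so PyCharm breakpoints are hit."""
    # Imported here so this module can be loaded even where Scrapy is missing.
    from scrapy.cmdline import execute

    # execute() follows the same code path as the scrapy command-line tool
    # and exits the process when the crawl finishes.
    execute(build_crawl_args(spider_name, output_file))


# To use it, replace "jobs" with the name of your spider and uncomment:
# run_crawl("jobs")
```

With the last line uncommented, right-clicking the file in PyCharm and choosing Debug should start the spider under the debugger with the same arguments as in step 2.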

Once all of this is configured, we are good to go. Just click the Debug button at the bottom of the current window (Debug - Unnamed) and the debugger will start.

Now that the debugger is configured and running, we can step through the code and observe what is happening under the hood.

The buttons shown below, framed by the red box, have the following functions:

  1. Show Execution Point: Click this button to highlight the current execution point in the editor and show the corresponding stack frame in the Frames pane.
  2. Step Over: Click this button to execute the program until the next line in the current method or file, skipping the methods referenced at the current execution point (if any). If the current line is the last one in the method, execution steps to the line executed right after this method.
  3. Step Into: Click this button to have the debugger step into the method called at the current execution point.
  4. Step Into My Code: Click this button to skip stepping into library sources and keep focused on your own code.
  5. Force Step Into: Click this button to step into a method even if the regular Step Into would skip it.
  6. Step Out: Click this button to have the debugger step out of the current method, to the line executed right after it.
  7. Run to Cursor: Click this button to resume program execution and pause when the execution point reaches the line at the current cursor location in the editor. No breakpoint is required.
  8. Evaluate Expression: Click this button to open the Evaluate Expression dialog.
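
To get a feel for the stepping buttons before trying them on a real spider, a tiny practice script such as the one below can help. It is not part of the Craigslist spider; the function names and the sample values are invented for illustration.

```python
def normalize_title(raw_title):
    """Trim whitespace and lowercase a job title."""
    cleaned = raw_title.strip()
    return cleaned.lower()


def build_item(raw_title, url):
    """Build a small dict similar in spirit to a scraped item."""
    # Put a breakpoint on the next line, then try the buttons:
    #   Step Over runs normalize_title() in one go and pauses on the "item = ..." line.
    #   Step Into pauses on the first line inside normalize_title().
    title = normalize_title(raw_title)
    item = {"title": title, "url": url}
    return item


# Step Out, pressed while paused inside normalize_title(), finishes that
# function and brings you back to build_item().
print(build_item("  Senior Python Developer ", "https://example.com/job/1"))
```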

At the bottom of the PyCharm editor, we can see the Debug window.
The ‘Frames’ pane shows how the code was executed, i.e., the chain of class and function calls that led to the part of the code we are trying to debug.
There is also a ‘Variables’ pane, which we can expand to inspect the current values.

We can use the ‘Evaluate Expression’ button, the last of the buttons listed above, to see how a particular part of the code behaves.

In this case, if we type ‘response.body’ and click ‘Evaluate’, we get the body of the response for the URL currently being processed, and we can click the ‘View’ button on the left to see the result.

We can also customize breakpoints. A breakpoint can carry a condition, so that execution pauses at that line only when the condition is met.

Right-clicking the breakpoint opens a window with a ‘Condition:’ field where we can enter the condition. Click ‘Done’, and a tiny question mark will appear next to that breakpoint.
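
As an illustration of what a useful condition might look like, here is a minimal sketch related to the phone-number false positives mentioned at the start. The regular expression and the function are assumptions made for this example, not the code from my spider. The condition itself is an ordinary Python expression that is evaluated using the variables available at the breakpoint line.

```python
import re

# Deliberately loose pattern, assumed for illustration: it accepts 10 digits
# with optional separators, so bare 10-digit IDs also match (false positives).
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def find_phone_candidates(text):
    """Return every substring of text that looks like a phone number."""
    candidates = []
    for match in PHONE_PATTERN.finditer(text):
        candidate = match.group(0)
        # Put a breakpoint on the next line with the condition:
        #     candidate.isdigit()
        # The debugger then pauses only for matches without any separators,
        # which are the likely false positives.
        candidates.append(candidate)
    return candidates


print(find_phone_candidates("Call (212) 555-0147, posting id: 6612345678"))
```

With that condition, the debugger would skip the well-formed number and pause only on the bare 10-digit posting id.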

Conclusion

As you can see, configuring a debugger in PyCharm is pretty easy and very useful. Nevertheless, the easiest things are sometimes the most neglected, which is why I wanted to contribute to the already large body of knowledge on scraping and help others, especially newcomers to programming.

Hope you enjoyed it, and found it helpful in your work.
