PyQuery over BeautifulSoup

I was trying to scrape some search data the other day and ran into malformed-tag issues with BeautifulSoup, which is what I’ve used in the past. I didn’t need much “power-scraping,” though; I just wanted to collect text from div tags across a few pages of search results. I also had the impression that BeautifulSoup isn’t maintained much anymore, so I started looking for something different.

Then I found PyQuery, which is new to me and works great for simple scraping. The point of PyQuery is to let you make jQuery-style queries on XML and HTML documents, and I found it handy for getting what I wanted quickly. I wanted to pull text, so I did:

>>> from pyquery import PyQuery as pq
>>> d = pq(url=myURL)  # myURL is the address of the page to scrape
>>> myText = d('div.sampleClass')  # every div with class "sampleClass"
>>> myText.eq(0).text()  # eq() takes an integer index, not a string
'Hello this is sample text 0!'
>>> myText.eq(1).text()
'Hello this is sample text 1!'

So if you’re trying to pull text across search pages like I was, and the content is structured consistently enough, you can loop through every URL, then loop through every “sampleClass” element on each page, saving the text along the way.

I’m new to this (and new to Python in general, to be quite honest), so I found it pretty cool! Please let me know if you have any other tools or PyQuery-specific tricks that could be interesting.

This post was written by:

Azeem Ansar - Azeem writes at Azeem Ansar - Tech | Data | Thoughts - azeem.fm

You can follow me on twitter at @azeemansar, or feel free to email me directly.

