Using Selenium with Python for Web Crawling

This is a work in progress.
Selenium can drive a real web browser from a script. PyVirtualDisplay provides a virtual display, so the browser can run
without a physical screen. Together, the two can be used to get information from web sites.
Installation on Ubuntu (easy_install is deprecated, so pip is used here):
sudo pip install selenium
Install dependencies for PyVirtualDisplay on Ubuntu:
sudo apt-get install -y xvfb xserver-xephyr
Install PyVirtualDisplay:
sudo pip install pyvirtualdisplay
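Here is an example. The constants (OUT_FILE_STR, LOG_FILE_STR, GROUP_PAGE_HTTP_STR) and the helper functions (scroll_bottom, scrap_page, keep_words_only, split_words) are not shown here.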

import time

from pyvirtualdisplay import Display
from selenium import webdriver


def main():
  # instantiate and start virtual display
  display = Display(visible=0, size=(1024, 768))
  display.start()

  total_scrolls_per_page = 10

  # output file with scraped words will be here
  file_volley_fans_temp = open(OUT_FILE_STR, 'w+')
  log = open(LOG_FILE_STR, 'w')

  # instantiate virtual browser
  browser = webdriver.Firefox()

  # load page
  browser.get(GROUP_PAGE_HTTP_STR)
  time.sleep(2)  # seconds; give the page time to load

  scroll_bottom(browser, total_scrolls_per_page)
  scrap_page(file_volley_fans_temp, log, browser)

  # close files (this also flushes their buffers),
  # then shut down the browser and the virtual display
  log.close()
  file_volley_fans_temp.close()
  browser.quit()
  display.stop()

  # remove "garbage" characters from output file using sed
  keep_words_only()

  # put all words into a single column
  split_words()
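
The example calls helpers such as scroll_bottom and split_words without showing them. The sketch below is one possible way they might look, not the original code. It assumes scroll_bottom only needs to run a JavaScript scroll through WebDriver's execute_script method. The pause_seconds parameter and the words_to_column function are hypothetical names introduced here; words_to_column does the word cleanup in Python with a regular expression instead of sed.

```python
import re
import time


def scroll_bottom(browser, total_scrolls, pause_seconds=1.0):
    """Scroll to the bottom of the page total_scrolls times.

    `browser` is any object exposing WebDriver's execute_script method.
    pause_seconds is an assumed delay that gives lazily loaded content
    time to appear before the next scroll.
    """
    for _ in range(total_scrolls):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)


def words_to_column(text):
    """Keep alphabetic words only and return them one per line.

    Hypothetical Python stand-in for the sed-based keep_words_only
    followed by split_words; digits, punctuation and underscores
    are dropped.
    """
    words = re.findall(r"[^\W\d_]+", text)
    return "\n".join(words)
```

With a running browser this might be called as scroll_bottom(browser, total_scrolls_per_page), and the scraped text could then be passed through words_to_column before being written to the output file.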