RSS category feeds

RSS site feeds

More

Grab text or source from HTML pages PDF Print E-mail
 
import win32com.client
from time import sleep
 
def download_url_with_ie(url):
    """
    Given a url, it starts IE, loads the page, gets the HTML.
    Works only in Win32 with Python Win32com extensions enabled.
    Needs IE. Why? If you’re forced to work with Brain-dead 
    closed sourceapplications that go to tremendous length to deliver
    output specific to browsers; and the application has no interface
    other than a browser; and you want get data into a CSV or XML
    for further analysis;
    Note: IE internally formats all HTML to stupid mixed-case, no-
    quotes-around-attributes syntax. So if you are planning to parse
    the data, make sure you study the output of this function rather
    than looking at View-source alone.
    """
 
    #if you are calling this function in a loop, it is more
    #efficient to open ie once at the beginning, outside this
    #function and then use the same instance to go to url’s
    ie = win32com.client.Dispatch("InternetExplorer.Application")
 
    ie.Visible = 1 #make this 0, if you want to hide IE window
    #IE started
    ie.Navigate(url)
    #it takes a little while for page to load. sometimes takes 5 sec.
    if ie.Busy:
        sleep(5)
    #now, we got the page loaded and DOM is filled up
    #so get the text
    text = ie.Document.body.innerHTML
    #text is in unicode, so get it into a string
    text = unicode(text)
    text = text.encode('ascii','ignore')
    #save some memory by quitting IE! **very important** 
    ie.Quit()
    #return text
    print text
download_url_with_ie('http://www.goermezer.de')
Last Updated ( Tuesday, 28 April 2009 )
 
< Prev   Next >

Feedback

Advertisement

Comments

  • I have run your code on my own machine, but I just got the last paragraph of the... More...
  • Hello Gabriel, It seem that the linked site moved to github.com/.../weboutlook (... More...
  • There is no script to download, could you provide the right link More...
  • Interesting but doesn't really help me. I do not see what the 'xxxxx' in the Ses... More...
  • You can see more python editor comparison: sparkledge.com/.../ (http://sparkledg... More...

Login Form






Lost Password?

My prefered Python IDE

My prefered Python editor is Pyscripter from MMExperts. It is not only an editor. Pyscripter is a full Python IDE including (remote) debugging, a class browser, and all other nice helpers which a full featured IDE needs.

Do you have a script for me ?

Do you have an interesting Python script which does some really cool thing on Windows ? Please post them to this site. It`s very simple - simply copy&paste it to this form. No login is requiered.

Hint: For syntax highlighting and correct Python intendation place your code between html tags <pre> and </pre>.

My prefered web framework

My prefered web framework for developing web applications is Django. Django calls itself The web framework for perfectionists with deadlines. It is a really fast, scalable and (thanks Python) the sexiest web framework of the world.