Archive for the ‘code’ Category

I need to concatenate a set of PDFs, I will take you through my standard issue Python development approach when doing something I’ve never done before in Python.

My first instinct was to google for pyPDF. Success! So, fore go reading any doc and just give the old easy_install a try.

$ sudo easy_install pypdf

Another success! Ok, a couple help() calls later and I am ready to go. The end result is surprisingly small and seems to run fast enough even for PDFs with 50+ pages.

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input,output):
    [output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)]

output = PdfFileWriter()
append_pdf(PdfFileReader(file("sample.pdf","rb")),output)
append_pdf(PdfFileReader(file("sample.pdf","rb")),output)

output.write(file("combined.pdf","wb"))

So I am playing around in Firefox with XMLHttpRequest. Looking in to a way to facilate a server update to a client without have to refresh the page or use Javascript timers. So the long-live HTTP request seems the way to go.

This little app will at most have 20-30 connections at once, so I am not worried about the open connection per client. The data it calculates is rather large and intensive to gather, so I paired it with the cache decorator snippet found on ActiveState and used in Expert Python Programming. This example feeds a cached datetime string. The caching lets different client receive the same data during the cache process. There is some lag between the updates since they all set their sleep at different points, there may be away around this though.

So here is my basic index.html.

<body>
<em>This will push data from the server to you every 5 seconds .. enjoy!</em>
<ul id="container"></ul>

<script>
var div = document.getElementById('container');
function handleContent(event)
{
  var xml_packet = event.target.responseXML.documentElement;
  div.innerHTML += '<li>' + xml_packet.childNodes[0].data + '</li>';
}
(function () {
    var xrequest = new XMLHttpRequest();
    xrequest.multipart = true;
    xrequest.open("GET","/server/index",false);
    xrequest.onload = handleContent;
    xrequest.send(null);
})();

</script>
</body>

Now the controller code itself.

class ServerController(BaseController):
    def index(self):
        response.headers['Content-type'] = 'multipart/x-mixed-replace;boundary=test'
        return data_stream()

def data_stream(stream=True):
    yield datetime_string()

    while stream:
        time.sleep(5)
        yield datetime_string()

@memorize(duration=15)
def datetime_string():
    content = '--test\nContent-type: application/xml\n\n'
    content += '<?xml version=\'1.0\' encoding=\'ISO-8859-1\'?>\n'
    content += '<message>' + str(datetime.datetime.now()) + '</message>\n'
    content += '--test\n'

    return content

Also the decorator code for good measure.

cache = {}

def is_old(entry, duration):
    return time.time() - entry['time'] > duration

def compute_key(function, args, kw):
    key = pickle.dumps((function.func_name, args, kw))
    return hashlib.sha1(key).hexdigest()

def memorize(duration=10):
    def _memorize(function):
        def __memorize(*args, **kw):
            key = compute_key(function, args, kw)

            if (key in cache and not is_old(cache[key], duration)):
                return cache[key]['value']
            result = function(*args, **kw)
            cache[key] = {'value': result, 'time':time.time()}
            return result
        return __memorize
    return _memorize

Full working demo will be available in the HG repos shortly.

Note to self, when handling CGI file uploads on a Windows machine, you need the following boiler plate to properly handler binary files.

try: # Windows needs stdio set for binary mode.
    import msvcrt
    msvcrt.setmode (0, os.O_BINARY) # stdin  = 0
    msvcrt.setmode (1, os.O_BINARY) # stdout = 1
except ImportError:
    pass

Also, if you’re handling very large files and don’t want to eat up all your memory saving them using the copyfileobj method, you can use a generator to buffer read and write the file.

def buffer(f, sz=1024):
    while True:
        chunk = f.read(sz)
        if not chunk: break
        yield chunk
# then use it like this ...
for chunk in buffer(fp.file)

Looking around for a quick example of doing an HTTP file upload using Python and pycurl and I couldn’t find one. I honestly didn’t look all that hard, having used libcurl C and C++ bindings in the past, I figured I could take a guess at it and get it right. Well, I did. So here it is for your enjoyment.

import pycurl as curl
def main():
    data = [
        ('description', 'File for testing upload via pycurl'),
        ('upload', (curl.FORM_FILE, '/home/wayne/test.txt')),
    ]
    req = curl.Curl()
    req.setopt(curl.URL, 'http://pieceofpy.com/cgi-bin/upload.py')
    req.setopt(curl.HTTPPOST, data)
    req.perform()
    req.close()

if __name__ == '__main__':
    main()

Simple as that. As for handling the file upload in the Python script, take a look at Code: Saving in memory file to disk

Okay, I discovered this today when looking to increase the speed at which uploaded documents were saved to disk. Now, I can’t explain the inner workings of why it is fast(er), all I know is that with the exact same form upload test ran 100 times with a 25MB file over a 100Mbit/s network this method was on average a whole 2.3 seconds faster over traditional methods of write, writelines, etc..

How does this extend to real-world production usage over external networks, well no idea. Though I plan to find out. So you all will be the first to know as soon as I find some guinea pig site that does enough file uploads to implement this on.

# Minus some boiler plate for validity and variable setup.
import os
import shutil
memory_file = request.POST['upload']
disk_file = open(os.path.join(save_folder, save_name),'w')
shutil.copyfileobj(memory_file.file, disk_file)
disk_file.close()