Stripping non printable characters from a string in python

0 votes
asked Sep 18, 2008 by vinko-vrsalovic

I use to run

$s =~ s/[^[:print:]]//g;

on Perl to get rid of non printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

7 Answers

0 votes
answered Sep 18, 2008 by vinko-vrsalovic

The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?

0 votes
answered Sep 18, 2008 by william-keller

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)
0 votes
answered Sep 18, 2008 by kirk-strauser

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))
0 votes
answered Sep 18, 2008 by ants-aasma

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0,32) + range(127,160)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
0 votes
answered Sep 18, 2008 by ber

You could try setting up a filter using the unicodedata.category() function:

printable = Set('Lu', 'Ll', ...)
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See the Unicode database character properties for the available categories

0 votes
answered Sep 14, 2014 by shawnrad

In Python 3,

def filter_nonprintable(text):
    import string
    # Get the difference of all ASCII characters from the set of printable characters
    nonprintable = set([chr(i) for i in range(128)]).difference(string.printable)
    # Use translate to remove all non-printable characters
    return text.translate({ord(character):None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

0 votes
answered Sep 15, 2017 by knowingpark

To remove 'whitespace',

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))
Welcome to Q&A, where you can ask questions and receive answers from other members of the community.
Website Online Counter

...