How can I safely encode a string in Java to use as a filename?

0 votes
asked Jul 26, 2009 by steve-mcleod

I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder is reliable for this purpose.

9 Answers

0 votes
answered Jan 26, 2009 by burkhard

You could remove the invalid chars ( '/', '\', '?', '*') and then use it.

0 votes
answered Jul 26, 2009 by cletus

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

String name = s.replaceAll("\\W+", "");

This replaces every run of characters that aren't letters, digits or underscores with nothing. Alternatively you could replace them with another character (like an underscore).

The problem is that if this is a shared directory then you don't want file name collision. Even if user storage areas are segregated by user you may end up with a colliding filename just by filtering out bad characters. The name a user put in is often useful if they ever want to download it too.

For this reason I tend to allow the user to enter what they want, store the filename based on a scheme of my own choosing (eg userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want and you don't compromise security or wipe out other files.
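A minimal sketch of that scheme, assuming an in-memory map stands in for the database table. The class name `UploadRegistry`, its method names and the exact `userId_fileId` format are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: maps generated on-disk names to the names users typed.
final class UploadRegistry {
    // Stand-in for a database table: storage name -> original user-supplied name.
    private final Map<String, String> originalNames = new HashMap<>();
    private final AtomicLong nextFileId = new AtomicLong(1);

    // Returns a name made only of digits and '_' (assuming a non-negative
    // userId), so it never depends on what the user typed.
    String register(long userId, String userSuppliedName) {
        String storageName = userId + "_" + nextFileId.getAndIncrement();
        originalNames.put(storageName, userSuppliedName);
        return storageName;
    }

    // Looks up the name to show back to the user (null if unknown).
    String displayName(String storageName) {
        return originalNames.get(storageName);
    }
}
```

The user's string is only ever stored as data, so two uploads with the same name get different storage names.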

You can also hash the file (eg MD5 hash) but then you can't list the files the user put in (not with a meaningful name anyway).
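As a rough sketch of the hashing option using only the JDK, assuming you hash the user-supplied name (rather than the file contents) and want a hex string. `HashedFileName` and `md5Hex` are illustrative names:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedFileName {
    // Returns the MD5 digest of the name as 32 lowercase hex characters.
    static String md5Hex(String name) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(name.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the result is always 32 chars long.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5.
            throw new IllegalStateException(e);
        }
    }
}
```

The result contains only `0-9` and `a-f`, so it is safe as a filename, but it cannot be turned back into the original string.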

EDIT: Fixed regex for Java

0 votes
answered Jul 26, 2009 by stephen-c

If you want the result to resemble the original file, SHA-1 or any other hashing scheme is not the answer. Instead you want something like this.

char fileSep = '/'; // ... or do this portably.
char escape = '%'; // ... or some other legal char.
String s = ...
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    char ch = s.charAt(i);
    if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
        || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
        || ch == escape) {
        sb.append(escape);
        if (ch < 0x10) {
            sb.append('0');
        }
        sb.append(Integer.toHexString(ch));
    } else {
        sb.append(ch);
    }
}
File currentFile = new File(System.getProperty("user.home"), sb.toString());
PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.
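To show the reversibility, here is a sketch of a matching decoder. It assumes every escape is the escape char followed by exactly two hex digits, which holds under the 8-bit assumption above. `EscapeDecoder` is an illustrative name:

```java
final class EscapeDecoder {
    // Reverses an encoding in which each escaped char is written as
    // escape + two hex digits (assumes the original chars were below 0x100).
    static String decode(String encoded, char escape) {
        StringBuilder sb = new StringBuilder(encoded.length());
        for (int i = 0; i < encoded.length(); i++) {
            char ch = encoded.charAt(i);
            if (ch != escape) {
                sb.append(ch);
                continue;
            }
            if (i + 2 >= encoded.length()) {
                throw new IllegalArgumentException("Truncated escape at index " + i);
            }
            int hi = Character.digit(encoded.charAt(i + 1), 16);
            int lo = Character.digit(encoded.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Bad escape at index " + i);
            }
            sb.append((char) (hi * 16 + lo));
            i += 2; // skip the two hex digits just consumed
        }
        return sb.toString();
    }
}
```

For example, `EscapeDecoder.decode("a%2fb", '%')` gives back `a/b`.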

URLEncoder has the disadvantage that it encodes a whole lot of legal file name characters.
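A quick illustration of what URLEncoder does to a few sample names (the names are made up). Note that, per its documentation, it leaves '.', '-', '*' and '_' unchanged, so ".." and "*" come through untouched:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncoderDemo {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("a/b c.txt"));          // a%2Fb+c.txt
        System.out.println(encode("my report (v2).txt")); // my+report+%28v2%29.txt
        System.out.println(encode(".."));                 // ..
        System.out.println(encode("*"));                  // *
    }
}
```

So spaces and parentheses, which are fine in file names on most systems, get rewritten, while a name like ".." would still need separate handling.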

If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with escape sequences.

0 votes
answered Jan 23, 2013 by david-fischer

Simply use:

IOHelper.toFileSystemSafeName("Iblabla/blabla");

This turns the string into "Iblablablabla".

0 votes
answered Jan 26, 2013 by hd1

Pick your poison from the options presented by commons-codec, example:

// DigestUtils.sha(String) returns a byte[]; sha1Hex returns a hex String.
String safeFileName = DigestUtils.sha1Hex(filename);

0 votes
answered Jul 23, 2014 by sharkalley

For those looking for a general solution, these might be common criteria:

  • The filename should resemble the string.
  • The encoding should be reversible where possible.
  • The probability of collisions should be minimized.

To achieve this we can use regex to match illegal characters, percent-encode them, then constrain the length of the encoded string.

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-]");

private static final int MAX_LENGTH = 127;

public static String escapeStringAsFilename(String in){

    StringBuffer sb = new StringBuffer();

    // Apply the regex.
    Matcher m = PATTERN.matcher(in);

    while (m.find()) {

        // Convert matched character to percent-encoded, zero-padded to two
        // hex digits so that escapes such as %0A can be decoded again.
        String replacement = String.format("%%%02X", (int) m.group().charAt(0));

        m.appendReplacement(sb,replacement);
    }
    m.appendTail(sb);

    String encoded = sb.toString();

    // Truncate the string.
    int end = Math.min(encoded.length(),MAX_LENGTH);
    return encoded.substring(0,end);
}

Patterns

The pattern above is based on a conservative subset of allowed characters in the POSIX spec.

If you want to allow the dot character, use:

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-\\.]");

Just be wary of strings like "." and ".."
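One possible guard, sketched under the assumption that you are happy to percent-encode the leading dot of those two names only. `DotGuard` and `guardDotNames` are illustrative names:

```java
final class DotGuard {
    // Escapes the first dot of "." and ".." so the result never names the
    // current or parent directory; all other names pass through unchanged.
    static String guardDotNames(String encoded) {
        if (encoded.equals(".") || encoded.equals("..")) {
            return "%2E" + encoded.substring(1);
        }
        return encoded;
    }
}
```

Since `%2E` decodes back to a dot, this stays reversible with the decoding step described further down.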

If you want to avoid collisions on case insensitive filesystems, you'll need to escape capitals:

private static final Pattern PATTERN = Pattern.compile("[^a-z0-9_\\-]");

Or escape lower case letters:

private static final Pattern PATTERN = Pattern.compile("[^A-Z0-9_\\-]");

Rather than using a whitelist, you may choose to blacklist reserved characters for your specific filesystem. For example, this regex suits FAT32 filesystems:

private static final Pattern PATTERN = Pattern.compile("[%\\.\"\\*/:<>\\?\\\\\\|\\+,\\.;=\\[\\]]");

Length

On Android, 127 characters is the safe limit. Many filesystems allow 255 characters.

If you prefer to retain the tail, rather than the head of your string, use:

// Truncate the string.
int start = Math.max(0,encoded.length()-MAX_LENGTH);
return encoded.substring(start,encoded.length());

Decoding

To convert the filename back to the original string, use:

URLDecoder.decode(filename, "UTF-8");
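One caveat: URLDecoder reads each %XX as a byte in the given charset. A round trip for non-ASCII input therefore only works if the escapes are UTF-8 bytes rather than char values. Here is a sketch of an encoder written under that assumption (`RoundTrip`, `escapeUtf8` and `unescape` are illustrative names, and the input is assumed to be well-formed UTF-16):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class RoundTrip {
    // Percent-encodes every code point outside [A-Za-z0-9_-] as its UTF-8 bytes.
    static String escapeUtf8(String in) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean safe = (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
                    || (cp >= '0' && cp <= '9') || cp == '_' || cp == '-';
            if (safe) {
                sb.append((char) cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            }
        }
        return sb.toString();
    }

    // Reverses escapeUtf8 (as long as the name was not truncated).
    static String unescape(String filename) {
        try {
            return URLDecoder.decode(filename, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

With this, `escapeUtf8("a/b \u00e9")` yields `a%2Fb%20%C3%A9`, and `unescape` returns the original string.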

Limitations

Because longer strings are truncated, there is the possibility of a name collision when encoding, or corruption when decoding.

0 votes
answered Jan 13, 2015 by voho

This is probably not the most efficient way, but it shows how to do it using Java 8 streams:

private static String sanitizeFileName(String name) {
    return name
            .chars()
            .mapToObj(i -> (char) i)
            .map(c -> Character.isWhitespace(c) ? '_' : c)
            .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
            .map(String::valueOf)
            .collect(Collectors.joining());
}

The solution could be improved by creating a custom collector which uses a StringBuilder, so you do not have to convert each light-weight character to a heavy-weight string.
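A sketch of that improvement. Instead of a full custom `Collector`, it uses the three-argument `IntStream.collect` with a StringBuilder, and works on code points rather than chars. The class name is illustrative:

```java
final class StreamSanitizer {
    static String sanitizeFileName(String name) {
        return name
                .codePoints()
                // Whitespace becomes '_' before filtering, as in the version above.
                .map(c -> Character.isWhitespace(c) ? '_' : c)
                .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
                // Accumulate straight into a StringBuilder; no per-character Strings.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }
}
```

For example, `sanitizeFileName("my file/name.txt")` returns `my_filenametxt`.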

0 votes
answered Jul 1, 2015 by bullywiiplaza

Try using the following regex which replaces every invalid file name character with a space:

public static String toValidFileName(String input)
{
    return input.replaceAll("[:\\\\/*\"?|<>']", " ");
}

0 votes
answered Jul 10, 2016 by jonascz

Here's what I use:

public String sanitizeFilename(String inputName) {
    return inputName.replaceAll("[^a-zA-Z0-9-_\\.]", "_");
}

This replaces every character which is not a letter, number, hyphen, underscore or dot with an underscore, using regex.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but it is safe and the resulting directory/file names are guaranteed to work everywhere. In my case, the result is not shown to the user, and is thus not a problem, but you may want to alter the regex to be more permissive.

It's worth noting that another problem I encountered was that I would sometimes get identical names (since it's based on user input), so you should be aware of that, since you can't have multiple directories/files with the same name in a single directory. Also, you may need to truncate or otherwise shorten the resulting string, since it may exceed the 255 character limit some systems have.
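One way to handle both issues, as a sketch: truncate to the limit and append a numeric suffix until the name is free. The set of taken names stands in for a directory listing, and `UniqueNames`, `uniqueName` and the `_2`, `_3` suffix format are all illustrative choices:

```java
import java.util.Set;

final class UniqueNames {
    static final int MAX_LENGTH = 255;

    // Returns base (truncated to MAX_LENGTH), or base with "_2", "_3", ...
    // appended, whichever is the first candidate not present in 'taken'.
    static String uniqueName(String base, Set<String> taken) {
        String candidate = truncate(base, MAX_LENGTH);
        for (int n = 2; taken.contains(candidate); n++) {
            String suffix = "_" + n;
            // Leave room for the suffix so the result still fits the limit.
            candidate = truncate(base, MAX_LENGTH - suffix.length()) + suffix;
        }
        return candidate;
    }

    private static String truncate(String s, int max) {
        return s.length() <= max ? s : s.substring(0, max);
    }
}
```

If several threads or processes can create files at once, checking a set like this is not enough on its own, since another writer could take the name between the check and the file creation.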
