Adding BOM to UTF-8 files

0 votes
asked Jun 27, 2010 by stephane

I'm searching (without success) for a script that would work as a batch file and allow me to prepend a UTF-8 text file with a BOM if it doesn't already have one.

Neither the language it is written in (perl, python, c, bash) nor the OS it works on matters to me. I have access to a wide range of computers.

I've found a lot of scripts to do the reverse (strip the BOM), which seems kind of silly to me, as many Windows programs will have trouble reading UTF-8 text files if they don't have a BOM.

Did I miss the obvious? Thanks!

6 Answers

0 votes
answered Jan 27, 2010 by luiscubal

I find it pretty simple. Assuming the file is always UTF-8 (you're not detecting the encoding, you know the encoding):

Read the first three bytes. Compare them to the UTF-8 BOM sequence (Wikipedia says it's 0xEF, 0xBB, 0xBF). If they match, write them to the new file and then copy everything else from the original file to the new file. If they don't, first write the BOM, then the three bytes you read, and only then everything else from the original file.

In C, fopen/fclose/fread/fwrite should be enough.
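This isn't the C version the answer suggests, but a minimal sketch of the same check-and-prepend logic as a shell script (in.txt and out.txt are placeholder names):

#!/usr/bin/env bash
# Compare the first three bytes of the input with the UTF-8 BOM (0xEF 0xBB 0xBF)
BOM="$(printf '\xEF\xBB\xBF')"
if [ "$(head -c 3 in.txt)" = "$BOM" ]
then
        # BOM already present: copy the file unchanged
        cat in.txt > out.txt
else
        # No BOM: write the BOM first, then the whole original file
        { printf '\xEF\xBB\xBF'; cat in.txt; } > out.txt
fi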

0 votes
answered Jun 20, 2010 by steven-r-loomis

I wrote this addbom.sh using the 'file' command and ICU's 'uconv' command.

#!/bin/sh

if [ $# -eq 0 ]
then
        echo usage $0 files ...
        exit 1
fi

for file in "$@"
do
        echo "# Processing: $file" 1>&2
        if [ ! -f "$file" ]
        then
                echo Not a file: "$file" 1>&2
                exit 1
        fi
        TYPE=`file - < "$file" | cut -d: -f2`
        if echo "$TYPE" | grep -q '(with BOM)'
        then
                echo "# $file already has BOM, skipping." 1>&2
        else
                ( mv "${file}" "${file}"~ && uconv -f utf-8 -t utf-8 --add-signature < "${file}~" > "${file}" ) || ( echo Error processing "$file" 1>&2 ; exit 1)
        fi
done

edit: Added quotes around the mv arguments. Thanks @DirkR and glad this script has been so helpful!
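A hypothetical invocation, assuming the script is saved as addbom.sh and made executable (the file names are made up):

./addbom.sh notes.txt chapters/*.txt

Files that already carry a BOM are reported and skipped; the others are rewritten through uconv, with the original kept as a "filename~" backup.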

0 votes
answered Jan 4, 2013 by ccpizza

I thought I wouldn't have to write such a trivial thing myself, but since I also needed to do some charset conversion, here it is:

#!/usr/bin/python
import os
import sys
import codecs

INPUT_ENCODING = 'utf_16_le'          # codecs.open() needs an encoding name, not the codecs.BOM_UTF16_LE byte constant
OUTPUT_ENCODING = 'utf-8-sig'         # is there a constant for this??

if len(sys.argv) == 1:
    print 'Usage:\n\t%s <filename.txt>' % sys.argv[0]
    sys.exit(-1)

output_file = os.path.splitext(os.path.split(sys.argv[1])[-1])[0]
fin = codecs.open(sys.argv[1], 'rb', encoding=INPUT_ENCODING)
fout = codecs.open(output_file + '_utf8bom.txt', 'wb', encoding=OUTPUT_ENCODING)
fout.write(fin.read())
fin.close()
fout.close()

print 'done'

Call it with the original file name only, i.e.:

# utf8bom_add.py myfilename.txt

And if you are converting UTF-8 to UTF-8 (i.e. just adding the BOM), then change INPUT_ENCODING to the correct value.

0 votes
answered Jan 23, 2014 by vdragon
0 votes
answered Jan 4, 2016 by franklin-piat

(Answer based on https://stackoverflow.com/a/9815107/1260896 by yingted)

To add BOMs to all the files whose names start with "foo-", you can use sed. sed also has an option to make a backup of each file (shown at the end of this answer).

sed -i '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' foo-*

If you know for sure there is no BOM already, you can simplify the command:

sed -i '1s/^/\xef\xbb\xbf/' foo-*

Make sure the file really is UTF-8, because e.g. the UTF-16 BOM is different (otherwise check "How can I re-add a unicode byte order marker in linux?").
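The backup option mentioned above isn't shown in these commands; with GNU sed you can pass a suffix to -i so each original file is kept before the in-place edit:

# GNU sed: keep each original as "<name>.bak" before adding the BOM
sed -i.bak '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' foo-*

Note that both the \x escapes and the -i suffix syntax are GNU sed behaviour; BSD/macOS sed differs.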

0 votes
answered Jun 24, 2016 by yaron-u

The easiest way I found for this is

#!/usr/bin/env bash

# Add the BOM to the new file
printf '\xEF\xBB\xBF' > with_bom.txt

# Append the content of the source file to the new file
cat source_file.txt >> with_bom.txt

I know it uses an external program (cat)... but it will do the job easily in bash

Tested on OS X, but it should work on Linux as well.

NOTE that it assumes the file doesn't already have a BOM (!)
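If you would rather replace the source file in place instead of creating a new one, a small variation (same placeholder file name) is to write to a temporary file and then move it over the original:

#!/usr/bin/env bash
# Same idea, but source_file.txt is replaced in place via a temporary file
printf '\xEF\xBB\xBF' > source_file.txt.tmp
cat source_file.txt >> source_file.txt.tmp
mv source_file.txt.tmp source_file.txt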
