Search in a UTF-16 encoded file
I got a CSV file from a service and wanted to search for a word in it, but I hit an error when reading its lines in Python:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
But I could open the file in Microsoft Excel, and it looked fine there.
Eventually I found that the file starts with a signature, the two bytes \xFF\xFE. After some Googling, I learned that this is a BOM (byte order mark), similar to the UTF-8 BOM (EF BB BF). It means the file is encoded in UTF-16, specifically UTF-16-LE (little-endian).
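To make the BOM check concrete, here is a minimal sketch that guesses a codec from the first bytes of a file. The helper name sniff_encoding and the BOMS table are made up for this example; the BOM constants come from Python's standard codecs module. It only recognizes files that actually start with a BOM, and it would misread a UTF-16-LE file whose first character is U+0000 as UTF-32.

```python
import codecs

# Known BOM signatures mapped to a codec that strips the BOM while decoding.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# starts with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return a codec name guessed from the leading bytes of a file."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default


print(sniff_encoding(b"\xff\xfeL\x00i\x00"))  # utf-16
print(sniff_encoding(b"\xef\xbb\xbfLive"))    # utf-8-sig
print(sniff_encoding(b"Live"))                # utf-8
```

In practice the first four bytes of the file, read in 'rb' mode, are enough to pass in as head.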
There are a couple of ways to process this file, and they are similar.
Open with an encoding
with open('the-file.csv', encoding='utf-16-le') as f:
    lines = f.readlines()
Decode it yourself
with open('the-file.csv', 'rb') as f:
    data = f.read().decode('utf-16-le')
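One detail worth seeing side by side is how the codec name affects the BOM. This is a small sketch, assuming the data is plain UTF-16-LE with a leading BOM; the bytes are built in memory as a stand-in for the file, so nothing here is read from the real CSV. With 'utf-16-le' the byte order is fixed and the BOM survives as the character U+FEFF, while 'utf-16' reads the byte order from the BOM and drops it.

```python
import codecs

# Sample bytes standing in for the file: a UTF-16-LE BOM followed by text.
raw = codecs.BOM_UTF16_LE + "Live Basic Data\t\n".encode("utf-16-le")

with_le = raw.decode("utf-16-le")  # byte order fixed, BOM kept as U+FEFF
with_bom = raw.decode("utf-16")    # byte order taken from the BOM, BOM dropped

print(repr(with_le))   # '\ufeffLive Basic Data\t\n'
print(repr(with_bom))  # 'Live Basic Data\t\n'
```

The same applies to open(): passing encoding='utf-16' lets Python consume the BOM instead of returning it as the first character of the first line.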
After reading the file into memory, the result was still not what I expected, even after I encoded the Unicode string to UTF-8.
\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00 \x00D\x00a\x00t\x00a\x00\t\x00\n
This is a Unicode string, since we have already decoded it as UTF-16, and in Python 3 every str is Unicode.
Live Basic Data
I found an answer on Stack Overflow that explained it: UTF-16 uses at least two bytes to encode each character, so if the content is ASCII, every character is followed by an extra \x00.
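The two-bytes-per-character point is easy to check in the interpreter. This sketch encodes a short ASCII word as UTF-16-LE and prints the raw bytes. The last step, reading those bytes with a one-byte-per-character codec such as Latin-1, is only one possible way for the \x00 values to end up inside a str; it is an assumption for illustration, not a diagnosis of my file.

```python
text = "Live"

# Each ASCII character becomes two bytes: the character itself, then 0x00.
encoded = text.encode("utf-16-le")
print(encoded)                   # b'L\x00i\x00v\x00e\x00'
print(len(text), len(encoded))   # 4 8

# Treating the same bytes as a one-byte-per-character encoding keeps
# every 0x00 as a separate NUL character in the resulting string.
as_latin1 = encoded.decode("latin-1")
print(repr(as_latin1))           # 'L\x00i\x00v\x00e\x00'
```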
After I removed all the \x00 characters, the text looked fine.
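Putting the cleanup and the search together, a minimal sketch could look like this. The function names clean and search are made up for the example. The first sample line is the string shown earlier in this post; the second is invented just so the demo has a non-matching line.

```python
def clean(line: str) -> str:
    """Drop a leading BOM and any NUL characters left in a decoded line."""
    return line.lstrip("\ufeff").replace("\x00", "")


def search(lines, word):
    """Yield (line_number, cleaned_line) for every line containing word."""
    for number, line in enumerate(lines, start=1):
        cleaned = clean(line)
        if word in cleaned:
            yield number, cleaned.rstrip("\n")


sample = [
    "\ufeffL\x00i\x00v\x00e\x00 \x00B\x00a\x00s\x00i\x00c\x00 \x00D\x00a\x00t\x00a\x00\t\x00\n",
    "\x00o\x00t\x00h\x00e\x00r\x00\n",  # invented second line for the demo
]

print(list(search(sample, "Basic")))  # [(1, 'Live Basic Data\t')]
```

Because search is a generator, it can also take the open file object directly instead of a list, so a large file does not need to be loaded with readlines() first.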