I am attempting to work with a very large dataset that has some non-standard characters in it. I need to use unicode, as per the job specs, but I am baffled. (And quite possibly doing it all wrong.)
I open the CSV using:
ncesReader = csv.reader(open('geocoded_output.csv', 'rb'), delimiter='\t', quotechar='"')
Then, I attempt to encode it with:
name=school_name.encode('utf-8'), street=row[9].encode('utf-8'), city=row[10].encode('utf-8'), state=row[11].encode('utf-8'), zip5=row[12], zip4=row[13], county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
I’m encoding everything except the lat and lng because those need to be sent out to an API. When I run the program to parse the dataset into what I can use, I get the following Traceback.
Traceback (most recent call last):
  File "push_into_db.py", line 80, in <module>
    main()
  File "push_into_db.py", line 74, in main
    district_map = buildDistrictSchoolMap()
  File "push_into_db.py", line 32, in buildDistrictSchoolMap
    county=row[25].encode('utf-8'), lat=row[22], lng=row[23])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
I think I should tell you that I’m using Python 2.7.2, and this is part of an app built on Django 1.4. I’ve read several posts on this topic, but none of them seem to apply directly. Any help will be greatly appreciated.
You might also want to know that some of the non-standard characters causing the issue are Ñ and possibly É.
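The traceback is consistent with calling .encode('utf-8') on a byte string: Python 2 first decodes the bytes with the default ascii codec, and that step fails on byte 0xd1. Below is a minimal sketch of the usual remedy, written for Python 3 so that it runs today. It assumes the file is Latin-1 encoded (where 0xd1 is Ñ), which the question does not confirm; the sample row and the read_rows helper are hypothetical.

```python
import csv
import io

# Hypothetical sample row, stored as Latin-1 bytes (an assumption about the file).
# In Latin-1 the letter Ñ is the single byte 0xd1, matching the traceback.
raw = "CAÑON CITY\tCO\n".encode("latin-1")

# Decoding with the wrong codec reproduces the failure from the traceback.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    bad_position, bad_byte = exc.start, raw[exc.start]


def read_rows(data, encoding):
    """Decode the bytes once, up front, then parse the text as tab-separated CSV."""
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.reader(text, delimiter="\t", quotechar='"'))


rows = read_rows(raw, "latin-1")

# Encode to UTF-8 only at the boundary where bytes are required.
city_utf8 = rows[0][0].encode("utf-8")
```

In Python 2.7 csv.reader yields byte strings, so the equivalent step there would be to call .decode(...) with the file's real encoding on each field rather than .encode('utf-8').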
Occasionally our server throws the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200e' in position 13: ordinal not in range(128)
The message says that the ordinal is not in range(128).
Cause: this is a text-encoding problem in Python, often met when handling Chinese text, and triggered here by the character \u200e.
\u200e is the left-to-right mark, a control character. It is not a space: it is a completely invisible, zero-width character that we normally never see on web pages.
It belongs to the Unicode format control characters, such as the right-to-left mark (\u200F), the left-to-right mark (\u200E), the zero-width joiner (\u200D) and the zero-width non-joiner (\u200C). These control the visual display of text, which matters for rendering some non-English text correctly.
Solution: add the following block of statements at the top of the Python file that contains the code (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
If adding the block above breaks the print function in Python, replace it with the following block:
import sys  # this only binds the name sys; reload() below does the actual reloading
stdi, stdo, stde = sys.stdin, sys.stdout, sys.stderr
reload(sys)  # setdefaultencoding is deleted from sys after interpreter start-up, so sys must be reloaded once
sys.stdin, sys.stdout, sys.stderr = stdi, stdo, stde  # restore the streams that reload() reset
sys.setdefaultencoding('utf-8')  # same call as in the first block
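As an alternative to changing the default encoding, invisible format characters such as U+200E can be removed before the text is encoded. The sketch below is written for Python 3 and is only one possible approach; the strip_format_chars helper and the sample string are hypothetical.

```python
import unicodedata


def strip_format_chars(text):
    """Drop characters in Unicode category Cf (format), such as U+200E."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical input with a trailing left-to-right mark.
sample = "price: 10\u200e"
cleaned = strip_format_chars(sample)

# Encoding to ASCII succeeds once the invisible mark is gone.
encoded = cleaned.encode("ascii")
```

Category Cf also covers the zero-width joiner and non-joiner, so stripping it can change how some scripts render; it is safest on text that is expected to be plain ASCII anyway.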
Conversion errors
When converting between strings and bytes, it is very important to know exactly which encoding is in use, and what each encoding is able to represent.
For example, the ASCII encoding cannot convert Cyrillic to bytes:
In [32]: hi_unicode = 'привет'

In [33]: hi_unicode.encode('ascii')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-33-ec69c9fd2dae> in <module>()
----> 1 hi_unicode.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
Similarly, if the string 'привет' has been converted to bytes and you try to convert it back to a string using ascii, you also get an error:
In [34]: hi_unicode = 'привет'

In [35]: hi_bytes = hi_unicode.encode('utf-8')

In [36]: hi_bytes.decode('ascii')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-36-aa0ada5e44e9> in <module>()
----> 1 hi_bytes.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
Another variant of the error occurs when different encodings are used for the two conversions:
In [37]: de_hi_unicode = 'grüezi'

In [38]: utf_16 = de_hi_unicode.encode('utf-16')

In [39]: utf_16.decode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-39-4b4c731e69e4> in <module>()
----> 1 utf_16.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Getting an error is a good thing: it states plainly what the problem is.
It is worse when the result looks like this:
In [40]: hi_unicode = 'привет'

In [41]: hi_bytes = hi_unicode.encode('utf-8')

In [42]: hi_bytes
Out[42]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

In [43]: hi_bytes.decode('utf-16')
Out[43]: '뿐胑룐닐뗐苑'
Error handling
The encode and decode methods have error-handling modes that specify how to react to a conversion error.
The errors parameter in encode
By default, encode uses the strict mode: an encoding error raises a UnicodeError exception. Examples of this behaviour were shown above.
Instead of this mode you can use replace, which substitutes a question mark for the character:
In [44]: de_hi_unicode = 'grüezi'

In [45]: de_hi_unicode.encode('ascii', 'replace')
Out[45]: b'gr?ezi'
Or namereplace, which substitutes the character's name:
In [46]: de_hi_unicode = 'grüezi'

In [47]: de_hi_unicode.encode('ascii', 'namereplace')
Out[47]: b'gr\\N{LATIN SMALL LETTER U WITH DIAERESIS}ezi'
You can also ignore characters that cannot be encoded altogether:
In [48]: de_hi_unicode = 'grüezi'

In [49]: de_hi_unicode.encode('ascii', 'ignore')
Out[49]: b'grezi'
The errors parameter in decode
The decode method also uses the strict mode by default and raises a UnicodeDecodeError exception.
If you change the mode to ignore, then, as with encode, the bytes that cannot be decoded are simply dropped:
In [50]: de_hi_unicode = 'grüezi'

In [51]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [52]: de_hi_utf8
Out[52]: b'gr\xc3\xbcezi'

In [53]: de_hi_utf8.decode('ascii', 'ignore')
Out[53]: 'grezi'
The replace mode substitutes the Unicode replacement character (U+FFFD) for each byte that cannot be decoded:
In [54]: de_hi_unicode = 'grüezi'

In [55]: de_hi_utf8 = de_hi_unicode.encode('utf-8')

In [56]: de_hi_utf8.decode('ascii', 'replace')
Out[56]: 'gr��ezi'
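Beyond strict, replace, namereplace and ignore, the standard library offers handlers that keep the lost information visible. The short sketch below (Python 3.5 or later) shows xmlcharrefreplace and backslashreplace on the same 'grüezi' string; it is an illustration, not a recommendation of one mode over another.

```python
text = "grüezi"

# Encoding: handlers that keep information about the character that did not fit.
as_xml = text.encode("ascii", "xmlcharrefreplace")    # XML numeric character reference
as_escape = text.encode("ascii", "backslashreplace")  # Python-style escape sequence

# Decoding: backslashreplace keeps the offending bytes visible as escapes.
utf8_bytes = text.encode("utf-8")
decoded = utf8_bytes.decode("ascii", "backslashreplace")

print(as_xml)     # b'gr&#252;ezi'
print(as_escape)  # b'gr\\xfcezi'
print(decoded)    # gr\xc3\xbcezi
```

These modes are handy for logging, because the original character or byte can still be recovered from the output.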
Overview
Example errors:
Traceback (most recent call last):
  File "unicode_ex.py", line 3, in <module>
    print str(a) # this throws an exception
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
This issue happens when Python can’t correctly work with a string variable.
Strings can contain any sequence of bytes, but when Python is asked to work with the string, it may decide that the string contains invalid bytes.
In these situations, an error is often thrown that mentions ordinal not in range, codec can't encode character, or codec can't decode byte.
Here’s a bit of code that may reproduce the error in Python 2:
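a = '\xa1'  # a Python 2 byte string holding the single byte 0xA1
print(a + ' <= problem')
unicode(a)  # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1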
Initial Steps Overview
- Check Python version
- Determine interpreting codec and character
- Determine desired codec
Detailed Steps
1) Check Python version
The Python version you are using is significant.
You can determine the Python version by running:
python --version
or, if you have access to the running code, by logging it:
import sys
print(sys.version)
The major number (2 or 3) is the one you are interested in.
You are expected to be using Python 2.
2) Determine interpreting codec and character
Get this from the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
In this case, the codec is ascii and the character is the one with hex value A1.
What is happening here is that Python is trying to interpret a string, and expects the bytes in that string to be legal for the format it is expecting. In this case, it is expecting a string composed of ASCII bytes. These bytes are in the range 0-127 (i.e. 7 bits). The hex byte A1 is 161 in decimal, and is therefore out of range.
When Python comes to interpret this string in a context that requires a codec (for example, when calling the unicode function), it tries to encode or decode it with that codec, and can hit this problem.
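The codec and the offending character can also be read from the exception object instead of being parsed out of the message. The following is a minimal sketch for Python 3; the describe_unicode_error helper is hypothetical, while encoding, object, start, end and reason are standard attributes of UnicodeEncodeError and UnicodeDecodeError.

```python
def describe_unicode_error(exc):
    """Summarise a UnicodeEncodeError or UnicodeDecodeError as a dict."""
    offending = exc.object[exc.start:exc.end]  # the characters (or bytes) that failed
    return {
        "codec": exc.encoding,
        "position": exc.start,
        "offending": offending,
        "reason": exc.reason,
    }


try:
    "\xa1".encode("ascii")  # U+00A1 is 161 in decimal, outside 0-127
except UnicodeEncodeError as exc:
    info = describe_unicode_error(exc)

print(info)
```

Logging this dictionary next to the failing input is one way to capture what step 2 asks for without reading the traceback by hand.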
3) Determine desired codec
You need to figure out how the bytes should be interpreted.
Most often in everyday use (e.g. web scraping or document ingestion), this is utf-8.
Once you have determined the desired codec, solution A may help you.
Solutions List
A) Decode the string
Solutions Detail
A) Decode the string
If you have a string s that you want to interpret as utf-8 data, you can try:
s = s.decode('utf-8')
to decode its bytes into a unicode string with the appropriate codec.
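In Python 3 the same idea applies to bytes objects, since str values are already text. A small sketch, assuming the data really is UTF-8; the to_text helper is hypothetical and only illustrates decoding at the point where bytes enter the program.

```python
def to_text(value, encoding="utf-8"):
    """Return text: decode bytes with the given codec, pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


decoded = to_text(b"\xc2\xa1Hola")  # UTF-8 bytes for the text "¡Hola"
unchanged = to_text("¡Hola")        # already text, returned as-is
```

If the bytes are not valid for the chosen codec, decode still raises UnicodeDecodeError, which is usually preferable to guessing silently.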
Further Information
Owner: Ian Miell
Several errors can arise when converting from one data type to another, because some values cannot be converted. One of the most common errors during these conversions is UnicodeEncodeError, which occurs when text containing a character that the target encoding cannot represent is encoded to bytes. This article shows how to fix UnicodeEncodeError in Python.
Why does the UnicodeEncodeError error arise?
An error occurs when an attempt is made to encode characters outside the representable range of an encoding scheme, because code points above the scheme's upper bound do not exist in it (for example, ASCII covers only 128 code points, 0 to 127). Any code point greater than 127 therefore produces an error when encoded as ASCII. To solve the issue, the string needs to be encoded with an encoding that can represent that code point. UTF-8 (Unicode Transformation Format, 8-bit), UTF-16, UTF-32, ASCII, and others are examples of frequently used encodings. UTF-8 often fixes this problem.
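To see which characters would trip the ASCII codec before encoding, one can scan the text for code points above 127. This is a minimal sketch; the find_non_ascii helper is a hypothetical name, and the sample string ends in a non-breaking space (U+00A0).

```python
def find_non_ascii(text):
    """Return (index, character, code point) for every character ASCII cannot encode."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


sample = "geeksforgeeks1234567\xa0"
problems = find_non_ascii(sample)

print(problems)  # [(20, '\xa0', 160)]
```

An empty result means the text can be encoded as ASCII without error.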
For demonstration, the same error would be reproduced and then fixed:
Python3
a = 'geeksforgeeks1234567\xa0'.encode("ASCII")
print(a)
Output:
Traceback (most recent call last):
  File "C:/Users/test.py", line 1, in <module>
    a = 'geeksforgeeks1234567\xa0'.encode("ASCII")
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 20: ordinal not in range(128)
How to solve this UnicodeEncodeError?
The error is the same as the one in hand. It arose because an attempt was made to represent a character that lies outside the range of the ASCII encoding: ASCII can only represent code points from 0 to 127, but \xa0 is 160, which is outside that range. To rectify this error, we have to encode the text in a scheme that allows more code points than ASCII. UTF-8 serves this purpose.
Python3
a = 'geeksforgeeks1234567\xa0'.encode("UTF-8")
print(a)
Output:
b'geeksforgeeks1234567\xc2\xa0'
The program ran this time because the string was encoded with a standard that can encode code points above 127. As a result, the character \xa0 (code point 160) was converted to \xc2\xa0, a two-byte representation.
Similarly, the error UnicodeEncodeError could be resolved by encoding to a format such as UTF-16/32, etc.
Python3
a = 'geeksforgeeks1234567\xa0'.encode("UTF-16")
print(a, end="\n\n\n")
a = 'geeksforgeeks1234567\xa0'.encode("UTF-32")
print(a)
Output:
b'\xff\xfeg\x00e\x00e\x00k\x00s\x00f\x00o\x00r\x00g\x00e\x00e\x00k\x00s\x001\x002\x003\x004\x005\x006\x007\x00\xa0\x00'
b'\xff\xfe\x00\x00g\x00\x00\x00e\x00\x00\x00e\x00\x00\x00k\x00\x00\x00s\x00\x00\x00f\x00\x00\x00o\x00\x00\x00r\x00\x00\x00g\x00\x00\x00e\x00\x00\x00e\x00\x00\x00k\x00\x00\x00s\x00\x00\x001\x00\x00\x002\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x007\x00\x00\x00\xa0\x00\x00\x00'
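The outputs above differ mainly in size. As a rough illustration, the sketch below encodes the same string with each codec, checks that decoding returns the original text, and records the byte counts; the utf-16 and utf-32 codecs prepend a byte-order mark, which is included in the totals.

```python
text = "geeksforgeeks1234567\xa0"  # 21 characters, the last one is U+00A0

sizes = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    encoded = text.encode(codec)
    # Round trip: decoding with the same codec must give back the original text.
    assert encoded.decode(codec) == text
    sizes[codec] = len(encoded)

print(sizes)  # {'utf-8': 22, 'utf-16': 44, 'utf-32': 88}
```

For mostly-ASCII text like this, UTF-8 is the most compact of the three, which is one reason it is the usual default.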
Last Updated: 23 Jan, 2023