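쿠...sal: [Computer][Python] A summary of strings in Python 3

Strings in Python 3

Let's summarize strings in Python 3. Python 2 strings are more complicated, and I don't think there is much need to cover them these days, so see ref. 1 and judge for yourself.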

Using Python 3 strings in Python 2

The reason for covering only Python 3 strings here is that Python 2 can use strings the Python 3 way as well. Just put the line below at the very top of the module. Doing so also makes a later conversion to Python 3 easier.

from __future__ import unicode_literals    # at top of module

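The default string representation: unicode

In Python 3, every string (str) is unicode by default, and the default encoding used when converting it (for example by string.encode) is utf-8. That does not mean a Python 3 string "is utf-8", though. Think of it simply as unicode text: a sequence of code points with no byte encoding attached. That way there is no confusion later when calling string.encode and the like.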
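
To make the "unicode, not utf-8" point concrete, here is a minimal sketch (the variable names are only illustrative). It shows that one character is one code point in a string, while the number of bytes depends on the encoding chosen at encode time.

```python
s = '가'

n_chars = len(s)                         # number of code points, not bytes
code_point = ord(s)                      # the unicode code point of '가'
utf8_len = len(s.encode('utf-8'))        # bytes needed in utf-8
utf16_len = len(s.encode('utf-16-le'))   # bytes needed in utf-16 (no BOM)

print(n_chars, hex(code_point), utf8_len, utf16_len)  # 1 0xac00 3 2
```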

byte object, b'가'

The other string-like type in Python 3 is the b'가' form. The b prefix marks it as a byte object. A byte object is essentially a char array in C/C++.

>>> import struct
>>> '가'.encode()
b'\xea\xb0\x80'
>>> struct.unpack('<bbb', '가'.encode())  # read the 3 bytes as signed chars
(-22, -80, -128)
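
As a small sketch of the "char array" view (names here are illustrative only): iterating over or indexing a byte object gives plain integers in the range 0 to 255, while slicing gives another byte object.

```python
data = '가'.encode('utf-8')

as_ints = list(data)     # iterating over bytes yields ints
first = data[0]          # indexing yields an int, not a 1-byte string
first_slice = data[:1]   # slicing yields bytes

print(as_ints)      # [234, 176, 128]
print(first)        # 234
print(first_slice)  # b'\xea'
```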

So an expression like the one below is not valid.

>>> b'가'
SyntaxError: bytes can only contain ASCII literal characters.

b'가' is an attempt to put '가' into a single char, but in utf-8 '가' takes 3 bytes.
To actually store '가' in a byte object, write the bytes out as below.

>>> b'\xea\xb0\x80'
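
Besides writing the escapes by hand, there are a few other ways to arrive at the same byte object. The following is only a sketch with illustrative variable names; all four values should compare equal.

```python
from_literal = b'\xea\xb0\x80'            # escapes written by hand
from_encode = '가'.encode('utf-8')         # encode a string
from_constructor = bytes('가', 'utf-8')    # bytes(str, encoding)
from_ints = bytes([0xEA, 0xB0, 0x80])     # bytes from a list of ints

print(from_literal == from_encode == from_constructor == from_ints)  # True
```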

string.encode --> byte object

Calling encode on a Python 3 string returns a byte object for that string. The encoding is passed as a parameter, and the returned bytes follow that encoding. (This is why it helps to remember that a string is unicode.)

>>> '가'.encode()  # the default encoding is utf-8
b'\xea\xb0\x80'
>>> '가'.encode('utf-16')
b'\xff\xfe\x00\xac'
>>> # "mbcs" exists only on Windows, so the ascii codec is used here to show the error handlers.
>>> '\x80abc'.encode("ascii", "strict")
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\x80' in position 0:
  ordinal not in range(128)
>>> '\x80abc'.encode("ascii", "replace")
b'?abc'
>>> '\x80abc'.encode("ascii", "backslashreplace")
b'\\x80abc'
>>> '\x80abc'.encode("ascii", "ignore")
b'abc'

For the replace / backslashreplace / ignore error handlers, see ref. 3.
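
One detail of the '가'.encode('utf-16') result above: the first two bytes are a byte order mark (BOM), not part of the character. A rough sketch, with illustrative variable names, that separates the two:

```python
import codecs

le = '가'.encode('utf-16-le')     # little-endian, no BOM
be = '가'.encode('utf-16-be')     # big-endian, no BOM
with_bom = '가'.encode('utf-16')  # BOM + platform byte order

# The plain 'utf-16' codec prefixes a BOM so decode() can detect the byte order.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(le)               # b'\x00\xac'
print(be)               # b'\xac\x00'
print(starts_with_bom)  # True
```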

byte object.decode --> string

Conversely, use decode to turn a byte object back into a string. (Note that a string has no decode method.)

>>> b'\xea\xb0\x80'.decode()
'가'
>>> b'\xff\xfe\x00\xac'.decode('utf-16')
'가'

>>> b'\x80abc'.decode("utf-8", "strict")  
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
  invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
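
When the encoding of incoming bytes is not known in advance, one option is to try a few candidates in strict mode and fall back to an error handler only at the end. The helper below is a hypothetical sketch (decode_with_fallback and its default candidate list are assumptions for illustration, not something from the references).

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp949')):
    """Return (text, encoding) for the first encoding that decodes data."""
    for enc in encodings:
        try:
            return data.decode(enc), enc   # strict mode: raises on bad bytes
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the undecodable bytes visible instead of failing.
    return data.decode(encodings[0], 'backslashreplace'), encodings[0]


print(decode_with_fallback('가'.encode('utf-8')))   # ('가', 'utf-8')
print(decode_with_fallback('가'.encode('cp949')))   # ('가', 'cp949')
```

The order of the candidates matters, since the first encoding that decodes without error wins even if it is not the one the data was written in.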

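file write

When writing to a file opened in binary mode ('wb'), the string has to be converted to a byte object first, because a Python 3 string is unicode text rather than bytes.

>>> with open('nonlat.txt', 'wb') as f:
...     f.write(nonlat.encode())  # nonlat is a str holding non-Latin text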
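
An alternative is to open the file in text mode with an explicit encoding and let Python do the encoding. A minimal sketch, assuming a sample value for nonlat and a temporary directory so that nothing in the working directory is touched:

```python
import os
import tempfile

nonlat = '가나다'  # hypothetical sample text

path = os.path.join(tempfile.mkdtemp(), 'nonlat.txt')

# Text mode: write the str directly; the file object encodes it as utf-8.
with open(path, 'w', encoding='utf-8') as f:
    f.write(nonlat)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(raw == nonlat.encode('utf-8'))  # True
```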

json.dump

Pass ensure_ascii=False so that the file contains readable Hangul instead of \uXXXX escapes.

import codecs
import json
with codecs.open(configPath, 'w', encoding='utf-8') as fp:
    json.dump(lastConfig, fp, ensure_ascii=False)
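
The difference is easy to see with json.dumps, which takes the same ensure_ascii option. A small sketch with made-up sample data:

```python
import json

config = {'name': '한글'}  # hypothetical sample data

escaped = json.dumps(config)                       # default: ensure_ascii=True
readable = json.dumps(config, ensure_ascii=False)  # keep non-ASCII as-is

print(escaped)   # {"name": "\ud55c\uae00"}
print(readable)  # {"name": "한글"}
```

Both forms load back to the same data; the option only changes how the text looks in the file.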

datetime

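On Windows with Python 3, strftime() can raise UnicodeEncodeError when the format string contains non-ASCII characters, as reported in this issue:

On Windows + Py3, time.strftime() and Unicode characters will raise UnicodeEncodeError · Issue #2102 · sphinx-doc/sphinx · GitHub

A workaround is to escape the format string with unicode-escape before the call and unescape the result afterwards.

(sdate.strftime('%m월 %d일(%a) %I:%M %p'.encode('unicode-escape').decode())).encode().decode('unicode-escape')  # e.g. 11월 01일(Wed) 03:00 PM; the %a and %p text depends on the locale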
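
Another way around the same problem, offered only as a sketch and not as the fix used in the linked issue, is to keep the strftime format pure ASCII and fill in the Korean text afterwards with str.format. The sample date and the placeholder names are assumptions for illustration.

```python
from datetime import datetime

sdate = datetime(2017, 11, 1, 15, 0)  # hypothetical sample value

# strftime only sees ASCII; the Korean text is filled in afterwards.
text = sdate.strftime('%m{month} %d{day} %H:%M').format(month='월', day='일')

print(text)  # 11월 01일 15:00
```

This only works when the formatted output itself contains no literal braces.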
