The example: typing `a` and pressing Enter prints the hex escapes, while `print a` prints the Chinese character itself. How is that done?
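The original sample session is not reproduced above, but its behavior can be re-created in Python 3, where a Py2 str corresponds to bytes. This is a sketch following the article's setting (variable name `a`, a cp936/GBK console), not the original snippet:

```python
# -*- coding: utf-8 -*-
# A Python 3 re-creation of the Py2 session: a Py2 str is essentially bytes.
a = '汉'.encode('gbk')      # what the literal '汉' held on a cp936 console
print(repr(a))              # the hex escapes that bare `a` + Enter showed
print(a.decode('gbk'))      # the character that `print a` showed
```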
First, note that we are in the interactive environment: whatever is typed is parsed immediately, and at the source this comes down to treating standard input just like a file being read:
- int
- Py_Main(int argc, char **argv)
- {
- ...
- sts = PyRun_AnyFileExFlags(
- fp,
- filename == NULL ? "<stdin>" : filename,
- filename != NULL, &cf) != 0;
- ...
The expression statement then compiles down to the opcode PRINT_EXPR:
- PyObject *
- PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
- {
- ...
- case PRINT_EXPR:
- v = POP();
- w = PySys_GetObject("displayhook");
- if (w == NULL) {
- PyErr_SetString(PyExc_RuntimeError,
- "lost sys.displayhook");
- err = -1;
- x = NULL;
- }
- if (err == 0) {
- x = PyTuple_Pack(1, v);
- if (x == NULL)
- err = -1;
- }
- if (err == 0) {
- w = PyEval_CallObject(w, x);
- Py_XDECREF(w);
- if (w == NULL)
- err = -1;
- }
- Py_DECREF(v);
- Py_XDECREF(x);
- break;
- ...
As we can see, it looks up a function named displayhook in the sys module and uses it for output.
- static PyMethodDef sys_methods[] = {
- ...
- {"displayhook", sys_displayhook, METH_O, displayhook_doc},
- ...
- }
-
- static PyObject *
- sys_displayhook(PyObject *self, PyObject *o)
- {
- ...
- outf = PySys_GetObject("stdout");
- if (outf == NULL) {
- PyErr_SetString(PyExc_RuntimeError, "lost sys.stdout");
- return NULL;
- }
- if (PyFile_WriteObject(o, outf, 0) != 0)
- return NULL;
- ...
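The hook survives into Python 3, so its behavior is easy to observe. A minimal sketch, capturing sys.stdout while calling the default sys.displayhook directly:

```python
import io
import sys

buf = io.StringIO()
saved, sys.stdout = sys.stdout, buf
try:
    sys.displayhook(b'\xba\xba')   # what bare `a` + Enter triggers at the REPL
finally:
    sys.stdout = saved

print(buf.getvalue())              # the repr form plus a trailing newline
```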
Just like writing to a file, the variable o is written to the <stdout> (standard output) file. Note that flags=0 here, and that value is passed all the way down.
The meaning of the flag is whether the final string s should have its quotes stripped, i.e. whether to emit the raw string content without any extra massaging.
- /* Flag bits for printing: */
- #define Py_PRINT_RAW 1 /* No string quotes etc. */
For str and unicode the earlier flow is identical; they part ways inside internal_print, because PyString_Type's tp_print is non-NULL (unicode's is NULL, so it takes the repr route shown later).
We eventually reach string_print. Since flags=0, the request is effectively "don't give me the raw string, dress it up!", hence the pile of escapes:
- static int
- string_print(PyStringObject *op, FILE *fp, int flags)
- {
- ...
- /* figure out which quote to use; single is preferred */
- quote = '\'';
- if (memchr(op->ob_sval, '\'', Py_SIZE(op)) &&
- !memchr(op->ob_sval, '"', Py_SIZE(op)))
- quote = '"';
-
- str_len = Py_SIZE(op);
- Py_BEGIN_ALLOW_THREADS
- fputc(quote, fp);
- for (i = 0; i < str_len; i++) {
- /* Since strings are immutable and the caller should have a
- reference, accessing the interal buffer should not be an issue
- with the GIL released. */
- c = op->ob_sval[i];
- if (c == quote || c == '\\')
- fprintf(fp, "\\%c", c);
- else if (c == '\t')
- fprintf(fp, "\\t");
- else if (c == '\n')
- fprintf(fp, "\\n");
- else if (c == '\r')
- fprintf(fp, "\\r");
- else if (c < ' ' || c >= 0x7f)
- fprintf(fp, "\\x%02x", c & 0xff);
- else
- fputc(c, fp);
- }
- fputc(quote, fp);
- Py_END_ALLOW_THREADS
- return 0;
- }
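The escaping loop can be transcribed into Python to make the logic concrete. This is a toy re-implementation of string_print's quoting path, not CPython code:

```python
def string_repr(data: bytes) -> str:
    """Toy transcription of Py2's string_print repr path."""
    # choose the quote: single preferred; double if data has ' but not "
    quote = '"' if (b"'" in data and b'"' not in data) else "'"
    out = [quote]
    for c in data:                      # c is an int in Python 3
        ch = chr(c)
        if ch == quote or ch == '\\':
            out.append('\\' + ch)
        elif ch == '\t':
            out.append('\\t')
        elif ch == '\n':
            out.append('\\n')
        elif ch == '\r':
            out.append('\\r')
        elif c < 0x20 or c >= 0x7f:
            out.append('\\x%02x' % c)   # the line that produces \xba\xba
        else:
            out.append(ch)
    out.append(quote)
    return ''.join(out)

print(string_repr(b'\xba\xba'))   # '\xba\xba'
```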
Why does printing a show '\xba\xba'? The core line is fprintf(fp, "\\x%02x", c & 0xff);
Because a is a str holding a Chinese character, which occupies 2 bytes in GBK; each byte is printed as two hex digits via %02x, with a \x prefix marking it as hexadecimal.
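Both facts are easy to check from Python 3, using the stdlib gbk codec:

```python
data = '汉'.encode('gbk')                       # U+6C49 in GBK
print(len(data))                                # 2: one character, two bytes
print(''.join('\\x%02x' % b for b in data))     # \xba\xba, as the C loop emits
```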
This is where flags finally takes effect: we go through PyObject_Repr, turning the object op into a string s, which is then printed to fp (i.e. <stdout>) via internal_print.
- static int
- internal_print(PyObject *op, FILE *fp, int flags, int nesting)
- {
- ...
- else if (Py_TYPE(op)->tp_print == NULL) {
- PyObject *s;
- if (flags & Py_PRINT_RAW)
- s = PyObject_Str(op);
- else
- s = PyObject_Repr(op);
- if (s == NULL)
- ret = -1;
- else {
- ret = internal_print(s, fp, Py_PRINT_RAW,
- nesting+1);
- }
- Py_XDECREF(s);
- ...
-
- PyObject_Repr(PyObject *v)
- {
- ...
- if (v == NULL)
- return PyString_FromString("<NULL>");
- else if (Py_TYPE(v)->tp_repr == NULL)
- return PyString_FromFormat("<%s object at %p>",
- Py_TYPE(v)->tp_name, v);
- else {
- PyObject *res;
- res = (*Py_TYPE(v)->tp_repr)(v);
- ...
All of unicode's tricks are in here:
- const Py_ssize_t expandsize = 6;
- ...
- repr = PyString_FromStringAndSize(NULL,
- 2
- + expandsize*size
- + 1);
We can see how the space for the return value repr is carved up:
the leading 2: the opening u'
the trailing 1: the closing '
the middle: each character may expand to up to 6 bytes (expandsize), in the form \uXXXX, where each X is one hex digit, i.e. 4 bits
- if (ch >= 256) {
- *p++ = '\\';
- *p++ = 'u';
- *p++ = hexdigit[(ch >> 12) & 0x000F];
- *p++ = hexdigit[(ch >> 8) & 0x000F];
- *p++ = hexdigit[(ch >> 4) & 0x000F];
- *p++ = hexdigit[ch & 0x000F];
- } // note: the nibbles of ch are emitted high-to-low, so the hex digits read most-significant first
And thus we end up with u'\u6c49'.
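The nibble extraction above can be mirrored in a few lines of Python (a sketch of the C loop, with the same high-to-low digit order):

```python
HEXDIGITS = '0123456789abcdef'

def escape_bmp(ch: int) -> str:
    # \u followed by four hex digits, most-significant nibble first
    return '\\u' + ''.join(HEXDIGITS[(ch >> shift) & 0xF]
                           for shift in (12, 8, 4, 0))

print(escape_bmp(0x6C49))   # \u6c49, i.e. the repr of u'汉' minus the quotes
```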
First, establish that print corresponds to the opcode PRINT_ITEM:
- >>> def foo():
- ... print a, b
- ...
- >>> import dis
- >>> dis.dis(foo)
- 2 0 LOAD_GLOBAL 0 (a)
- 3 PRINT_ITEM
- 4 LOAD_GLOBAL 1 (b)
- 7 PRINT_ITEM
- 8 PRINT_NEWLINE
- 9 LOAD_CONST 0 (None)
- 12 RETURN_VALUE
Don't mistake this for looking up a print function via LOAD_GLOBAL and then CALL_FUNCTION; that only happens if you write it like this:
- >>> def foo():
- ... getattr(__builtins__,'print')('a')
- ...
- >>> dis.dis(foo)
- 2 0 LOAD_GLOBAL 0 (getattr)
- 3 LOAD_GLOBAL 1 (__builtins__)
- 6 LOAD_CONST 1 ('print')
- 9 CALL_FUNCTION 2
- 12 LOAD_CONST 2 ('a')
- 15 CALL_FUNCTION 1
- 18 POP_TOP
- 19 LOAD_CONST 0 (None)
- 22 RETURN_VALUE
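As a side note: in Python 3, print did become a real builtin function, so the LOAD-then-CALL shape is the normal path and PRINT_ITEM is gone (exact opcode names vary across 3.x versions, hence the loose check):

```python
import dis

def foo():
    print('a')

names = [ins.opname for ins in dis.get_instructions(foo)]
print(names)   # a LOAD of print followed by some CALL opcode; no PRINT_ITEM
```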
Back on track: how does PRINT_ITEM print the Chinese character?
- TARGET_NOARG(PRINT_ITEM)
- {
- v = POP();
- if (stream == NULL || stream == Py_None) {
- w = PySys_GetObject("stdout");
- if (w == NULL) {
- PyErr_SetString(PyExc_RuntimeError,
- "lost sys.stdout");
- err = -1;
- }
- }
- /* PyFile_SoftSpace() can exececute arbitrary code
- if sys.stdout is an instance with a __getattr__.
- If __getattr__ raises an exception, w will
- be freed, so we need to prevent that temporarily. */
- Py_XINCREF(w);
-
- // if everything so far went fine, check whether a space needs inserting
- if (w != NULL && PyFile_SoftSpace(w, 0))
- err = PyFile_WriteString(" ", w);
-
- if (err == 0)
- err = PyFile_WriteObject(v, w, Py_PRINT_RAW);
-
- if (err == 0) {
- /* XXX move into writeobject() ? */
- if (PyString_Check(v)) {
- // if the object being printed is a str,
- // conditionally call PyFile_SoftSpace(w, 1);
- ...
- }
- #ifdef Py_USING_UNICODE
- else if (PyUnicode_Check(v)) {
- // if the object being printed is unicode,
- // conditionally call PyFile_SoftSpace(w, 1);
- ...
- }
- #endif
- else
- // in short, it gets called either way
- PyFile_SoftSpace(w, 1);
- }
- Py_XDECREF(w);
- Py_DECREF(v);
- Py_XDECREF(stream);
- stream = NULL;
- if (err == 0) DISPATCH();
- break;
- }
The trailing PyFile_SoftSpace exists to put spaces between print items; the space itself is written by the PyFile_WriteString(" ", w) above.
- /* Interface for the 'soft space' between print items. */
-
- int
- PyFile_SoftSpace(PyObject *f, int newflag)
- {
- ...
- else if (PyFile_Check(f)) {
- oldflag = ((PyFileObject *)f)->f_softspace;
- ((PyFileObject *)f)->f_softspace = newflag;
- }
- ...
- }
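The flag dance can be modeled with a toy Python class (hypothetical names; Python 3 dropped softspace in favor of print's sep parameter):

```python
class SoftSpaceWriter:
    """Toy model of Py2's f_softspace flag on file objects."""
    def __init__(self):
        self.parts = []
        self.softspace = 0          # f_softspace

    def soft_space(self, newflag):  # PyFile_SoftSpace: swap flag, return old
        old, self.softspace = self.softspace, newflag
        return old

    def write_item(self, s):        # one PRINT_ITEM
        if self.soft_space(0):      # pending space from the previous item?
            self.parts.append(' ')
        self.parts.append(s)
        self.soft_space(1)          # request a space before the next item

w = SoftSpaceWriter()
w.write_item('a'); w.write_item('b')
print(''.join(w.parts))   # a b
```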
The core is PyFile_WriteObject(v, w, Py_PRINT_RAW): at last we ask for the raw, unprocessed string. Is this what makes the Chinese character appear?
The call chain is as follows, and we land back in string_print.
This time it is this block that takes effect:
- if (flags & Py_PRINT_RAW) {
- char *data = op->ob_sval;
- Py_ssize_t size = Py_SIZE(op);
- Py_BEGIN_ALLOW_THREADS
- while (size > INT_MAX) {
- // very long strings are written in chunks of INT_MAX & ~0x3FFF,
- // i.e. INT_MAX rounded down to a multiple of 16 KiB (not "14-bit units"),
- // so each fwrite length fits in an int; two worries worth checking:
- // 1. any memory-alignment issue? 2. multi-byte characters can be
- // split across chunk boundaries: does the output come out wrong? (verify)
- const int chunk_size = INT_MAX & ~0x3FFF;
- fwrite(data, 1, chunk_size, fp);
- data += chunk_size;
- size -= chunk_size;
- }
- fwrite(data, 1, (size_t)size, fp);
- Py_END_ALLOW_THREADS
- return 0;
- }
Here fwrite stands in for the fputc/fprintf used by the non-raw output.
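The raw path can be simulated with an in-memory byte stream; as with fwrite, the bytes pass through untouched:

```python
import io

out = io.BytesIO()              # stand-in for the <stdout> FILE*
data = '汉'.encode('gbk')
out.write(data)                 # raw write: no quotes, no escaping
print(out.getvalue())           # the two GBK bytes; a cp936 console renders 汉
```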
print unicode takes a completely different route: already inside PyFile_WriteObject, the unicode object is converted to a str (value):
- #ifdef Py_USING_UNICODE
- if ((flags & Py_PRINT_RAW) &&
- PyUnicode_Check(v) && enc != Py_None) {
- char *cenc = PyString_AS_STRING(enc);
- char *errors = fobj->f_errors == Py_None ?
- "strict" : PyString_AS_STRING(fobj->f_errors);
- value = PyUnicode_AsEncodedString(v, cenc, errors);
- if (value == NULL)
- return -1;
- } else {
- ...
- }
- result = file_PyObject_Print(value, fobj, flags);
- Py_DECREF(value);
- return result;
cenc holds the terminal's character encoding 'cp936', and errors='strict'; the core is the call PyUnicode_AsEncodedString(v, cenc, errors).
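In Python this whole call is one line, and cp936 resolves to the same codec as gbk:

```python
import codecs

s = '\u6c49'                          # u'汉'
print(s.encode('cp936', 'strict'))    # the two GBK bytes
print(codecs.lookup('cp936').name)    # the alias resolves to gbk
```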
_PyCodec_EncodeInternal is straightforward: it simply calls the encoder function. So we need to focus on how _PyCodec_TextEncoder finds that encoder.
The core is a call to _PyCodec_Lookup(encoding), which initializes a module named encodings, whose directory is
{PythonDir}\Lib\encodings\
During __init__.py's initialization, a search function named search_function is defined and registered with codecs, taking us back into the C module:
- [__init__.py]
- codecs.register(search_function)
-
- [_codecsmodule.c]
- static PyMethodDef _codecs_functions[] = {
- {"register", codec_register, METH_O,
- register__doc__},
- ...
- }
-
- static
- PyObject *codec_register(PyObject *self, PyObject *search_function)
- {
- if (PyCodec_Register(search_function))
- return NULL;
- Py_RETURN_NONE;
- }
-
- int PyCodec_Register(PyObject *search_function)
- {
- PyInterpreterState *interp = PyThreadState_GET()->interp;
- if (interp->codec_search_path == NULL && _PyCodecRegistry_Init())
- goto onError;
- if (search_function == NULL) {
- PyErr_BadArgument();
- goto onError;
- }
- if (!PyCallable_Check(search_function)) {
- PyErr_SetString(PyExc_TypeError, "argument must be callable");
- goto onError;
- }
- return PyList_Append(interp->codec_search_path, search_function);
-
- onError:
- return -1;
- }
The logic of search_function: use _aliases to map encoding to its module filename aliased_encoding; prefer loading the module by that filename, otherwise fall back to encoding itself as the filename.
In aliases.py we can confirm that cp936 maps to the file gbk.py.
search_function returns a codecs.CodecInfo, a subclass of tuple; see {PythonDir}\Lib\codecs.py.
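That CodecInfo is a tuple subclass is easy to verify; Python 3's registry keeps the same shape, with the encode function at index 0:

```python
import codecs

info = codecs.lookup('cp936')
print(isinstance(info, tuple))   # True: CodecInfo subclasses tuple
encoder = info[0]                # the stateless encode function
print(encoder('\u6c49'))         # (encoded bytes, number of chars consumed)
```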
Actually pinning down the encoder function takes some effort: it is a PyCFunction, and both _codecs_cn and the gbk functions are generated by macros:
- [gbk.py]
- codec = _codecs_cn.getcodec('gbk')
- class Codec(codecs.Codec):
- encode = codec.encode
- decode = codec.decode
-
- [_codecs_cn.c]
- BEGIN_CODECS_LIST
- CODEC_STATELESS(gb2312)
- CODEC_STATELESS(gbk)
- CODEC_STATELESS(gb18030)
- CODEC_STATEFUL(hz)
- END_CODECS_LIST
-
- I_AM_A_MODULE_FOR(cn)
-
-
- // expanded, the macros amount to this: a few maps plus the encode/decode functions
- static const struct dbcs_map _mapping_list[] = {
- { "gb2312", NULL, (void*)gb2312_decmap },
- { "gbkext", NULL, (void*)gbkext_decmap },
- { "gbcommon", (void*)gbcommon_encmap, NULL },
- { "gb18030ext", (void*)gb18030ext_encmap, NULL },
- { "", NULL, NULL } };
- static const struct dbcs_map *mapping_list = (const struct dbcs_map *)_mapping_list;
-
- static const MultibyteCodec _codec_list[] = {
- { "gb2312", NULL, NULL, gb2312_encode, NULL, NULL, gb2312_decode, NULL, NULL },
- { "gbk", NULL, NULL, gbk_encode, NULL, NULL, gbk_decode, NULL, NULL },
- { "gb18030", NULL, NULL, gb18030_encode, NULL, NULL, gb18030_decode, NULL, NULL },
- { "hz", NULL, NULL, hz_encode, NULL, NULL, hz_decode, NULL, NULL },
- { "", NULL, } };
-
- static const MultibyteCodec *codec_list = (const MultibyteCodec *)_codec_list;
-
- void
- init_codecs_cn(void)
- {
- PyObject *m = Py_InitModule("_codecs_cn", __methods);
- if (m != NULL)
- (void)register_maps(m);
- }
The first line of [gbk.py] essentially unpacks into:
1. In the _codecs_cn module's codec_list, find the MultibyteCodec object codec whose encoding is 'gbk'
2. Create a PyCapsule object codecobj that wraps this codec (capsule->pointer = pointer;)
3. Call _multibytecodec.__create_codec(codecobj), which takes the MultibyteCodec back out and wraps it in a MultibyteCodecObject (self->codec = codec;)
4. Return that MultibyteCodecObject; this is the codec in gbk.py
5. Taking encode = codec.encode as an example, the MultibyteCodecObject's methods are determined by the MultibyteCodec_Type definition:
- static struct PyMethodDef multibytecodec_methods[] = {
- {"encode", (PyCFunction)MultibyteCodec_Encode,
- METH_VARARGS | METH_KEYWORDS,
- MultibyteCodec_Encode__doc__},
- {"decode", (PyCFunction)MultibyteCodec_Decode,
- METH_VARARGS | METH_KEYWORDS,
- MultibyteCodec_Decode__doc__},
- {NULL, NULL},
- };
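The same machinery is still present in Python 3, so the private module can be poked at directly (exploration only; _codecs_cn is a CPython implementation detail, not public API):

```python
import _codecs_cn                   # CPython-private extension module

codec = _codecs_cn.getcodec('gbk')  # a MultibyteCodec object
print(type(codec).__name__)         # MultibyteCodec
print(codec.encode('\u6c49'))       # bytes plus the number of chars consumed
print(codec.decode(b'\xba\xba'))    # and back again
```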
To pin down how the encode logic turns unicode into str, we need to understand how this gbk codec is implemented (i.e. work out the macros).
Reading the v2.7.15 code is still challenging; the v2.7.9 logic is much clearer. The change exists purely to add a Python 3 warning check:
- /* Text encoding/decoding API */
- PyObject * _PyCodec_LookupTextEncoding(const char *encoding,
- const char *alternate_command)
- {
- ...
- if (Py_Py3kWarningFlag && !PyTuple_CheckExact(codec)) {
- attr = PyObject_GetAttrString(codec, "_is_text_encoding");
- if (attr == NULL) {
- if (!PyErr_ExceptionMatches(PyExc_AttributeError))
- goto onError;
- PyErr_Clear();
- } else {
- is_text_codec = PyObject_IsTrue(attr);
- Py_DECREF(attr);
- if (is_text_codec < 0)
- goto onError;
- if (!is_text_codec) {
- PyObject *msg = PyString_FromFormat(
- "'%.400s' is not a text encoding; "
- "use %s to handle arbitrary codecs",
- encoding, alternate_command);
- if (msg == NULL)
- goto onError;
- if (PyErr_WarnPy3k(PyString_AS_STRING(msg), 1) < 0) {
- Py_DECREF(msg);
- goto onError;
- }
- Py_DECREF(msg);
- }
- }
- }
- ...
For convenience, you can first get to know these functions:
[_codecsmodule.c]
ascii_encode/ascii_decode
The logic in ascii.py is written very clearly; its function references point straight at these two!
[_codecs_iso2022.c]
So we have good reason to believe the encoder is a member function of that MultibyteCodecObject, and the call eventually lands in:
- static PyObject *
- multibytecodec_encode(MultibyteCodec *codec,
- MultibyteCodec_State *state,
- const Py_UNICODE **data, Py_ssize_t datalen,
- PyObject *errors, int flags)
In one hand it holds the codec struct for the encoding ({ "gbk", NULL, NULL, gbk_encode, NULL, NULL, gbk_decode, NULL, NULL }), in the other the unicode object data we passed in;
through the intermediate MultibyteEncodeBuffer buf, it churns the input buf.inbuf into buf.outobj.
Plainly put: it turns the unicode 0x6c49 into the GBK 0xbaba, which then gets fwrite-ed to <stdout>.
- [multibytecodec.c]
- static PyObject *
- multibytecodec_encode(MultibyteCodec *codec,
- MultibyteCodec_State *state,
- const Py_UNICODE **data, Py_ssize_t datalen,
- PyObject *errors, int flags)
- {
- ...
- while (buf.inbuf < buf.inbuf_end) {
- Py_ssize_t inleft, outleft;
-
- /* we don't reuse inleft and outleft here.
- * error callbacks can relocate the cursor anywhere on buffer*/
- inleft = (Py_ssize_t)(buf.inbuf_end - buf.inbuf);
- outleft = (Py_ssize_t)(buf.outbuf_end - buf.outbuf);
- r = codec->encode(state, codec->config, &buf.inbuf, inleft,
- &buf.outbuf, outleft, flags);
- if ((r == 0) || (r == MBERR_TOOFEW && !(flags & MBENC_FLUSH)))
- break;
- else if (multibytecodec_encerror(codec, state, &buf, errors,r))
- goto errorexit;
- else if (r == MBERR_TOOFEW)
- break;
- }
-
- [_codecs_cn.c]
- ENCODER(gbk)
- {
- while (inleft > 0) {
- Py_UNICODE c = IN1;
- DBCHAR code;
- if (c < 0x80) {
- WRITE1((unsigned char)c)
- NEXT(1, 1)
- continue;
- }
- UCS4INVALID(c)
- REQUIRE_OUTBUF(2)
- GBK_ENCODE(c, code)
- else return 1;
- OUT1((code >> 8) | 0x80)
- if (code & 0x8000)
- OUT2((code & 0xFF)) /* MSB set: GBK */
- else
- OUT2((code & 0xFF) | 0x80) /* MSB unset: GB2312 */
- NEXT(1, 2)
- }
- return 0;
- }
-
- /* GBK and GB2312 map differently in few codepoints that are listed below:
- *
- * gb2312 gbk
- * A1A4 U+30FB KATAKANA MIDDLE DOT U+00B7 MIDDLE DOT
- * A1AA U+2015 HORIZONTAL BAR U+2014 EM DASH
- * A844 undefined U+2015 HORIZONTAL BAR
- */
-
- #define GBK_DECODE(dc1, dc2, assi) \
- if ((dc1) == 0xa1 && (dc2) == 0xaa) (assi) = 0x2014; \
- else if ((dc1) == 0xa8 && (dc2) == 0x44) (assi) = 0x2015; \
- else if ((dc1) == 0xa1 && (dc2) == 0xa4) (assi) = 0x00b7; \
- else TRYMAP_DEC(gb2312, assi, dc1 ^ 0x80, dc2 ^ 0x80); \
- else TRYMAP_DEC(gbkext, assi, dc1, dc2);
-
- #define GBK_ENCODE(code, assi) \
- if ((code) == 0x2014) (assi) = 0xa1aa; \
- else if ((code) == 0x2015) (assi) = 0xa844; \
- else if ((code) == 0x00b7) (assi) = 0xa1a4; \
- else if ((code) != 0x30fb && TRYMAP_ENC_COND(gbcommon, assi, code));
-
- #define _TRYMAP_ENC(m, assi, val) \
- ((m)->map != NULL && (val) >= (m)->bottom && \
- (val)<= (m)->top && ((assi) = (m)->map[(val) - \
- (m)->bottom]) != NOCHAR)
- #define TRYMAP_ENC_COND(charset, assi, uni) \
- _TRYMAP_ENC(&charset##_encmap[(uni) >> 8], assi, (uni) & 0xff)
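These hand-mapped code points can be checked against the gbk codec shipped with Python 3, which is built from this very source (the U+30FB rejection comes from the != 0x30fb guard in GBK_ENCODE):

```python
# the three special-cased mappings in GBK_ENCODE
assert '\u2014'.encode('gbk') == b'\xa1\xaa'   # EM DASH
assert '\u2015'.encode('gbk') == b'\xa8\x44'   # HORIZONTAL BAR
assert '\u00b7'.encode('gbk') == b'\xa1\xa4'   # MIDDLE DOT

# KATAKANA MIDDLE DOT is deliberately not encodable in gbk
try:
    '\u30fb'.encode('gbk')
except UnicodeEncodeError:
    print('U+30FB rejected, as the macro dictates')
```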
All of it is map lookups, bit shifts, and arithmetic; we won't dig deeper here.
Interestingly, the console's properties reveal where cp936 and gbk come from; if you are curious, try changing that value, reopening the console, and seeing how Python handles Chinese then.
1. fwrite and fputc/fprintf belong to two different families of output calls
2. Typing the variable name plus Enter prints via the repr path, which asks for a processed string; print outputs the raw string (the str path)
3. Only print unicode needs transcoding (encode); print str does not
4. Transcoding uses the encoding name to find the matching .py file, and from there the encoder and decoder; the core logic is table lookups and bit operations.
1. The history and rules of character encodings
Where did ascii, cp936, gbk, gb2312, utf8, utf16 each come from?
2. Python's detection of character encodings and transcoding
When to decode, and when to encode?
How to detect a file's character encoding?
How to detect a string's character encoding?
3. Concrete differences among the file I/O interfaces
fgetc and fputc
fgets and fputs
fread and fwrite
fscanf and fprintf
4. Will chunked fwrite of text produce garbled output?