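Python decompilers are a dime a dozen.
So why build your own?
Obfuscated Python code can make some of those tools fail. That is not even the worst part: some people modify Python's bytecode definitions and build their own Python interpreter. Would anyone really be that bored? Yes, and I ran into exactly that.
What do you do in that case? Take a piece of Python code, compile it once with the modified Python and once with stock Python, compare the two bytecode streams, and map the modified opcodes back to the stock ones. After that the file can be decompiled again.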
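The comparison step can be sketched roughly as follows. This is a minimal sketch, not a finished tool: it assumes the modified interpreter only renumbers opcodes and keeps the Python 2.7 instruction layout (opcodes at or above 90 take a 2-byte operand). The names `derive_opcode_map` and `restore` and the "patched" byte values are made up for illustration; the stock bytes are the bytecode of `func` from the example later in this post.

```python
HAVE_ARGUMENT = 90  # stock Python 2.7: opcodes >= 90 carry a 2-byte operand


def derive_opcode_map(stock_code, patched_code):
    """Walk both bytecode strings in lockstep and learn patched -> stock opcodes."""
    stock, patched = bytearray(stock_code), bytearray(patched_code)
    if len(stock) != len(patched):
        raise ValueError("bytecode lengths differ; layout was changed too")
    mapping = {}
    i = 0
    while i < len(stock):
        s_op, p_op = stock[i], patched[i]
        # The same patched opcode must always map to the same stock opcode.
        if mapping.setdefault(p_op, s_op) != s_op:
            raise ValueError("inconsistent mapping for opcode %d" % p_op)
        # Instruction length is taken from the stock opcode, which we trust.
        i += 3 if s_op >= HAVE_ARGUMENT else 1
    return mapping


def restore(patched_code, mapping):
    """Rewrite patched bytecode with stock opcode numbers."""
    out = bytearray(patched_code)
    i = 0
    while i < len(out):
        out[i] = mapping[out[i]]  # KeyError means an opcode we have not seen yet
        i += 3 if out[i] >= HAVE_ARGUMENT else 1
    return bytes(out)


# Stock bytecode of func() from this post's example.
stock = bytearray([0x64, 1, 0, 0x47, 0x48, 0x64, 2, 0,
                   0x7d, 0, 0, 0x7c, 0, 0, 0x53])
# Hypothetical output of a modified interpreter (opcode values invented).
patched = bytearray([0x10, 1, 0, 0x6e, 0x6f, 0x10, 2, 0,
                     0x11, 0, 0, 0x12, 0, 0, 0x70])

opcode_map = derive_opcode_map(stock, patched)
restored = restore(patched, opcode_map)
```

One sample only reveals the opcodes it happens to use, so in practice the test source needs to exercise as many opcodes as possible.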
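The bytecode Python produces is stored in a pyc file. After a short header, the pyc file is simply a serialized (marshalled) PyCodeObject, so understanding the PyCodeObject structure is all we need.
The struct is defined as follows: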
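
    typedef struct {
        PyObject_HEAD
        int co_argcount;         /* number of arguments, except *args */
        int co_nlocals;          /* number of local variables */
        int co_stacksize;        /* entries needed for the evaluation stack */
        int co_flags;            /* CO_... flags */
        PyObject *co_code;       /* instruction opcodes */
        PyObject *co_consts;     /* constants used */
        PyObject *co_names;      /* names used */
        PyObject *co_varnames;   /* local variable names */
        PyObject *co_freevars;   /* free variable names */
        PyObject *co_cellvars;   /* cell variable names */
        PyObject *co_filename;   /* file the code was loaded from */
        PyObject *co_name;       /* name of the code block */
        int co_firstlineno;      /* first source line number */
        PyObject *co_lnotab;     /* bytecode offset <-> line number mapping */
        void *co_zombieframe;    /* for optimization only */
    } PyCodeObject;
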
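Each PyCodeObject represents one code block, which you can also think of as one scope.
A pyc file contains more than one code block: the file itself, every function, and every class each get their own.
The code objects of nested scopes are stored in the co_consts of the enclosing PyCodeObject, starting from the one for the file.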
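To see that nesting in practice, the sketch below walks co_consts recursively. The helper name `iter_code_objects` and the sample source are my own. It uses `compile()` instead of a pyc file so it runs on any Python version.

```python
import types


def iter_code_objects(code):
    """Yield a code object and, recursively, every code object nested in it."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            for child in iter_code_objects(const):
                yield child


source = (
    "def func():\n"
    "    return 1\n"
    "\n"
    "class A(object):\n"
    "    def method(self):\n"
    "        return 2\n"
)
module = compile(source, "main.py", "exec")
names = [code.co_name for code in iter_code_objects(module)]
print(names)
```

The module-level block comes first; `method` is only reachable through the co_consts of the class body `A`.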
Enough talk. Let's play with some code.

    s = 'string'
    i = 10

    def func():
        print 'pyc file format'
        ss = 'new string'
        return ss
    s2 = func()
    print s2

Compile it to a pyc file:

    python2 -m compileall main.py

First, a look at the raw bytes with hexdump:

    00000000 03 f3 0d 0a 6b af be 5d 63 00 00 00 00 00 00 00 |....k..]c.......|
    00000010 00 01 00 00 00 40 00 00 00 73 27 00 00 00 64 00 |.....@...s'...d.|
    00000020 00 5a 00 00 64 01 00 5a 01 00 64 02 00 84 00 00 |.Z..d..Z..d.....|
    00000030 5a 02 00 65 02 00 83 00 00 5a 03 00 65 03 00 47 |Z..e.....Z..e..G|
    00000040 48 64 03 00 53 28 04 00 00 00 74 06 00 00 00 73 |Hd..S(....t....s|
    00000050 74 72 69 6e 67 69 0a 00 00 00 63 00 00 00 00 01 |tringi....c.....|
    00000060 00 00 00 01 00 00 00 43 00 00 00 73 0f 00 00 00 |.......C...s....|
    00000070 64 01 00 47 48 64 02 00 7d 00 00 7c 00 00 53 28 |d..GHd..}..|..S(|
    00000080 03 00 00 00 4e 73 0f 00 00 00 70 79 63 20 66 69 |....Ns....pyc fi|
    00000090 6c 65 20 66 6f 72 6d 61 74 73 0a 00 00 00 6e 65 |le formats....ne|
    000000a0 77 20 73 74 72 69 6e 67 28 00 00 00 00 28 01 00 |w string(....(..|
    000000b0 00 00 74 02 00 00 00 73 73 28 00 00 00 00 28 00 |..t....ss(....(.|
    000000c0 00 00 00 73 07 00 00 00 6d 61 69 6e 2e 70 79 74 |...s....main.pyt|
    000000d0 04 00 00 00 66 75 6e 63 05 00 00 00 73 06 00 00 |....func....s...|
    000000e0 00 00 01 05 01 06 01 4e 28 04 00 00 00 74 01 00 |.......N(....t..|
    000000f0 00 00 73 74 01 00 00 00 69 52 02 00 00 00 74 02 |..st....iR....t.|
    00000100 00 00 00 73 32 28 00 00 00 00 28 00 00 00 00 28 |...s2(....(....(|
    00000110 00 00 00 00 73 07 00 00 00 6d 61 69 6e 2e 70 79 |....s....main.py|
    00000120 74 08 00 00 00 3c 6d 6f 64 75 6c 65 3e 02 00 00 |t....<module>...|
    00000130 00 73 08 00 00 00 06 01 06 02 09 04 09 01 |.s............|

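The first 4 bytes are the magic number, which differs between Python versions. The first two bytes, 03 f3, are the little-endian value 62211 (Python 2.7); the last two, 0d 0a, are \r\n.
The next 4 bytes, 6b af be 5d, are a little-endian timestamp: the modification time of the source file.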
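The header can be decoded with the `struct` module. A minimal sketch, using the 8 header bytes from the dump above; the variable names are mine.

```python
import struct

# First 8 bytes of main.pyc, copied from the hexdump.
header = bytes(bytearray([0x03, 0xf3, 0x0d, 0x0a, 0x6b, 0xaf, 0xbe, 0x5d]))

# "<" = little-endian without padding, "H" = 2-byte version number,
# "2s" = the two bytes "\r\n", "I" = 4-byte source modification time.
magic, crlf, mtime = struct.unpack("<H2sI", header)

print(magic)        # 62211
print(repr(crlf))
print(mtime)        # 1572777835 (seconds since the Unix epoch)
```

If the magic number does not match any known Python release, that is a first hint that the file came from a modified interpreter.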
Let's go through it piece by piece.

    00000000 .. .. .. .. .. .. .. .. 63 00 00 00 00 00 00 00 |....k..]c.......|
    00000010 00 01 00 00 00 40 00 00 00 73 27 00 00 00 64 00 |.....@...s'...d.|
    00000020 00 5a 00 00 64 01 00 5a 01 00 64 02 00 84 00 00 |.Z..d..Z..d.....|
    00000030 5a 02 00 65 02 00 83 00 00 5a 03 00 65 03 00 47 |Z..e.....Z..e..G|
    00000040 48 64 03 00 53

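Right after the header comes 0x63, the character 'c'. This is a type tag (TYPE_CODE).
The 4 bytes after the tag are the number of positional arguments of the module-level code block (co_argcount), which is 0 for the code above.
The next 4 bytes are the number of local variables of the code block (co_nlocals), also 0 here.
The next 4 bytes are the stack size needed by the current code block (co_stacksize), which is 1 here.
The next 4 bytes are co_flags, 0x40 here.
Now for the important part. See the 0x73 that follows? The bytecode comes right after it. 0x73 stands for TYPE_STRING, the tag for a PyStringObject: the bytecode sequence of a PyCodeObject is stored as a PyStringObject.
The 4 bytes after 0x73 are the size of the bytecode, 0x27 here. In other words, the 0x27 bytes starting at (and including) the 0x64 byte are all Python bytecode.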
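The same fields can be pulled out with `struct`. A minimal sketch over bytes 0x08 to 0x44 of the dump; the variable names are mine, and the fixed offsets 17 and 22 only hold because the four integer fields always take 16 bytes after the 1-byte tag.

```python
import binascii
import struct

# Bytes 0x08..0x44 of main.pyc, copied from the hexdump above.
blob = binascii.unhexlify(
    "63"                # 'c' TYPE_CODE
    "00000000"          # co_argcount
    "00000000"          # co_nlocals
    "01000000"          # co_stacksize
    "40000000"          # co_flags
    "73" "27000000"     # 's' TYPE_STRING + length of co_code
    "6400005a0000" "6401005a0100" "640200840000" "5a0200"
    "650200830000" "5a0300" "650300" "4748" "640300" "53"
)

# 1-byte tag followed by four little-endian 32-bit integers.
tag, argcount, nlocals, stacksize, flags = struct.unpack_from("<c4i", blob, 0)

# At offset 17: the string tag and its 4-byte length, then the bytecode.
str_tag, size = struct.unpack_from("<ci", blob, 17)
co_code = blob[22:22 + size]

print(tag, argcount, nlocals, stacksize, hex(flags), str_tag, size)
```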
Let's verify this with Python's dis module:

    >>> import dis, marshal
    >>> f = open('main.pyc', 'rb')
    >>> f.read(8)
    '\x03\xf3\r\nk\xaf\xbe]'
    >>> c = marshal.load(f)
    >>> c.co_consts
    ('string', 10, <code object func at 0x7f392fbbbc30, file "main.py", line 5>, None)
    >>> c.co_names
    ('s', 'i', 'func', 's2')
    >>> dis.dis(c)
      2           0 LOAD_CONST               0 ('string')
                  3 STORE_NAME               0 (s)

      3           6 LOAD_CONST               1 (10)
                  9 STORE_NAME               1 (i)

      5          12 LOAD_CONST               2 (<code object func at 0x7f392fbbbc30, file "main.py", line 5>)
                 15 MAKE_FUNCTION            0
                 18 STORE_NAME               2 (func)

      9          21 LOAD_NAME                2 (func)
                 24 CALL_FUNCTION            0
                 27 STORE_NAME               3 (s2)

     10          30 LOAD_NAME                3 (s2)
                 33 PRINT_ITEM
                 34 PRINT_NEWLINE
                 35 LOAD_CONST               3 (None)
                 38 RETURN_VALUE
    >>>

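Exactly 39 bytes (0x27): the last instruction, RETURN_VALUE, is a single byte at offset 38. The columns of the dis output mean: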
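| Column | Meaning |
| --- | --- |
| 1 | Line number in the source file |
| 2 | Offset of the instruction within co_code |
| 3 | Opcode, a one-byte integer; some opcodes take an operand and some do not |
| 4 | Operand, two bytes |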
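I won't list which byte value each opcode maps to. Look it up yourself, for example in Include/opcode.h of the CPython source or in dis.opname.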
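To show how little is needed to reproduce the offset, opcode, and operand columns, here is a minimal disassembler sketch for Python 2.7 bytecode. The function name `disassemble` is mine. The opcode table only covers the opcodes that occur in this post's example, and EXTENDED_ARG is deliberately not handled.

```python
import binascii

HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 are followed by a 2-byte operand

# Python 2.7 numbering, limited to the opcodes used in this example.
OPNAMES = {
    71: "PRINT_ITEM", 72: "PRINT_NEWLINE", 83: "RETURN_VALUE",
    90: "STORE_NAME", 100: "LOAD_CONST", 101: "LOAD_NAME",
    124: "LOAD_FAST", 125: "STORE_FAST",
    131: "CALL_FUNCTION", 132: "MAKE_FUNCTION",
}


def disassemble(co_code):
    """Return a list of (offset, opcode, operand-or-None) tuples."""
    code = bytearray(co_code)
    result = []
    i = 0
    while i < len(code):
        op = code[i]
        if op >= HAVE_ARGUMENT:
            # The operand is a little-endian 16-bit integer.
            arg = code[i + 1] | (code[i + 2] << 8)
            result.append((i, op, arg))
            i += 3
        else:
            result.append((i, op, None))
            i += 1
    return result


# The 39 bytes of module-level bytecode from the hexdump.
co_code = binascii.unhexlify(
    "6400005a0000" "6401005a0100" "640200840000" "5a0200"
    "650200830000" "5a0300" "650300" "4748" "640300" "53"
)
instructions = disassemble(co_code)
for offset, op, arg in instructions:
    print("%4d %-14s %s" % (offset, OPNAMES.get(op, "<%d>" % op),
                            "" if arg is None else arg))
```

For a pyc from a modified interpreter, the byte values in `OPNAMES` are exactly what has to be recovered.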

    00000040 .. .. .. .. .. 28 04 00 00 00 74 06 00 00 00 73 |Hd..S(....t....s|
    00000050 74 72 69 6e 67 69 0a 00 00 00 63 00 00 00 00 01 |tringi....c.....|
    00000060 00 00 00 01 00 00 00 43 00 00 00 73 0f 00 00 00 |.......C...s....|
    00000070 64 01 00 47 48 64 02 00 7d 00 00 7c 00 00 53

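The opcodes end here. Starting at the 0x28 byte ('(', TYPE_TUPLE) is co_consts, which holds the constants of the code block.
The next 4 bytes are the number of elements, 0x4 here, so there are 4 elements.
The first element is a PyStringObject. Its type tag is 0x74 ('t', TYPE_INTERNED, an interned string), followed by a 4-byte length and then the string content.
The second element is an int. Its type tag is 0x69 ('i', TYPE_INT), followed by the 4-byte value, 0xA.
The third element is a PyCodeObject with type tag 0x63 (TYPE_CODE). It is parsed exactly like the one above, so I won't repeat that here.
The fourth element, None, is the single byte 0x4e ('N') that follows this nested code object.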
Skipping over the body of that nested code block, we reach its file information:

    000000c0 .. .. .. 73 07 00 00 00 6d 61 69 6e 2e 70 79 74 |...s....main.pyt|
    000000d0 04 00 00 00 66 75 6e 63 05 00 00 00 73 06 00 00 |....func....s...|
    000000e0 00 00 01 05 01 06 01 4e 28 04 00 00 00 74 01 00 |.......N(....t..|

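0x73 is the string type tag, 0x07 is the length, and then comes the string itself. This is co_filename, main.py.
Next is co_name: type tag 0x74, then the length 0x4, then the 4 bytes of the function name, func. The 4 bytes after that are co_firstlineno, the line in the source file where the block starts, which is 5 in this example.
Then comes co_lnotab, which maps bytecode offsets to source line numbers. It is stored as a PyStringObject: first the tag 0x73 ('s'), then the 4-byte length 0x00000006, then the content bytes 00 01 05 01 06 01.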
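co_lnotab is a sequence of (offset increment, line increment) byte pairs, starting from offset 0 and co_firstlineno. The sketch below decodes it in simplified form: the function name `decode_lnotab` is mine, and it just accumulates the pairs without the special handling the real format needs for increments larger than 255.

```python
def decode_lnotab(firstlineno, lnotab):
    """Return (bytecode offset, source line) pairs encoded in co_lnotab."""
    pairs = bytearray(lnotab)
    addr, line = 0, firstlineno
    table = []
    # Even positions hold offset increments, odd positions line increments.
    for addr_incr, line_incr in zip(pairs[0::2], pairs[1::2]):
        addr += addr_incr
        line += line_incr
        table.append((addr, line))
    return table


# func: co_firstlineno = 5, co_lnotab = 00 01 05 01 06 01
func_lines = decode_lnotab(5, bytearray([0x00, 0x01, 0x05, 0x01, 0x06, 0x01]))
print(func_lines)    # [(0, 6), (5, 7), (11, 8)]

# module: co_firstlineno = 2, co_lnotab = 06 01 06 02 09 04 09 01
module_lines = decode_lnotab(
    2, bytearray([0x06, 0x01, 0x06, 0x02, 0x09, 0x04, 0x09, 0x01]))
print(module_lines)  # [(6, 3), (12, 5), (21, 9), (30, 10)]
```

The module-level result lines up with the first two columns of the dis output: offset 0 belongs to line 2 (co_firstlineno itself), offset 6 to line 3, offset 12 to line 5, and so on.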
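The rest belongs to the module-level code object: the remainder of co_names, then the empty co_varnames, co_freevars, and co_cellvars, followed by co_filename, co_name, co_firstlineno, and co_lnotab: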

    000000f0 00 00 73 74 01 00 00 00 69 52 02 00 00 00 74 02 |..st....iR....t.|
    00000100 00 00 00 73 32 28 00 00 00 00 28 00 00 00 00 28 |...s2(....(....(|
    00000110 00 00 00 00 73 07 00 00 00 6d 61 69 6e 2e 70 79 |....s....main.py|
    00000120 74 08 00 00 00 3c 6d 6f 64 75 6c 65 3e 02 00 00 |t....<module>...|
    00000130 00 73 08 00 00 00 06 01 06 02 09 04 09 01 |.s............|
