Topic: PAK binary format

wlindley

PAK binary format
« on: May 26, 2015, 12:41:35 AM »
Let's de-mystify the .pak file format.  Paks are actually fairly simple data files, although the details can certainly be complex.

In besch/reader/obj_reader.cc, obj_reader_t::read_file() opens a file and calls read_nodes() in that same .cc to read each node.  The file begins with the version information as plain text, terminated with a Ctrl+Z (0x1A) byte:

Code:
53 69 6d 75 74 72 61 6e  73 20 6f 62 6a 65 63 74  |Simutrans object|
20 66 69 6c 65 0a 43 6f  6d 70 69 6c 65 64 20 77  | file.Compiled w|
69 74 68 20 53 69 6d 4f  62 6a 65 63 74 73 20 30  |ith SimObjects 0|
2e 31 2e 33 65 78 70 0a  1a eb 03 00 00 52 4f 4f  |.1.3exp......ROO|
54 01 00 00 00 42 55 49  4c 26 00 25 00 08 80 03  |T....BUIL&.%....|

Following that are four bytes of Pak-File version (eb 03 00 00 above, which is 0x000003EB = 1003 when read as a little-endian integer), and then a series of nodes until the end of file.  Each node is processed by its appropriate reader found in the besch/reader/ subdirectory. In the file, each node begins with four characters describing the node type, as defined in besch/objversion.h:

Code:
enum obj_type
{
        obj_bridge      = C4ID('B','R','D','G'),
        obj_building    = C4ID('B','U','I','L'),

and then a 2-byte (16-bit) child count and 2-byte (16-bit) data block size, both little-endian.  If the data block is too large for the 16-bit field (roughly 64k bytes or more), 0xFFFF is written as the data size, followed by the real size as a four-byte (32-bit) value.  Then come the actual data block bytes, followed by any additional nodes in this same format.
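
To make the layout concrete, here is a minimal C++ sketch that walks the bytes from the dump above: it skips the text banner, reads the version, and then decodes one node header at a time. It is only an illustration of the description in this post, not the code from obj_reader.cc. The names NodeHeader, read_pak_version and read_node_header are made up for the example, and it assumes all multi-byte fields are little-endian, as the dump suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names; this is an illustration, not the obj_reader_t code.
struct NodeHeader {
    std::string type;      // four-character tag, e.g. "ROOT" or "BUIL"
    uint16_t    children;  // number of child nodes
    uint32_t    size;      // data block size in bytes
};

// Assumption: all multi-byte fields are little-endian, as the dump suggests.
static uint16_t read_u16(const std::vector<uint8_t>& b, size_t& pos) {
    assert(pos + 2 <= b.size());
    const uint16_t v = static_cast<uint16_t>(b[pos] | (b[pos + 1] << 8));
    pos += 2;
    return v;
}

static uint32_t read_u32(const std::vector<uint8_t>& b, size_t& pos) {
    assert(pos + 4 <= b.size());
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<uint32_t>(b[pos + i]) << (8 * i);
    }
    pos += 4;
    return v;
}

// Skips the text banner up to and including the 0x1A byte, then reads the
// four-byte pak-file version. Returns false if the data is too short.
bool read_pak_version(const std::vector<uint8_t>& b, size_t& pos, uint32_t& version) {
    while (pos < b.size() && b[pos] != 0x1A) {
        ++pos;
    }
    if (b.size() - pos < 5) {  // need the 0x1A byte plus four version bytes
        return false;
    }
    ++pos;
    version = read_u32(b, pos);
    return true;
}

// Reads one node header and leaves pos at the first byte of the data block.
bool read_node_header(const std::vector<uint8_t>& b, size_t& pos, NodeHeader& out) {
    if (pos > b.size() || b.size() - pos < 8) {
        return false;
    }
    out.type.clear();
    for (int i = 0; i < 4; ++i) {
        out.type.push_back(static_cast<char>(b[pos + i]));
    }
    pos += 4;
    out.children = read_u16(b, pos);
    out.size = read_u16(b, pos);
    if (out.size == 0xFFFF) {  // escape value: the real size follows as 32 bits
        if (b.size() - pos < 4) {
            return false;
        }
        out.size = read_u32(b, pos);
    }
    return true;
}
```

After each header, a caller would consume `size` bytes of data before reading the next header. Fed the bytes of the dump, this gives version 1003, then a ROOT header with one child and no data, then the BUIL header.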

The child count indicates how many of the following nodes are considered to belong to (be "inside") the current node.  The BUIL node in the example has 0x0026 (38) child nodes and a 0x0025-byte (37-byte) data block.  This is how, for example, a single pak file can contain multiple objects, with each object containing several child nodes.
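
Because every node states how many children follow it, the whole file can be read back into a tree with a short recursive function. The sketch below is my own illustration of that idea, not the actual read_nodes() code. Node, Cursor and read_node are hypothetical names, and it assumes the children come directly after the parent's data block, which is how I read the layout described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one node and everything "inside" it.
struct Node {
    std::string          type;      // four-character tag
    std::vector<uint8_t> data;      // raw data block, not interpreted here
    std::vector<Node>    children;  // nested nodes, in file order
};

// Small bounds-checked reader over a byte buffer.
class Cursor {
public:
    explicit Cursor(const std::vector<uint8_t>& bytes) : bytes_(bytes), pos_(0) {}

    // Reads n bytes (n <= 4) as a little-endian unsigned integer.
    uint32_t read_le(size_t n) {
        require(n);
        uint32_t v = 0;
        for (size_t i = 0; i < n; ++i) {
            v |= static_cast<uint32_t>(bytes_[pos_ + i]) << (8 * i);
        }
        pos_ += n;
        return v;
    }

    std::vector<uint8_t> read_bytes(size_t n) {
        require(n);
        std::vector<uint8_t> out(bytes_.begin() + static_cast<std::ptrdiff_t>(pos_),
                                 bytes_.begin() + static_cast<std::ptrdiff_t>(pos_ + n));
        pos_ += n;
        return out;
    }

private:
    void require(size_t n) const {
        if (n > bytes_.size() - pos_) {
            throw std::runtime_error("truncated pak data");
        }
    }

    const std::vector<uint8_t>& bytes_;
    size_t pos_;
};

// Reads one node, then recurses once per declared child.
Node read_node(Cursor& in) {
    Node node;
    for (int i = 0; i < 4; ++i) {
        node.type.push_back(static_cast<char>(in.read_le(1)));
    }
    const uint32_t child_count = in.read_le(2);
    uint32_t size = in.read_le(2);
    if (size == 0xFFFF) {  // escape value: the real size follows as 32 bits
        size = in.read_le(4);
    }
    node.data = in.read_bytes(size);
    for (uint32_t i = 0; i < child_count; ++i) {
        node.children.push_back(read_node(in));
    }
    return node;
}
```

Calling read_node() once on a cursor positioned at the ROOT node (right after the version bytes) would return the entire tree, with each object in the pak file as one child of ROOT.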

Note that read_nodes() chooses the internal class type from the four-character name, using the following line of code:

Code:
        obj_reader_t *reader = obj_reader->get(static_cast<obj_type>(node.type));

How exactly that works, in converting a text representation to a somewhat conceptual C++ class type, is left to the student as an exercise.
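
For anyone who would rather peek at a possible answer: my understanding is that the four characters are never treated as text at all. Packed into one 32-bit integer they are simply the numeric value of the enum constant, so the static_cast only relabels the integer and the lookup is an ordinary table keyed by that number. The sketch below illustrates this under stated assumptions. The c4id function is my guess at what the C4ID macro does (first character in the low byte); the real macro in besch/objversion.h may differ. All the *_sketch names and find_reader are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Assumption: C4ID packs four characters into one 32-bit value with the
// first character in the lowest byte. Check besch/objversion.h for the
// real definition.
constexpr uint32_t c4id(char a, char b, char c, char d) {
    return  static_cast<uint32_t>(static_cast<unsigned char>(a))
         | (static_cast<uint32_t>(static_cast<unsigned char>(b)) << 8)
         | (static_cast<uint32_t>(static_cast<unsigned char>(c)) << 16)
         | (static_cast<uint32_t>(static_cast<unsigned char>(d)) << 24);
}

// Hypothetical stand-in for obj_type: each constant IS the packed tag.
enum obj_type_sketch : uint32_t {
    sketch_bridge   = c4id('B', 'R', 'D', 'G'),
    sketch_building = c4id('B', 'U', 'I', 'L')
};

// Hypothetical stand-ins for the reader classes.
struct reader_sketch {
    virtual ~reader_sketch() {}
    virtual std::string name() const = 0;
};
struct bridge_reader_sketch : reader_sketch {
    std::string name() const override { return "bridge"; }
};
struct building_reader_sketch : reader_sketch {
    std::string name() const override { return "building"; }
};

// Maps the raw 32-bit tag from the file to a reader object. Returns
// nullptr for a tag that no reader handles.
const reader_sketch* find_reader(uint32_t raw_type) {
    static bridge_reader_sketch   bridge;
    static building_reader_sketch building;
    static const std::map<uint32_t, const reader_sketch*> table = {
        { sketch_bridge,   &bridge   },
        { sketch_building, &building }
    };
    const auto it = table.find(raw_type);
    return it == table.end() ? nullptr : it->second;
}
```

Under that assumption, the bytes 42 55 49 4c ("BUIL") read as a little-endian integer give 0x4C495542, which is exactly the value of the building constant, so no string comparison is ever needed.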

Ters

  • Coder/patcher
  • Devotee
  • Posts: 4900
  • Total likes: 217
  • Helpful: 108
  • Languages: EN, NO
Re: PAK binary format
« Reply #1 on: May 26, 2015, 05:14:58 AM »
The PAK file format is simple. Interpreting the data inside it is difficult, especially with all the different versions.

DrSuperGood

Re: PAK binary format
« Reply #2 on: May 26, 2015, 12:29:25 PM »
Is there any mystery about the pak format?! The game is open source, so surely its exact mechanics are exposed to everyone?

I thought the compiled pak format was specifically undocumented to act as some form of crude protection. Specifically, to restrict people from easily extracting artwork from a pak file and using it or calling it their own in another pak object. At least that is what was hinted at by various documentation.

Quote
How exactly that works, in converting a text representation to a somewhat conceptual C++ class type, is left to the student as an exercise.
Sounds kind of like a service provider that was not properly implemented. Java uses such classes heavily for I/O (FileSystems, image framework etc.) to solve the same problem present here (resolving which loader to use from a collection of loaders). All "services" (in this case the readers) register with the "service provider" by providing the required data (in this case it would be the object type in the form of a 4 character code). The service provider, when used, will use the required data to resolve and return the appropriate service (reader).
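
As a rough illustration of that pattern in C++ (a sketch of the general idea only, not how Simutrans actually does it): each reader registers itself under its 4 character code when it is constructed, and the provider resolves a code back to the reader. reader_base, tag, lookup and the two reader classes are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Packs a four-character literal such as "BUIL" into one integer key.
// Hypothetical helper; the byte order here is only an assumption.
constexpr uint32_t tag(const char (&s)[5]) {
    return  static_cast<uint32_t>(static_cast<unsigned char>(s[0]))
         | (static_cast<uint32_t>(static_cast<unsigned char>(s[1])) << 8)
         | (static_cast<uint32_t>(static_cast<unsigned char>(s[2])) << 16)
         | (static_cast<uint32_t>(static_cast<unsigned char>(s[3])) << 24);
}

// The "service provider": keeps one registered reader per type code.
class reader_base {
public:
    virtual ~reader_base() {}
    virtual std::string describe() const = 0;

    // Resolves a type code to its reader, or nullptr if none registered.
    static const reader_base* lookup(uint32_t type) {
        const auto it = registry().find(type);
        return it == registry().end() ? nullptr : it->second;
    }

protected:
    // Each "service" registers itself by passing its type code here.
    explicit reader_base(uint32_t type) { registry()[type] = this; }

private:
    // Function-local static, so the table exists before any global
    // reader object tries to register (avoids init-order problems).
    static std::map<uint32_t, const reader_base*>& registry() {
        static std::map<uint32_t, const reader_base*> table;
        return table;
    }
};

class building_reader : public reader_base {
public:
    building_reader() : reader_base(tag("BUIL")) {}
    std::string describe() const override { return "building reader"; }
};

class bridge_reader : public reader_base {
public:
    bridge_reader() : reader_base(tag("BRDG")) {}
    std::string describe() const override { return "bridge reader"; }
};

// One global instance per reader; constructing it performs the registration.
static building_reader building_reader_instance;
static bridge_reader   bridge_reader_instance;
```

With this in place, reader_base::lookup(tag("BUIL")) returns the building reader, and adding a new node type only requires defining one more reader class and instance.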

Ters

Re: PAK binary format
« Reply #3 on: May 26, 2015, 05:16:20 PM »
Quote
Is there any mystery about the pak format?! The game is open source, so surely its exact mechanics are exposed to everyone?

It's just like the Rosetta Stone, which went on display in a public museum in 1802, but whose Egyptian text wasn't understood until 20 years later.

Quote
I thought the compiled pak format was specifically undocumented to act as some form of crude protection. Specifically, to restrict people from easily extracting artwork from a pak file and using it or calling it their own in another pak object. At least that is what was hinted at by various documentation.

Nobody said that to me when I made Simupak Explorer three years ago (almost to the day). Besides, the artwork is the easiest thing to extract from the pak set files, once you know about the run-length encoding. The format for this has never changed, just the number of images in each image list.

prissi

  • Developer
  • Administrator
  • *
  • Posts: 8830
  • Total likes: 324
  • Helpful: 229
  • Languages: De,EN,JP
Re: PAK binary format
« Reply #4 on: May 26, 2015, 06:30:01 PM »
Well, the image format exists in three versions, as the first only allowed image sizes up to 65k (in other words, only pak sets smaller than pak256).

The point about preventing theft is obsolete, as there are only two closed source pak sets around, neither of which is updated anymore (pak64.german and pak96.comic).

Also, if you want to extract certain data, looking at the log files from Simutrans startup gives you many parameters. Additionally, makeobj has a dump mode, where you get the hex representation of the nodes; there is even a patch for extracting images.

Tools like resizeobj also assume no big changes, or they would break too.

Finally, the node structure allows for many paks in one pak file, and for subnodes of nodes (like fields etc for factories).