Subject: Re: Could CDR-coding be on the way back?
From: Erik Naggum <erik@naggum.net>
Date: 13 Dec 2000 03:31:55 +0000
Newsgroups: comp.lang.lisp,comp.arch
Message-ID: <3185667115752812@naggum.net>

* Per Bothner <per@bothner.com>
| > If your data structures have unpredictable structure, use lists.
|
| If your data structures have unpredictable structure, you may be
| in trouble ...

  Really?  Reading Lisp data from input streams is well-defined today,
  despite the absence of predictability of the objects you read in.  It
  works for both lists and arrays, but arrays are very inefficient
  because you don't know their size a priori, and copying elements like
  crazy as the array grows is much less efficient than grabbing cons
  cells from a pool of them.  The way Lisp readers usually deal with
  arrays is in fact to build a list from the input and then convert the
  list to an array.  You could process the data twice, once to count
  elements and once to actually store them in memory.  You could rely on
  someone else not lying about the number of elements in the array,
  which would make the data format impossible to edit by hand.  Etc.
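  A minimal sketch of that usual reader strategy, in Common Lisp (the
  function name is made up, and I assume the stream contains readable
  Lisp data):

    ;; Collect objects of unknown count into a list, consing as we go,
    ;; then copy them into a freshly allocated vector of exactly the
    ;; right size in a single final pass.
    (defun read-all-as-vector (stream)
      (loop with eof = (gensym)
            for object = (read stream nil eof)
            until (eq object eof)
            collect object into objects
            finally (return (coerce objects 'vector))))

    ;; (with-input-from-string (s "1 (2 3) foo")
    ;;   (read-all-as-vector s))
    ;; => #(1 (2 3) FOO)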
  I recommend that programmers in non-list-supporting languages
  implement the lists in the protocol syntaxes I have defined as arrays
  for one good reason: such languages seldom have automatic memory
  management, either, so abandoning a reference causes memory leaks.  If
  they keep track of the array, they usually avoid the issue of keeping
  track of all the individual elements.  I have been kind to them by
  specifying that no list will be longer than 32 elements, so allocating
  all arrays that big remains a possibility.  Much of the implementor
  feedback has concerned the difficulty of reading lists of objects of
  unpredictable types; my recommendation not to convert to an internal
  type until you need it, but instead to keep the string form with a
  prefix character (the equivalent of a type tag), has resulted in some
  pretty weird code.

| Erik, you know that I am older than you, and have probably been
| programming at least as long as you.  Don't you imagine I might have
| quite a bit of experience with both arrays *and* lists?  (I have less
| experience with Common Lisp than you, I'm sure, though.)  I think my
| share of "painfully idiotic programming mistakes" is modest, yet I
| have not found that my "implementation really must use two elements
| per "array"".

  You recommend arrays to other people, not just to yourself, unless I'm
  _gravely_ mistaken about the purpose of your article.  Other people
  have much less experience than you and rely on your experience in
  order to form their own, with the least possible pain.  My experience
  with people who try to implement things with arrays is that they end
  up with very cons-like concepts once they outgrow the C "memory is
  really an array of bytes" idea.  The reason is simple: arrays with the
  properties we discuss are very, very hard to implement correctly.  If
  we _are_ talking about what a compiler designer would do for a
  language that natively supports these things in such a way that
  programmers would never _see_ the array, the kind of reasoning you
  have provided is completely misplaced.  I have therefore assumed that
  you are _not_ arguing primarily about what a compiler writer would do,
  although CDR-coding _is_ at that level.

| Assume a modest one word of overhead per object.  Then an extensible
| array with three elements takes 3 words for the array header (header +
| current length + pointer to data array) plus 4 words for the data
| array (1 word of header plus three words of data) or 7 words total.  A
| non-extensible array would only need 4 words.  If you need a separate
| header word for the array allocated length, add one more word,
| yielding 8 or 5 words, respectively.  A list of three cons cells takes
| 9 words.  Already for length three, the array wins.

  I think you need to explain how you ended up with 9 words for the
  list.  There is of course no header word for cons cells if you are
  going to implement so basic an idea in a language that supports it
  natively.  Type information is stored much more efficiently if you
  care about these things, or you allocate cons cells in blocks that
  only hold cons cells: arrays of cons cells, if you want.  Likewise,
  your arrays would carry type information in a similarly more efficient
  way.

  Here's how I count.  First, the data array.  It has one word per
  element, but it is aligned at, and always allocated in, some useful
  chunk, like 4 words, to make allocation significantly cheaper.  Since
  the data array may need to be moved when grown, there is a meta-object
  that contains the allocation size, the displacement offset, the active
  size, and a pointer to the data array: 4 words, one allocation unit.
  The pointer to an allocation unit would of course have four spare bits
  that could be used for type information, just as pointers to cons
  cells have three spare bits for type information, avoiding the
  header-word overhead completely for the most important built-in types.
  I need 4 cons cells to match an array of 4 elements.
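  Written out as a sketch (slot names invented for illustration; in a
  real system this meta-object would live below the language, not in
  it), the 4-word meta-object and the indirection it buys look like
  this:

    ;; One allocation unit describing a data array that is free to move
    ;; when the list grows.  All names here are illustrative only.
    (defstruct meta-object
      allocated  ; words allocated for the data array
      offset     ; displacement: index of element 0 in the data array
      active     ; number of elements currently in use
      data)      ; the data array itself

    ;; Element access indirects through the meta-object, so the data
    ;; array can move or be shared without changing the list's
    ;; identity.  Assumes the data array is a simple vector.
    (defun list-elt (m i)
      (svref (meta-object-data m) (+ (meta-object-offset m) i)))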
| If you use a memory allocation scheme not requiring any header words,
| then the extensible 3-element array takes 6 words, the non-extensible
| array takes 4 words, and the list version takes 6 words.  The array
| is equal to the list in space.

  Yes, but only at enormous expense in allocation costs with such a fine
  granularity.  Also, you really do not want your array to move around
  if it grows, or you lose the identity of lists.  Holding the size
  without a pointer to data that can move means you always have to copy
  the array if you do anything at all to it.  This means that you cannot
  perform any operations on the list implemented this way, only on the
  elements of the fixed-size list.  One of the charming things about
  lists is that they can grow so easily.  If we are to give them a
  _more_ efficient implementation, you can't take that away from them.

| Unless you use cdr-coding, you will need 2 pointers plus any
| per-object overhead (which may be zero) for each list element.  This
| is twice as much space as using an array.  This only wins for short
| lists, or for extensible arrays whose size is just past the doubling
| threshold.

  The overhead is twice as much only _after_ the overhead of arrays has
  been dealt with, including that copying-when-doubling scheme, which
  out of necessity must allocate memory for the same element many times
  over as the list grows.  By cleverly allocating allocation units for
  the data array from a single pool that is not used for anything else,
  the same way cons cells are allocated, you can grow your list
  painlessly and linearly, without copying, until you allocate another
  array.  You would typically inform the allocator that you are
  allocating for a growing list if you did this, however, to arrange for
  a large section of that space to be temporarily reserved for the
  growing array.  It would revert to normal allocation once you inform
  the allocator that you are through growing that particular list.  This
  cuts down on copying and on reallocating, and it speeds up just about
  everything while growing the list linearly, at the cost of a little
  more explicit code and knowledge sharing between allocator and user.
  In a language with support for these things you would want that, even
  if it would mostly be used by system functions that build lists as
  return values.  If you get the impression that I have implemented this
  scheme for real, that's exactly right.

| Because it is convenient, not necessarily because it is efficient.

  When processing data that produces values of unknown lengths, and I
  find myself doing that with an alarmingly high frequency compared to
  values of known lengths, the growing pains of the array approach have
  costs that do not win in the long run.  If you usually process data
  that produces values of known lengths with a similarly much higher
  frequency than unknown lengths, you still have much higher overhead
  than if you avoided the cleverness in the array implementation that
  takes care of the growing.  Hence, if you actually implement this,
  compare the approaches, and test efficiency at a product-quality
  level, lists made up of cons cells actually do win the efficiency
  contest, especially with a really fast cons-cell allocator, and not
  least because of the internal proximity of the data, which beats that
  of an array that has been moved around several times while the data it
  pointed to stayed where it was first stored (unless you implement
  different memory pools for array allocation units and data, in which
  case you don't get that added cost).

  The interesting thing, however, is what the language looks like after
  you decide to represent lists as arrays this way.  E.g., cdr-ing down
  a list can easily be emulated with displaced arrays, i.e., arrays that
  use another array's data for their elements, but starting from an
  offset for element 0.  (Hence the importance of storing type bits in
  the pointer to the allocation unit for meta-object and data array, but
  keeping them the same size and from the same allocation pool.)  You
  may find that cdr-ing down an array allocates meta-objects, while
  cdr-ing down a list does not, because the _rest_ of an array is no
  longer just a pointer to a cons cell, but a new displaced array.  If
  you want to push and pop elements on lists, you find that you want
  lists that grow and shrink on different ends, leading to different
  coding of many algorithms, or you can be really clever and build the
  list using displaced arrays that fill the allocation unit from right
  to left and move them around instead.  Many cool things affect what is
  efficient and convenient in surprising ways once you start using it.
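  Standard Common Lisp lets you watch that asymmetry directly, since it
  has displaced arrays built in (a minimal sketch; it assumes a
  non-empty vector):

    ;; Emulating CDR on a vector: the "rest" shares the original
    ;; storage, starting at an offset for element 0, but every step
    ;; allocates a fresh array header, where CDR on a list is just a
    ;; pointer you already had.
    (defun vector-rest (v)
      (make-array (1- (length v))
                  :displaced-to v
                  :displaced-index-offset 1))

    ;; (vector-rest #(1 2 3))  => #(2 3), sharing the data of #(1 2 3)
    ;; (cdr '(1 2 3))          => (2 3), no allocation at all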
  I did most of this work for a large automotive distributor back in the
  late 80's, when I had one of my many "pity I can't find a Lisp I can
  use" periods and had major use for the things Lisp makes easy.  The
  system was in use until they went from a dial-in, terminal-based
  system over both X.25 and regular phone lines to a web-based thing.  I
  was greatly amused when the new system cost 50 times more money to
  build (adjusted for inflation and computer-geek costs), required 30
  times more memory, processing power, and bandwidth to handle the same
  load, and took almost 20 times more code to get the same work done.
  My system paid for itself in 6 months of operation.  I doubt this new
  system will ever pay for itself before it is replaced or runs into
  maintenance costs that will keep it in the red for the rest of its
  life, but it's hot and flashy, and the users are very happy with the
  nicely laid-out web pages.  The users are also less efficient using
  it, though.  Modern technology!  But I digress nostalgically.

#:Erik
-- 
  "When you are having a bad day and it seems like everybody is trying
  to piss you off, remember that it takes 42 muscles to produce a
  frown, but only 4 muscles to work the trigger of a good sniper
  rifle."  -- Unknown