Support for MARC Records in PostgreSQL

Sitting in one of the nice pubs in Edinburgh during KohaCon, I had the idea to add MARC records as a proper datatype to the PostgreSQL database server. After a discussion with Marc Véron and Dobrica Pavlinusic about what that could mean, I decided to just try it and I have now a basic implementation (or, more a proof of concept). So here is some information on this:

If MARC records are a proper datatype, that means they are stored right in the database, are backed-up, can be restored, replicated etc. just with the standard database tools. If then a function is provided toaccess individual fields of a MARC record, then this can be used in SQL expressions, e.g. for selects or to create views etc. As PostgreSQL supports functional indexes, you can create indexes on individual MARC fields, giving you super fast access to your data.


How does it look?

There is a datatype called 'marc' for now:

CREATE TABLE books (
id serial,
sig varchar(16),
marc_record marc
);

MARC records are loaded into the database as raw, binary records, encoded in hexadecimal:

INSERT INTO books (sig, marc_record) VALUES ('a01b', '3030383830.....');

To access individual MARC fields, the function 'marc_field()' was created, it returns VARCHAR:

SELECT id, marc_field(marc_record, '020') AS isbn FROM books WHERE sig =
'a01b';

Of course MARC fields can be used to search data:

SELECT SELECT id, marc_field(marc_record, '020') AS isbn FROM books
WHERE marc_field(marc_record, '245') like 'whatever%';

An index on a specific field can easily be created:

CREATE INDEX books_isbn_idx ON books (marc_field(marc_record, '020'));

As a syntactic sugar, the expression marc_record@'020' is equal to marc_field(marc_record, '020').

Marijana Glavica kindly let me use a MARC database of ~250'000 records to make some tests, here are some real world examples:

books=# d test_marc
Table "public.test_marc"
Column | Type | Modifiers
--------+---------+--------------------------------------------------------
id | integer | not null default nextval('test_marc_id_seq'::regclass)
marc21 | marc |

books=# select count(id) from test_marc;
count
--------
246727
(1 row)

How many croatian books do they have?

books=# select count(id) from test_marc where
substring(marc_field(marc21, '008'), 37, 3) = 'hrv';
count
-------
52582
(1 row)

Of course much more complex queries are possible with this, and usingthe right indexes it is really, really fast. One query I tested whenfrom 8.4 seconds to 0.21 ms with the right index. That's a speedup of 40'000.


Thanks to Marijana, Dobrica, and Marc for feedback, interesting discussions, and ideas!