Maximising utility of content stores

On actKM, Richard Vines recently posed a question on how we can maximise the value of ICT investments in content/metadata management systems. All too often systems are implemented without a view to their eventual decommissioning or integration with other systems. This inevitably leads to expensive, error-prone and disruptive data synchronisation and/or migration projects down the track (and often not as far away as people imagine).

To answer Richard's question, the following analysis suggests principles for systems design to maximise utility and minimise redevelopment in this situation:

    1. The cost of investing in content (and metadata) for any system can be broken down into the following stages:
      a. creation/capture
      b. encoding
      c. cataloguing
      d. presenting for use
      e. archiving for future use

    2. Each of these stages in turn consists of:

      a. the upfront cost in developing a methodology for that stage, and
      b. the incremental cost of executing the methodology for an item.

    3. When moving or replicating items to another system, only the initial creation/capture stage is unnecessary. Unless your systems are 100% compatible, you will need to decode your content and catalogue entries to an interchange format, re-encode, re-catalogue, re-present, and re-archive. Standards are great at minimising the effort of doing this.

    4. An archive is an active, ongoing effort to maintain the accessibility of the items held within it, with a *presumption* that format migration will be necessary at some point. This means:

      a. holding instructions on how to encode, decode, and present every type of item held;
      b. having standardised metadata describing items and collections of items, including preservation and (where historical/evidential/recordkeeping issues matter) provenance information.

    5. To minimise archive maintenance work, archive items are often standardised into only a few formats. While retaining software is an alternative to documenting algorithmic instructions for presenting items, this then requires a complete hardware infrastructure for executing that software, which may itself be brittle and susceptible to obsolescence.

    6. Even if a formal archive is not created, a defacto archive can be realised if systems are designed to read from and export to standardised archive formats. Where these archive formats are completely or almost completely lossless, they in turn will become defacto interchange formats.

    7. Where a significant degree of content standardisation can be achieved, a bow tie ecosystem emerges. This is scalable, robust and evolvable across large, heterogeneous environments, but at the cost of extreme disruption should these core interchange (archive) formats be supplanted.

The true cost of an ICT system should include the costs involved in decoding to interchange formats and the effort involved in maintaining the archive. The ultimate vitality and usefulness of federations of systems with similar purposes will be determined by the health of the "bow tie ecosystem" that enables new systems to leverage existing systems and their data.
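
To make the stage breakdown in points 1-3 concrete, here is a minimal sketch in Python of how the true cost of holding content in a system, including a later migration, might be modelled. The stage figures are invented purely for illustration; only the structure (upfront methodology cost plus per-item cost, with creation/capture skipped on migration) reflects the principles above.

    # Minimal cost model for principles 1-3. All figures are illustrative only.
    def stage_cost(upfront, per_item, items):
        """Upfront methodology cost plus incremental cost per item (principle 2)."""
        return upfront + per_item * items

    def total_cost(costs, items):
        """Cost of putting `items` items through every stage of a system."""
        return sum(stage_cost(upfront, per_item, items)
                   for upfront, per_item in costs.values())

    def migration_cost(costs, items, decode_per_item):
        """Cost of moving items to another system (principle 3): creation/capture
        is not repeated, but decoding to an interchange format plus re-encoding,
        re-cataloguing, re-presenting and re-archiving all are."""
        repeated_stages = {k: v for k, v in costs.items() if k != "creation"}
        return decode_per_item * items + total_cost(repeated_stages, items)

    # Hypothetical (upfront, per-item) costs for each stage, for 10,000 items.
    costs = {
        "creation":     (5000, 20.0),
        "encoding":     (2000,  1.0),
        "cataloguing":  (3000,  4.0),
        "presentation": (4000,  0.5),
        "archiving":    (2500,  0.8),
    }

    print("initial build:  ", total_cost(costs, 10000))
    print("later migration:", migration_cost(costs, 10000, decode_per_item=0.5))

The better the interchange (archive) formats are, the closer the decoding cost and the repeated upfront methodology costs come to zero, which is the practical payoff of principles 3 and 6.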

Comments

Richard Vines (not verified) — Sat, 19/11/2011 - 09:05

Hi Stephen,

I have found the above a very interesting and useful summary and have thought quite a bit more about it since you posted this. One of the things I found interesting is this part of your post:

"Where a significant degree of content standardisation can be achieved, a "bow tie ecosystem" emerges. This is scalable, robust and evolvable across large, heterogenous environments but at the cost of extreme disruption should these core interchange (archive) formats be supplanted".

Your reference to the idea of "disruption" is what I have in mind when considering this notion of “sunk costs”. Thus, I am not sure if the idea of maximising the value of ICT investments is necessarily the overarching "best way" of thinking about the logic of such investments. Everyone wants to enhance the value of ICT (including content) investments, but no one wants the disruption and the cost of transition.

Does our notion of value now adequately include the benefits of the "bow tie ecosystem" and of moving towards it (as well as the risks of a bow tie ecosystem as outlined in the blog you quoted – the risks of parasitic predation as discussed)? These will be very interesting spaces to watch in terms of the Public Sector Information Release Frameworks being discussed, for example.

I still think that developing some criteria for avoiding ICT investments becoming sunk costs is warranted (and thus contributing to the maximisation of value). And, in thinking this through, I would argue that this requires a conceptual framework slightly beyond what you describe above. Perhaps it might also need to involve commitments to describing some context around content (temporal and institutional metadata, and perhaps also spatial metadata). In a larger societal sense, I think there would also need to be commitments to public metadata registries associated with regulation frameworks (or even voluntary industry agreements) to avoid the perils of a predisposition of bow tie ecologies towards parasitic predation (to take into account this point you talk about in your post).

Thus, I think to address the problem of sunk costs of content actually requires some commitments to context metadata and public metadata boundary objects that span enterprises, community organisations and government.

Take, for example, some of the context surrounding this section of a book (ID http://www.findandconnect.gov.au/vic/bib/P00000157.htm).

Some of the context (and perhaps even the idea of making explicit some aspects of the idea of “practice in context”) is navigable through related entries and related publications that appear on this particular site for this book section. The book section entry also has a persistent identifier.

We also know that this book section relates to the organisation Melbourne Orphan Asylum, which has now publicly been ascribed this persistent identifier: http://www.findandconnect.gov.au/vic/biogs/E000180b.htm. In principle, this description of the Melbourne Orphan Asylum could act as a "boundary object" for multiple other information systems that might be collecting information about this same organisation.

I am still unclear as to whether we can collate the sorts of things you describe and the sorts of things I have referred to above (including the possibility of establishing some basic value-oriented criteria) so as to point towards ICT investments that lead to resilient, sustainable content management and avoid the idea of content investments as representing a sunk cost. (By the way, it is not just the cost of content that becomes a sunk cost; the cost of the practice-in-context activities also potentially becomes a sunk cost as well. I have seen this problem so often when researchers grapple with the problem of ensuring their research reports and datasets remain accessible to policy makers over periods of time.)

I think you have made a brilliant start. Thanks.

Richard

Stephen Bounds — Sat, 19/11/2011 - 16:00

Hi Richard,

Thanks (as always) for your thoughtful commentary. One thing that may not have sufficiently come through is that a bow tie ecosystem is a Good Thing and one of the most robust systemic structures available. You should read the original Csete and Doyle article if you haven't already to get a better sense of how it works.

Parasitic predation refers to the fact that with standardised exchange blocks, it's far easier for an external agent to learn how to "steal" blocks for their own purposes. Typically a system evolves one or more mechanisms to defend the bow tie part of the system against predators (for example, the body's immune system).

But when we're talking about open access information, where the primary goal is to reduce costs of supply to others, predation is far less problematic. What's the worst that could happen? Harvesting of content by spiders/bots, or possibly denial of service using standard access channels. Both of these are reasonably easy to manage through standard security techniques.

I agree with your concept of capturing context around content. For my money, the generic requirements of this space are best discussed in the OAIS archival model.

For the specific "Find and Connect" example referenced above, I see two major problems:
(1) there aren't any "true" persistent identifiers used to represent the information (eg DOI, PURL, ARK)
(2) the information is presented in HTML format rather than a semantic markup format (eg RDF). Although there appears to be an attempt to use microformats in the markup, it's incomplete and not documented. Combining OAIS with principle (6) above would dictate that instructions for extracting/decoding the data representation need to be associated with the pages in order for them to become a defacto archive.
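
As a rough illustration of point (2) – not a description of how Find and Connect actually publishes its data – here is a sketch in Python using rdflib that would describe the same organisation as RDF/XML under a hypothetical ARK-style identifier. The ARK shown and the choice of FOAF predicates are invented for the example:

    # Illustrative only: publishing an entity description as RDF/XML under a
    # hypothetical ARK-style persistent identifier (the ARK below is made up).
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    # Invented persistent identifier (a real one would be minted by a registry).
    asylum = URIRef("https://example.org/ark:/99999/E000180b")

    g = Graph()
    g.bind("foaf", FOAF)
    g.add((asylum, RDF.type, FOAF.Organization))
    g.add((asylum, FOAF.name, Literal("Melbourne Orphan Asylum")))
    # Link back to the existing human-readable page rather than replacing it.
    g.add((asylum, RDFS.seeAlso,
           URIRef("http://www.findandconnect.gov.au/vic/biogs/E000180b.htm")))

    print(g.serialize(format="xml"))

The same triples could equally be embedded in the existing HTML pages (eg as RDFa), which would be a natural extension of the microformats attempt already present in the markup.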

Richard Vines (not verified) — Sat, 19/11/2011 - 17:36

Hi Stephen,

The link to the Csete and Doyle article works this time, so I will read it. I had read the notion of predation in a different way.

Yes, I am aware of the OAIS archival model. The Find and Connect model is in alignment with the metadata harvesting requirements of the National Library’s Trove contribution model – in fact, more than this, it complies with the next-generation requirements associated with EAC-CPF direct ingest protocols (see http://trove.nla.gov.au/general/contribute/).

You are only partially right about the two major problems. Yes, the links I gave you are not true persistent identifiers – but actually that was my point. There is a need for public metadata registries that provide this sort of service to overcome current problems. This was one of the purposes of this project: https://wiki.nla.gov.au/display/ARDCPIP/ARDC+Party+Infrastructure+Project. It will be interesting to watch whether industries agree to collaborate around these sorts of problems. I think the lack of this sort of infrastructure is one of the reasons why large-scale public infrastructure (ICT) projects fail or become second-class defacto standards that kill innovation.

Re the issue around HTML – yes, the data is rendered as static HTML pages separate from the data itself. But the data can also be exported as XML (as EAC XML, i.e. semantically marked up) for metadata sharing and a range of other types of services, including ingest services via OAI.

Re your point about RDF, that is a topic for another time.

Thanks,

Richard

Stephen Bounds — Sun, 20/11/2011 - 05:59

Hi Richard,

Firstly and most importantly, what you describe sounds fantastic and a great achievement. I assume you've had some direct or indirect involvement with the project?

I am curious though as to why you feel a "second class defacto standard" would "kill innovation"? It sounds like you are concerned about what amounts to a premature optimisation ... but one of the beauties of structured or semistructured data is that it is quite feasible to present the same information in a number of formats at a marginally greater cost.

So why would it not be better, for example, to:

  • Create a system for generating ARK IDs for each page and make these visible or available in the metadata to guard against systems migration to new URLs in the future
  • Allow users to directly download the EAC format by appending the query parameter '?eac'
  • Present the information as RDF/XML markup in addition to EAC
  • Publicly declare which standards you choose to adhere to on each page
  • etc...

Each of these would provide additional vectors for systems integration or standardisation, and hence more chance of evolution towards a bow tie ecosystem.
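
As a very rough sketch of the second and third suggestions (purely illustrative – the route, record store and serialisers below are invented for the example and bear no relation to how Find and Connect is actually implemented), a single URL could serve HTML by default, EAC-style XML via a '?eac' query parameter, and RDF/XML on request:

    # Illustrative sketch only: one URL, three representations.
    from flask import Flask, Response, abort, request

    app = Flask(__name__)

    # Stand-in for a real repository of entity records.
    RECORDS = {"E000180b": {"name": "Melbourne Orphan Asylum"}}

    def as_html(record):
        return "<html><body><h1>%s</h1></body></html>" % record["name"]

    def as_eac(record):
        # A real implementation would emit a complete EAC-CPF document.
        return "<eac-cpf><nameEntry>%s</nameEntry></eac-cpf>" % record["name"]

    def as_rdf(record, uri):
        return ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
                ' xmlns:foaf="http://xmlns.com/foaf/0.1/">'
                '<foaf:Organization rdf:about="%s"><foaf:name>%s</foaf:name>'
                '</foaf:Organization></rdf:RDF>' % (uri, record["name"]))

    @app.route("/biogs/<record_id>")
    def entity(record_id):
        record = RECORDS.get(record_id)
        if record is None:
            abort(404)
        if "eac" in request.args:                # e.g. /biogs/E000180b?eac
            return Response(as_eac(record), mimetype="application/xml")
        if request.args.get("format") == "rdf":  # e.g. /biogs/E000180b?format=rdf
            return Response(as_rdf(record, request.base_url),
                            mimetype="application/rdf+xml")
        return as_html(record)

Each representation is generated from the same underlying record, which is why the marginal cost of adding another format stays low once the data is structured.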

Richard Vines (not verified) — Sun, 20/11/2011 - 06:55

Thanks Stephen,

The origins of this project have multiple seeds; one important one is described in this chapter on knowledge brokering: http://epress.anu.edu.au/knowledge/pdf/ch05.pdf

From a technology narrative perspective, an important story is told in this piece.

McCarthy, G. and Evans, J. 2007. 'Mapping the Socio-Technical Complexity of Australian Science: From Archival Authorities to Networks of Contextual Information'. In Respect for Authority: Authority Control, Context Control and Archival Description. Journal of Archival Organization, Volume 5, Numbers 1–2, 2007.

The theme of “defacto standard” relates to experiences in the child welfare sector, where systems originally designed for internal use within departments are then deployed to partner agencies that deliver services on behalf of government. This is done without putting in place the sort of public knowledge infrastructure that I am talking about.

The impact on practice-related innovation is profound, as is summarised in the above article on knowledge brokering. As larger and larger amounts of investment are accrued, so too grows the prospect, from a practice perspective, that these investments should be treated as sunk costs.

But this does not happen because of the need to protect the value of the investments. In fact, your point about irreversible sunk costs becomes partially relevant, because sometimes important practice information cannot be retrieved at all – we are left with a defacto standard because there is no exit strategy available.

Your points are all excellent suggestions and I am sure things will go in that direction – so much of the knowledge used in doing stuff is implicit and is not made explicit. The only additional comment I would make is that innovation can often only be done when there is a market (and a funder willing to join in on the next step). I think we would both bemoan the lack of R and D culture (and commitments to negotiating public knowledge protocols) in the space we are operating in. It seems utterly amazing to me with the scale of the need across multiple industries and jurisdictions that we don’t have better institutional R and D going on.

I tried to highlight some of these “sorts of requirements” – an example is the piece I did on “Cooperative federalism, social inclusion and interoperability”, written in 2008 and downloadable at this URL: http://www.vcoss.org.au/doingitbetter/issues/interoperability.htm

Thanks,

Richard

Stephen Bounds — Sun, 20/11/2011 - 14:48

I've always thought there are four critical factors militating against this kind of investment in public knowledge protocols. Each by itself is manageable, but taken as a whole they form a kind of Gordian knot.

The first is the problem of indirect benefit. Building a facility to share information with other people has no direct benefit to the builder; rather, it is only when reciprocity is established that a net benefit exists.

The second is the not-invented-here syndrome. It's very easy to get bogged down in the relative merits of one technical solution over another and to end up in a kind of stalemate, since the "losers" will have to spend additional intellectual effort in aligning with the "winner's" solution.

The third is the first mover disadvantage. As more people join a standard, there is a smaller incremental cost of participation, but the first movers will pay a substantial premium in developing software, protocols and platforms.

Lastly, when the above three preconditions apply, normally the best way to fix the problem is for government to intervene and bootstrap the process. But without the ability to test and compare the relative merits of various exchange solutions, the outcome is likely to be a camel that gets abandoned for delivering too few of the promised benefits.

Ultimately I think the solution has to originate from archives such as the National Library. But they have to commit to simplifying both sides of the equation, ie providing tools that plug in to databases to simplify conversion into XML EAC format, as well as tools to import/consume the resources.

Richard Vines (not verified) — Sun, 20/11/2011 - 15:42

Perhaps what you are suggesting would be the knot of the bow tie.

If that is the case, I wonder if you have any thoughts on what the regulatory, control and feedback loops would consist of, and whether there need to be some extended networks involved, beyond archival representatives like the NLA, to make something like this work?

Interesting conversation Stephen. Thanks.

Richard

Stephen Bounds — Sun, 20/11/2011 - 23:11

I agree Richard, I've found the conversation fascinating.

This comes under the heading of wild speculation, but:
- regulation: QA standards checking of data published in a given format
- control: assignment of registry IDs for a permanent URL schema
- feedback: adoption/non-adoption of the system by external parties
