A simple guide to using shared drives to capture & classify electronic documents and emails

by Frank 18. July 2014 06:00

I have written previously about ways to solve the shared drives problem (click here) and I have written numerous articles (and a book) about ways to manage emails and electronic/digital records. However, we still receive multiple requests from customers and prospective customers about the best, and simplest, way to effectively manage these problems.

The biggest stumbling block and impediment to progress in most cases is the issue of a suitable taxonomy or classification system. Time and time again I see people putting off the solution while they spend years and tens of thousands or hundreds of thousands of dollars grappling with the construction of a suitable taxonomy. I have written about this topic previously as well and if you want my recommendations please click on this link.

If you really want the simplest, easiest to understand, easiest to use and lowest cost way to solve all of the above problems then please forget about spending the next twelve to eighteen months grappling with the nuances of your classification system. It isn’t necessary.

What you need instead is a natural classification structure that reflects your business processes. Please give your long-suffering end users something they will instantly recognize and can easily work with because it is familiar from their day to day work. Give them something to work with that doesn’t require them to become amateur records managers battling to decipher a complex, hierarchical classification system that requires an intricate knowledge of classification theory to interpret correctly. Give them something that makes it as easy as possible to file everything in the right place first time with absolutely minimal effort. Give them something that makes it as easy as possible to find something.

What I am proposing isn’t a hundred-percent solution and it won’t suit every organization but I guarantee that it will turn chaos into order in any organization that implements it. You may well see it as an eighty-five-percent solution but that is a hell of a lot better than no solution. It is also easy and fast to implement and relatively low cost (you will need some form of RM software).

First up you need to make decisions about what kind of business you are. Notice that I said “what kind of business you are” not “what kind of records you manage” or “how your business is structured”. Most importantly, strongly resist the temptation to base your classification structure on your existing business structure or organization’s departments/agencies and instead base it on your most common business processes. Please refer to the following extract from:

Overview of Classification Tools for Records Management by the National Archives of Australia, ISBN 0 642 34499 X (an excellent reference document if you need to understand classification systems).

“Classifying records and business information by functions and activities moves away from traditional classification based on organisational structure or subject. Functions and activities provide a more stable framework for classification than organisational structures that are often subject to change through amalgamation, devolution and decentralisation. The structure of an organisation may change many times, but the functions an organisation carries out usually remain much the same over time.”

I would also strongly resist the temptation to build your classification structure on content; it is way too difficult. Instead, as I have said above, base it on your common business processes.

When I say classification structure I mean the way you name and organize folders in your shared drives. I can’t give you a generic solution because I am not that clever; I don’t know enough about your business. I can however, give you an example.

Please also remember that for the most part, we are dealing with unstructured source information; Word, Excel, PowerPoint, Emails, etc. Emails are a little easier to deal with because they have a limited but common structure, e.g., Date Received, Sender, Recipient, CC and Subject. With other electronic documents we have far less information and are usually limited to Author (not reliable), Date Created, Date Modified and Filename. Ergo, as I said earlier, trying to base a classification system on the content of unstructured documents is both difficult and inexact. It is certainly doable but you will have to spend a lot more money on consulting and sophisticated software to achieve your ends.

In my simple example of my simple system I am going to assume that your business is customer (or client) centric, i.e., as opposed to being case-centric or project-centric, etc. The top level of your classification structure therefore will be the client name and/or number. To make it as simple as possible I am going to propose only two levels. The second level represents your most common business processes, that is, what you do with each customer. So for example, I have:

Customer Name
    Correspondence
    Contracts
    Quotes & Proposals
    Orders
    Incidents

I am also not going to differentiate between emails and other types of electronic documents; I am going to treat them all the same.
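To make the two-level structure concrete, here is a minimal Python sketch that creates the second-level folders for a customer. The function name, the root path and the customer name are my own illustrative assumptions; only the folder names come from the example above.

```python
from pathlib import Path
from typing import List

# Second-level folders from the example above: one per common business process.
PROCESS_FOLDERS = ["Correspondence", "Contracts", "Quotes & Proposals", "Orders", "Incidents"]


def create_customer_folders(root: Path, customer_name: str) -> List[Path]:
    """Create <root>/<customer>/<process> for each process folder and return the paths."""
    created = []
    for process in PROCESS_FOLDERS:
        folder = root / customer_name / process
        folder.mkdir(parents=True, exist_ok=True)  # safe to run again for an existing customer
        created.append(folder)
    return created
```

Calling create_customer_folders(Path("S:/Customers"), "Acme Pty Ltd") would give every customer exactly the same five sub-folders, which is the whole point: nobody has to think about where things go.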

Now how does this simple system work?

Staff producing electronic documents don’t have their ‘own’ shared drive; all staff use the common classification structure. This is very important: let one or more people be exceptions and you no longer have a system you can rely on to meet your needs for reliable retrieval and any compliance legislation you are subject to.
Staff drag and drop or ‘save-as’ emails from their email client to the correct sub-folder.
Similarly, staff save (or drag and drop) electronic documents into the correct sub-folder. You can control access if required by applying security to electronic documents.
You purchase or build a document repository (based on any common database such as SQL Server, MySQL, etc.) and within this repository you replicate the folder structure of your shared drives with logical folders and subfolders.
You purchase or build a tool that constantly monitors the shared drives (e.g., using a file system watcher such as the .NET FileSystemWatcher class) and that instantly captures a copy of any new or modified document (you do need to configure your repository to automatically version modified documents). You may also decide to automatically delete the original source document after it has been captured.
You build or purchase a records and document management software package that allows you to index, search and report on all the information in your repository.
You train your staff in how to save and search for information (shouldn’t take more than a half to one day) and then you go live.
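The capture step above can be sketched in a few lines. This is not the .NET watcher approach mentioned in the list; it is a simple polling pass in Python that I am using purely for illustration, and the function name, the 'seen' dictionary and the v1, v2 version naming are all my own assumptions. It also assumes the repository folder sits outside the shared drive it is watching.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def capture_changes(shared_root: Path, repository_root: Path, seen: Dict[Path, int]) -> List[Path]:
    """Copy every new or modified file under shared_root into repository_root.

    'seen' maps each source file to the modification time (in nanoseconds) that
    was last captured. Every capture is kept as a new version (v1, v2, ...) in a
    repository folder that mirrors the document's path on the shared drive.
    """
    captured = []
    for source in sorted(shared_root.rglob("*")):
        if not source.is_file():
            continue
        modified = source.stat().st_mtime_ns
        if seen.get(source) == modified:
            continue  # unchanged since the last pass
        # One repository folder per document, replicating the shared drive structure.
        version_dir = repository_root / source.relative_to(shared_root)
        version_dir.mkdir(parents=True, exist_ok=True)
        version = sum(1 for _ in version_dir.iterdir()) + 1
        target = version_dir / f"v{version}{source.suffix}"
        shutil.copy2(source, target)
        seen[source] = modified
        captured.append(target)
    return captured
```

A real tool would run this continuously (or react to file system events instead of polling) and would write to a database-backed repository rather than plain folders, but the logic is the same: detect the change, copy the document, keep the old version.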

I would also recommend applying a retention schedule based on sub folder (e.g., contracts) and date created and have the records management system automatically apply it to manage the lifecycle of captured documents. There is no sense in retaining information longer than you have to; it is also a dangerous practice.
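As a rough illustration of a retention rule keyed on sub-folder and date created, consider the Python sketch below. The retention periods are invented for the example and are not recommendations; your own periods must come from your retention schedule.

```python
from datetime import date

# Hypothetical retention periods in years, keyed by second-level folder name.
RETENTION_YEARS = {
    "Correspondence": 2,
    "Contracts": 7,
    "Quotes & Proposals": 3,
    "Orders": 7,
    "Incidents": 5,
}


def disposal_date(sub_folder: str, created: date) -> date:
    """Return the date on which a document created in sub_folder may be disposed of."""
    target_year = created.year + RETENTION_YEARS[sub_folder]
    try:
        return created.replace(year=target_year)
    except ValueError:
        # Created on 29 February and the target year is not a leap year.
        return created.replace(year=target_year, day=28)


def is_due_for_disposal(sub_folder: str, created: date, today: date) -> bool:
    """True once the retention period for the document has expired."""
    return today >= disposal_date(sub_folder, created)
```

Because the rule needs nothing more than the folder name and a date, the records management system can apply it automatically at capture time with no input from the end user.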

Please note that the above is just an example and a very simple one at that. You need to determine the most appropriate folder structure for your organization.

WARNING

Do not let the folder structure become overly complex and unwieldy. If you do, it won’t work and you will end up with lots of stuff either not captured or captured to the wrong place. The basic rules are that if it takes more than a few seconds to decide where to file something then it is too complex, and that any structure more than three levels deep is too complex.
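The depth rule is easy to police automatically. The following Python sketch lists any folder that sits more than three levels below the root of the shared drive; the function name and the idea of running it as a periodic check are my own assumptions.

```python
from pathlib import Path
from typing import List

MAX_DEPTH = 3  # the rule of thumb from the warning above


def folders_too_deep(root: Path, max_depth: int = MAX_DEPTH) -> List[Path]:
    """Return every folder under root that sits more than max_depth levels below it."""
    return sorted(
        folder
        for folder in root.rglob("*")
        if folder.is_dir() and len(folder.relative_to(root).parts) > max_depth
    )
```

An empty result means the structure is still within the rule; anything it returns is a candidate for flattening before users start filing in the wrong place.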

And finally, this isn’t just a theory, it is something we do in our organization and it is something many of our customers do. If you would like to read more on this approach there are some white papers and more explanations at this link. Alternatively, you can contact us and ask questions at this link.

Good luck.

Tags: Classification, Classification System, Classification Scheme, Thesaurus, Keyword Thesaurus, Taxonomy, Ontology, Records Management, Document Management, Electronic Document Management, Electronic Document and Records Management, Email Management, Email Archiving, RecFind, RecFind 6

Do you really need a Taxonomy/Classification Scheme with a Records Management System?

by Frank 26. October 2013 06:00

Background

Classification schemes are a way to group or order data; the objective being to group ‘like’ objects together. Classification schemes have been in use for tens of thousands of years, probably beginning when man first realized that there were different types of animals and plants.

We use classification schemes both to make things easier to find and to add value to a group of objects. By adding value I mean that a classification (describing a group) may provide more information about the members of that group than is obvious from an analysis of a single member; this could be referred to as semantics.

Classification schemes are used in all walks of life, for example; in business, in science, in academia and in politics. Are you a liberal or a conservative? Is it a mammal? If it is, is it a marsupial or a monotreme or a placental mammal? This last example illustrates the usual hierarchical arrangement of classification schemes.

In business, we have long used classification schemes to order business documents, that is, records of business transactions. We are all familiar with file folders and filing cabinets; these things are tools of a classification scheme. They make implementing a classification scheme easier as do numbering systems, colors, barcodes and Lektrievers.

With the growing commercial availability of mainframe computers in the 1960s came our first attempts to computerize filing systems. It was also in the 1960s that we saw the first text indexing systems and the first sophisticated search algorithms.

The advent of text indexing and search algorithms allowed us to do a much better job of classifying data but more importantly, they allowed us to do a much better job of finding data.

Let’s not get in a debate about terminology and acronyms

Our industry (information management to use an all-encompassing term) is often its own worst enemy. It creates terms and acronyms at will with both confusing and overlapping definitions. Then it wonders why normal end-users exhibit first bewilderment and then disinterest. Let’s look at a few examples, e.g., RIMS, RMS, DMS, EDRMS, IAMS, CMS, ECM and KMS.

Do you realize that the process of records management is part of each of the preceding acronyms?

For my part I will stick with my old friend the world records management standard, ISO 15489. It tells us that records are evidence of a business transaction and that records are in any form including paper, electronic documents and emails (I know emails are electronic documents but the world generally differentiates them because emails are ‘different’).

So as far as I am concerned the term Records Management System or RMS includes everything we do and is easily recognized and understood so this is the term and acronym I will use in this paper.

Browsing versus searching

Classification systems are very good at making it easier for us to find information by browsing but not very helpful when we are searching.

Most classification systems require you to first ‘browse’ before finding the exact information you want; you usually have to examine multiple objects before you find the one you want. But this is what classification systems are very good at; because they organize data in a logical (to a human being) way, we usually know where to begin looking. This is why a classification scheme works so well with a manual filing system (multiple cabinets or multiple shelves of file folders).

Classification schemes are great for physical data and, I would say, absolutely necessary for physical data; how else would you organize fifty-thousand file folders (containing seven and a half million pages) in a huge filing room with hundreds of shelves?

However, with computers I don’t need to browse through multiple objects to find the one I want. By using techniques more appropriate to the computer than the filing room, I can search for and find exactly what I want almost instantly. I do not need to leaf through the file folder, I can go directly to the page or directly to the word. I can use the power of the computer.

The following statement will probably be seen as heresy by most practicing records managers but we actually don’t need a classification system (Taxonomy) when computerizing records. We just need a way to index and then search for information.

We need to organize our data so an ordinary end-user can easily find what they need without having to be a trained, professional records manager.

Indexing versus classifying

Now I know my interpretation of these two terms will not thrill everyone but the differentiation is an important part of my hypothesis.

Let’s start by looking at two kinds of books, a reference book and a work of fiction. Both have a table of contents (a classification system usually called a TOC) but only one (the reference book) has an index (usually).

The TOC for the reference book is both useful and often used. The TOC for the work of fiction is both not useful and rarely used (readers rarely need more than a bookmark).

The TOC for the reference book is a way to organize information into a logical form, grouping ‘like’ information together in chapters and sections. A TOC for the work of fiction is just a list of chapters; it serves little or no purpose for the typical ‘end-user’, the reader.

All the reader of a fiction book really needs is two things; a bookmark and a ‘memory’ of the author, title, cover combination so he/she doesn’t accidentally buy it again at the airport bookshop before that dreaded long and boring flight.

The reader of the reference book actually needs both the TOC and the index for browsing (the TOC) and searching (the index).

A work of fiction doesn’t usually have or need an index because the end-user doesn’t require it. A reference book usually has an index and it is often used to go direct to a page (or pages) and locate something very specific.

Drawing parallels with our broader topic, some information needs both a classification system and an index, some information needs just an index and some doesn’t require either (e.g., works of fiction).

Generally speaking, scientific collections require a classification system (a scientific taxonomy); for example, the study of plant species and the study of animal species (e.g., using a phylogenetic classification system). Scientists simply could not communicate with each other without having a detailed and exact classification system in place. But, most end-users are not scientists; they are just people trying to find the best place to store something and want to find it again with the least amount of effort and pain.

My contention is that we can solve all ‘content management’ and records management needs with a solution based on the application of a sensible, simple and self-evident (read that as easy to use or human-oriented) indexing system plus the required searching capabilities (i.e., covering both Metadata and full text). There is a better way.

What indexing system?

Whenever I consult with customers who are contemplating the capture and organization of data (hopefully into information) I always give the same advice. That is, “When you are thinking about how to index data first think about how you will find it later.” Ask this key question of your end-users, “When you are about to search for information what do you usually know about it?” For example:

Do you know the last name?
Do you know the first name?
Do you know the date of birth?

A good indexing scheme reflects real life usage of the system; it reflects how ordinary humans work and ‘see’ information. Put simply, it indexes the information people will later need to search on. It indexes the information people understand and are comfortable with because it is self-evident.

Indexing Emails

An email is usually described as an unstructured document (the same way a Word or Excel document is described as being ‘unstructured’) but in fact it does have structure. Even better, everyone is familiar with an email’s structure so we have very little to teach end-users; that is, we have a simple and self-evident ‘natural’ set of Metadata items to index.

Date of email
Sender
Recipient
CC
BCC
Subject
Text of the body of the email
Text of any attachments

For any normal end-user trying to find an email this is how they would envision an appropriate search. They wouldn’t care that the email has been classified six levels deep using the world’s most sophisticated Business Classification Scheme (BCS).

Understanding what end-users typically ‘know’ before they do a search determines what elements you have to index. This is the key to implementing a successful indexing system.

The above 8 elements of an email are self-evident insomuch as, “Of course I need to be able to search on the sender or recipient or subject….”
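To show how little work it takes to pull out this ‘natural’ Metadata, here is a minimal Python sketch using the standard library’s email package. The function name and the dictionary keys are my own; the sketch reads the headers and the plain-text body only and deliberately leaves out attachment text, which needs a text extractor for each file type.

```python
from email import message_from_string, policy
from typing import Dict


def index_email(raw_message: str) -> Dict[str, str]:
    """Extract the self-evident index fields from a raw (RFC 5322) email message."""
    msg = message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))  # the plain-text part, if there is one
    return {
        "date": str(msg.get("Date", "")),
        "sender": str(msg.get("From", "")),
        "recipient": str(msg.get("To", "")),
        "cc": str(msg.get("Cc", "")),
        "bcc": str(msg.get("Bcc", "")),  # usually only present in the sender's own copy
        "subject": str(msg.get("Subject", "")),
        "body": body.get_content().strip() if body is not None else "",
    }
```

Every field returned is one an ordinary end-user already understands, so each can go straight into the index without any classification decision.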

Indexing Electronic Documents

Now let’s look at ordinary electronic documents (i.e., not emails) because they are much less structured. We all know there are ways to add a common structure using features of MS Office like the information dialog box (asking for keywords etc) and templates and smart tags but these things are rarely and inconsistently used.

With shared drives we usually find some form of ‘evolved’ classification system because managing electronic documents in shared drives is akin to managing millions of pieces of paper in tens of thousands of file folders in hundreds of filing cabinets. Unfortunately, the good intentions and purity of design of the original architects of the shared drives folder/sub folder naming conventions (a classification system) are soon corrupted as users make uncoordinated changes and the structure soon becomes unwieldy and incomprehensible.

In my opinion shared drives are OK for the creation of documents (i.e., a work area) but not OK for the management of documents. In fact I would say shared drives are absolutely hopeless for the management of documents as history and practice will attest.

Once again we need an appropriate indexing system and once again we need to ask, “What do people know at the time of the search?” For example:

Original filename
Original path/filename
Type/suffix – e.g., .DOC, .XLS, .PDF, etc
Author
Subject
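Most of these items can be read straight from the file system, as the following Python sketch shows. The function name and dictionary keys are illustrative assumptions. Author and Subject are left empty because the file system does not hold them; they would have to come from the document’s own properties or from the person filing it.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def index_document(path: Path) -> Dict[str, Any]:
    """Collect the index fields that the file system itself can supply."""
    info = path.stat()
    return {
        "original_filename": path.name,
        "original_path": str(path),
        "type": path.suffix.upper(),  # e.g. ".DOC", ".XLS", ".PDF"
        "date_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "author": None,   # not available from the file system
        "subject": None,  # not available from the file system
    }
```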

Metadata and the Dublin Core

Let me quote from the Dublin Core website:

http://dublincore.org/

“The Dublin Core Metadata Element Set is a vocabulary of fifteen properties for use in resource description. The name "Dublin" is due to its origin at a 1995 invitational workshop in Dublin, Ohio; "core" because its elements are broad and generic, usable for describing a wide range of resources.”

To quote Wikipedia:

http://en.wikipedia.org/wiki/Dublin_Core

“It provides a simple and standardized set of conventions for describing things online in ways that make them easier to find. Dublin Core is widely used to describe digital materials such as video, sound, image, text, and composite media like web pages.”

The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 elements.

Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights

To my mind the Dublin Core is an excellent set of elements for describing almost any ‘record’ because it is both simple and appropriate to both computers and ‘normal’ end-users. As a professional, I like the elegance of the Dublin Core.

I also like the basic principle because it fits in with my hypothesis. That is, there is a better way to store, index and find records than a complex and unwieldy Taxonomy.

The Full Solution?

We need an application that stores documents of all types, i.e., all types of content.
We need an application that indexes both Metadata and full text.
We need an application with a customer configurable Metadata model.
We need an application that allows you to search on both Metadata and full text in a single search.
We need a search that combines Boolean and numeric operators, e.g., AND, OR, NOT, =, <, >, etc.
We need a ‘standard’ Metadata definition (a Class if you will) that includes a simple (not more than 20 in my estimation) set of data elements that includes all of the elements necessary to index all of the types of documents (including file folders and paper) that you manage.
We need an application that includes all types of data capture, e.g., from the file system, from the native application, from a scanner, etc.
We need an application with a comprehensive security system.
We need an application with all reporting options, e.g., both standard reports and ad hoc reports.
We need an application with a configurable audit trail.
We need an application with comprehensive import and export capabilities.
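The single search across Metadata and text is the heart of this list, so here is a small Python sketch of the idea using an in-memory SQLite database. The table, the column names and the sample records are all invented for illustration, and the LIKE scan is only a stand-in for the proper full-text index a real product would use.

```python
import sqlite3
from typing import List


def build_demo_index() -> sqlite3.Connection:
    """Create a tiny in-memory index holding Metadata columns plus document text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (title TEXT, author TEXT, type TEXT, dated TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
        [
            ("Lease agreement", "Frank", "contract", "2014-03-01", "Annual lease renewal terms"),
            ("Renewal quote", "Frank", "quotation", "2014-06-15", "Quote for licence renewal"),
            ("Old renewal quote", "Frank", "quotation", "2012-01-10", "Quote for renewal of support"),
            ("Complaint", "Mary", "complaint", "2014-07-01", "Complaint about renewal delays"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, author: str, phrase: str,
           dated_after: str, excluded_type: str) -> List[str]:
    """One query mixing Metadata (=, >, NOT) with a text match, joined by AND."""
    rows = conn.execute(
        "SELECT title FROM records "
        "WHERE author = ? AND content LIKE ? AND dated > ? AND NOT type = ? "
        "ORDER BY dated",
        (author, f"%{phrase}%", dated_after, excluded_type),  # ISO dates compare correctly as text
    )
    return [title for (title,) in rows]
```

With the sample data, search(build_demo_index(), "Frank", "renewal", "2013-12-31", "contract") finds only the 2014 quotation: the end-user states what they know (who wrote it, roughly when, a word in the text) and never touches a classification scheme.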

The standard Metadata definition (Master Metadata Class)

I have come up with a limited set of elements that I believe can be used to index and find any type of record, paper or electronic. I have borrowed heavily from the Dublin Core because it makes good sense to do so; there is no need to reinvent the wheel.

#	Element	Explanation
1	Title	A name given to the record. Typically, a Title will be a name by which the record is formally known. Text, e.g., "Business Plan for 2010"
2	Author(s)	The sender or author, E.g., Mark Twain or f.mckenna@k1corp.com
3	Dated	The original date of the document or published date
4	Date Received	Date received by the recipient or recipient's organization, whichever is the earlier
5	Original Name	The filename or path\filename for electronic documents, e.g., C:\franks stuff\sample.xls
6	Primary Identifier	An unambiguous reference to the record within a given context. E.g., The file number
7	Secondary Identifier	An unambiguous reference to the record within a given secondary context. E.g., The case number or contract number or employee number
8	Barcode	Barcode number or RFID tag
9	Subject	The topic of the record. Typically, the subject will be represented using keywords or key phrases. Recommended best practice is to use a controlled vocabulary.
10	Description	An account of the record. Description may include but is not limited to: an abstract, a table of contents, a graphical representation, or a free-text account of the record.
11	Content	Words or phrases from the text content of the main document and attached documents
12	Contents	Description of contents if the document is a container, e.g., an archive box
13	Recipient(s)	Addressed to, sent to etc. People or organizations.
14	CC recipient(s)	CC and BCC recipients
15	Publisher	An entity responsible for making the record available. Company or organization that either published the document or that employs the author
16	Type	The nature or genre of the record, usually from a controlled list, e.g., complaint, quotation, submission, application, etc.
17	Format	The file format, physical medium, or dimensions of the record. E.g., Word, Excel, PDF, etc
18	Language	e.g., English, French, Spanish
19	Retention	The retention code determining the record’s lifecycle
20	Security	Access rights, security code, etc

My contention is that by using an ‘index set’ like the above 20 Metadata elements you can index, manage and retrieve any ‘record’ regardless of form and content.
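For readers who think in code, the 20 elements translate naturally into a single record type. The Python sketch below is one possible rendering; the field names, types and defaults are my own assumptions and the table above remains the definition.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class MasterMetadata:
    """One index entry; the fields follow the 20 elements in the table, in order."""
    title: str
    authors: List[str] = field(default_factory=list)
    dated: Optional[date] = None
    date_received: Optional[date] = None
    original_name: str = ""
    primary_identifier: str = ""
    secondary_identifier: str = ""
    barcode: str = ""
    subject: str = ""
    description: str = ""
    content: str = ""
    contents: str = ""
    recipients: List[str] = field(default_factory=list)
    cc_recipients: List[str] = field(default_factory=list)
    publisher: str = ""
    record_type: str = ""   # "Type" in the table
    file_format: str = ""   # "Format" in the table
    language: str = ""
    retention: str = ""
    security: str = ""
```

Only the title is mandatory here, so the same class can describe an email (authors, recipients, dated, content), a spreadsheet (original_name, file_format) or an archive box (barcode, contents) without any of them needing a different structure.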

What about all the standards ‘out there’?

There is a plethora of local, state, federal, industry and international standards pertaining to the management of records. Examples are DoD 5015, MoReq2, Dublin Core, ISO 15489, VERS etc and literally thousands of standards for Metadata.

The problem with most of these standards is that they are extraordinarily difficult to read and understand (even the Dublin Core documentation can be heavy going). I would draw a parallel back to the times when the Bible was in Latin but Christians were supposed to order their lives by its teachings. The problem being that only about 0.025% of Christians spoke Latin. Ergo, how do you order your life by a book you can’t read?

My assertion is that most records managers do not fully understand the standards they are charged with enforcing.

The problem isn’t with the records managers; it is with the people who write the standards. The standards are not written for records managers, they are written for academics and technical people (i.e., systems engineers who are experts in XML). Just like the Latin Bible, they are not written in the language of the intended user.

And even when you do think you have a grasp of the fundamentals there are always multiple points to be clarified (as to the exact meaning) with the standards authority.

What about Retention/Disposal schedules?

This should probably be the subject of another paper because retention schedules have also become way too complex, unwieldy and difficult to understand and apply.

The question will be, “How can I do away with my classification system when my retention codes are linked to it?”

I have looked at hundreds of retention schedules and every single one has been way too complicated for the organization trying to use it. Another problem is that very few of the authorities that compile retention schedules do so with computers in mind. This means that we end up with lots of very vague conditional statements that are almost impossible to computerize.

Most retention schedules are written for archivists to read, not for computers to process. This is the heritage of retention schedules; they assumed an appraisal process by a trained and expert archivist.

The Continuum model or ‘Whole of Life’ model or File Plan model all assume we will allocate a retention code at the time the record is created, not during a later appraisal process. This made much more sense and allowed us to better manage the record throughout its life cycle. However, many such schemes also linked the retention code to a classification term or embedded the retention codes within the classification system. This of course made the classification system even more complex and difficult to understand and apply.

To my mind no organization needs more than ten retention codes (shortest period, longest period and eight in between) and three life cycle stages (e.g., active, inactive, destroyed). This is also probably heresy to a lot of the records management profession, but I would ask them to think about the proposition that something that was entirely appropriate to the manual world is not necessarily entirely appropriate to the computerized world. There is an easier and simpler way to manage retention and there is no need to embed retention codes into the classification system, just as there is no need for a classification system in any modern, computerized records management system.

What about File Folders and Archive Boxes?

This is the classic stumbling block. This is when the records manager tells you that all the standards require you to use the same taxonomy for emails and electronic documents that he/she uses for traditional file folders and archive boxes.

You need to explain that the classification from the manual paper handling world is inappropriate to the computerized world, that it is an anachronism. You need to explain that all it will add is complexity, massive cost, confusion and a seriously negative attitude among end-users. You should say it is time to discard techniques and tools from the eighteenth century and adopt techniques from the twenty-first century. You should say you have a much better way. Then you should probably duck and run. Failing all else, blame me and give them my email address.

Tags: records management, document management, Taxonomy, Classification System, Enterprise Content Management

	K1Corp Blog
Comments from K1Corp staff about all things Information Management