File Formats
File Formats FAQs
File formats are an important topic for repositories, so it is worth creating a robust policy at the inception of the repository service. However, file formats are constantly changing, and policies therefore need to be flexible. This section contains good practice guidelines for considering and creating file format policies.
Why are file formats important?
Repositories are living archives. In the support it provides for stored files, a repository must take into account two important functions of the files it holds:
- Access: The files are held so that users can access them. This means that they must be stored in formats that can be used by today's intended audience.
- Preservation: The files are held so that users in 5, 10, 50, or more years can still access them. This means that they must be stored in formats that can be used by future audiences, or in formats that can easily be migrated.
These two considerations are not always complementary. A file format that is good for access today may not be easy to migrate, and a format that is easy to migrate may not be easy to read. Take, for example, a journal article. It could be argued that a good format for easy access today is a PDF file. However, were PDF files to drop out of fashion, it could also be argued that PDF is not an easy format to migrate to a new file format in the future. On the other hand, storing an article in marked-up XML is good for long-term access, as it can be easily converted to a presentation format. It does not, however, in itself make a good format for easy access by end users.
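As a rough illustration of the point about marked-up XML, the sketch below converts a small XML article into an HTML presentation format using only the Python standard library. The element names (article, title, para) are hypothetical and do not correspond to any particular publishing schema; a real repository would more likely use an established schema and a dedicated transformation tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical archival markup: the element names are assumptions for illustration.
ARTICLE_XML = """<article>
  <title>File formats &amp; repositories</title>
  <para>Repositories are living archives.</para>
  <para>Formats change over time.</para>
</article>"""


def article_to_html(xml_text: str) -> str:
    """Render the archival XML as a simple HTML fragment for access."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    # Escape text so that characters such as '&' remain valid in HTML.
    parts = [f"<h1>{escape(title)}</h1>"]
    for para in root.findall("para"):
        parts.append(f"<p>{escape(para.text or '')}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(article_to_html(ARTICLE_XML))
```

The XML itself stays in the archive as the preservation copy; the HTML is a derived access copy that can be regenerated, or replaced by a different presentation format, whenever needed.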
Guidelines
Useful tools
PRONOM - a registry of information about file formats, run by the UK National Archives
DROID - a tool for the batch identification of file formats, also created by the UK National Archives
JHOVE - a tool to assist with the identification and validation of different file formats
PLANETS - an EU-funded project that produces tools to assist with the long-term preservation of digital content
Here are some general guidelines to help in the creation of a file format policy:
- Collection or deposit policy: If the repository only accepts deposits where the files meet strict file format criteria, some users may be put off depositing their content. Ensure that users can deliver files in the preferred format; otherwise there is a risk of not being able to collect files at all. If users are not able to deliver the preferred file formats, consider putting workflows in place that allow the repository administrators to perform the file conversions. This type of approach is unlikely to be appropriate for learning and teaching materials, as these could cover many formats, even within one package. Guidance about file formats and preferred formats could still be offered to depositors.
- Store multiple versions: As mentioned previously, file formats that are good for access today may not be good for access in the future. Consider, therefore, storing multiple versions of the file. If, for example, a file created by a word processing program is deposited, a plain text version and a PDF version could also be included.
- Ensure you know what file formats you hold: If files are stored which are specialised and cannot be converted into more generic formats, ensure that precise details of the file format are archived along with the file. Such metadata might include the name and version number of the software used to create the file. If the software is specialised, archiving a copy of the software along with the item is good practice. If this is done, the same principles apply to the software itself: ensure that a copy of the operating system required to run it also remains accessible.
- Consider your repository in the wider information environment: Any new repository will typically sit amongst many other platforms run by the institution. These may include content management systems, documentation systems, virtual learning environments, video streaming servers and file servers. It is likely that the repository is intended as an archive for materials from many of these systems; however, it may not be the most appropriate place from which people access materials. Consider a video file. It would be prudent to store an uncompressed archival version in the repository, but access to that video today may be better provided by a compressed, streamed version from your video streaming server.
- Offer multiple levels of support: It may only be practical to support a few different file formats. This can be catered for by offering varying levels of support. Designating a few formats as 'supported' means that an attempt will be made, or a guarantee given (depending upon your policy), to migrate these over time to ensure continuing accessibility. Other formats may be designated as 'known', which means they are recognised but are not guaranteed to be migrated over time. The remaining formats may be described as 'unsupported'. In this case the only guarantee is to preserve the file as it was when deposited.
- Preservation policy: A file format policy should form part of a wider preservation policy. The preservation policy may influence your choice of file formats if it dictates a period during which files must remain readable, as opposed to a policy which just recommends 'best endeavours'.
- Data files: If data is to be stored, ensure that the data is adequately described. For example, when storing results from a questionnaire, it may be obvious that a column of data marked 'sex' which contains responses of either 'm' or 'f' relates to 'male' or 'female'. However, a column marked 'socio-economic status' with values 'a', 'b', 'c', 'd', 'e' and 'f' is not so obvious. Ensure a data dictionary, ontology, or coding scheme is stored with the data.
- Plan for change: The file formats in use change over time, as do versions of software. It is important that the file format policy is reviewed and adapted over time so that it stays up to date with current file formats and versions.
- Be practical: Being overly strict about file formats may mean collecting no files, leading to an empty repository! Take a sensible approach that weighs up the costs and benefits of different file formats and the effort required to convert between them.
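The 'supported', 'known' and 'unsupported' levels described in the guidelines can be recorded in a machine-readable form so that deposit workflows can report the commitment attached to each file. The following is a minimal sketch only: the table of extensions and their assigned levels is hypothetical, and looking up a format by file extension is a simplification (a tool such as DROID identifies formats more reliably).

```python
from enum import Enum
from pathlib import PurePath


class SupportLevel(Enum):
    SUPPORTED = "supported"      # migration over time is attempted or guaranteed
    KNOWN = "known"              # recognised, but migration is not guaranteed
    UNSUPPORTED = "unsupported"  # file is preserved as it was when deposited


# Hypothetical policy table: these assignments are illustrative, not recommendations.
POLICY = {
    ".pdf": SupportLevel.SUPPORTED,
    ".xml": SupportLevel.SUPPORTED,
    ".txt": SupportLevel.SUPPORTED,
    ".doc": SupportLevel.KNOWN,
}

COMMITMENTS = {
    SupportLevel.SUPPORTED: "will be migrated over time, subject to policy",
    SupportLevel.KNOWN: "is recognised, but migration is not guaranteed",
    SupportLevel.UNSUPPORTED: "will be preserved exactly as deposited",
}


def support_level(filename: str) -> SupportLevel:
    """Look up the support level by file extension (case-insensitive)."""
    suffix = PurePath(filename).suffix.lower()
    # Anything not listed in the policy falls back to 'unsupported'.
    return POLICY.get(suffix, SupportLevel.UNSUPPORTED)


def describe(filename: str) -> str:
    """Return a short statement of the commitment for a deposited file."""
    level = support_level(filename)
    return f"{filename}: {level.value}; this file {COMMITMENTS[level]}"


if __name__ == "__main__":
    for name in ["thesis.PDF", "notes.doc", "model.xyz"]:
        print(describe(name))
```

Keeping the policy in a single table makes it straightforward to revise as formats change, in line with the 'Plan for change' guideline.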





