Frequently Asked Questions
What is the Archiving Program?
The State Government Website and Social Media Archive is a joint project of the State Library of North Carolina and the State Archives of North Carolina. It captures selected state government websites and social media accounts: those suggested by their agencies for archiving and vetted by the State Archives’ Records Analysts as having historically significant information not available through other means.
The North Carolina State Government Web Archives uses the Internet Archive tool Archive-It to capture websites. Archive-It performs web crawls on websites to capture text, images, the structure and organization of data, and associated files (Word documents, PDFs, etc.) when possible. Archive-It can be used to crawl both traditional websites and social media. Several factors can interfere with Archive-It's ability to crawl a website, including website logins (as with some social media platforms) and data stored in databases. Archive-It can only access publicly displayed information, meaning it cannot capture private messages available to users of some social media. Web crawls are initiated at regular intervals by a member of the Web and Social Media Archive Committee. Crawls are typically initiated once every two months or, in the case of infrequently updated websites, once a year. Crawls are regularly vetted for quality control by members of the Web and Social Media Archive Committee.
Archive-It captures all embedded elements on a seed site page (including images, style sheets, JavaScript, PDFs, and so on) for up to 100 hops from the original seed page within the same host domain. Archive-It does not capture links to other sites or subdomains (such as axaem.archives.ncdcr.gov) unless the subdomain is entered as a separate seed or the primary seed has been entered in a very specific way.
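As a simplified illustration of this scoping behavior (a sketch under assumed names and limits, not Archive-It's actual implementation), the example below shows how a crawler might decide whether a discovered link is in scope: it stays on the seed's host and stops following links after a fixed number of hops. The seed URL, subdomain, and hop limit are taken from the description above for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical seed and hop limit; Archive-It's real scoping rules are more nuanced.
SEED = "https://archives.ncdcr.gov/"
MAX_HOPS = 100

def in_scope(url: str, hops_from_seed: int) -> bool:
    """Return True if a discovered link would be captured under this simplified model."""
    seed_host = urlparse(SEED).hostname
    link_host = urlparse(url).hostname
    # Links on a different host or subdomain (e.g., axaem.archives.ncdcr.gov)
    # are out of scope unless that subdomain is added as its own seed.
    if link_host != seed_host:
        return False
    # Embedded content and links are only followed up to the hop limit.
    return hops_from_seed <= MAX_HOPS

print(in_scope("https://archives.ncdcr.gov/researchers", 3))     # True: same host, within hops
print(in_scope("https://axaem.archives.ncdcr.gov/record/1", 3))  # False: different subdomain
```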
Several factors can limit or expand the amount of data captured for each seed, depending on how often the site is updated and how much data there is to crawl:
- We crawl most sites every 2 months; sites for boards and commissions are crawled only once a year. Data that has been added and removed within the window between crawls will not be captured.
- Our crawls are set to expire after 7 days, so depending on the rate of data capture, there may be data missing from a completed crawl.
- Robots.txt exclusions and other code built into websites can limit what Archive-It captures (see the generic robots.txt illustration after this list). For instance, as of December 2023, the crawl data from X (Twitter) and Facebook seeds is unusable because of log-in requirements and other coding barriers, and these issues change as fast as the technology does.
- Crawler traps can create link loops that expand the data collected from a site by duplicating the same links infinitely (or until the crawl hits 100 hops or 7 days).
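The robots.txt mechanism mentioned above works the same way for any well-behaved crawler: a site publishes rules stating which paths automated agents may fetch, and pages excluded by those rules never make it into the archived copy. The snippet below is a generic illustration using Python's standard library; the rules, user agent string, and URLs are made up for the example and are not taken from any actual state site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = """
User-agent: *
Disallow: /calendar/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A crawler that honors robots.txt skips disallowed paths entirely,
# so they never appear in the archived copy of the site.
print(parser.can_fetch("example-crawler", "https://example.nc.gov/calendar/2023/12"))  # False
print(parser.can_fetch("example-crawler", "https://example.nc.gov/contact"))           # True
```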
Every seed receives quality control once a year from the web and social media archiving team, which allows the team to run patch crawls on any missing pages. Because most seeds are crawled every two months, that means only about 1 in 6 crawls for each seed has been checked for completeness.
For all these reasons, the data captured by Archive-It is considered only a snapshot of the website at a given time and cannot be relied upon as a perfect copy of the web content of a seed. Thus, even if other record types are captured via the Archive-It web crawls (for instance, meeting minutes loaded as PDFs on an agency website), the Archive-It copy should never be considered the record copy.
By contrast, ArchiveSocial captures social media content in real time. It relies on APIs on the backend of social media platforms to download the content of each data component of the account, including posts, messages, events, and so on (a simplified sketch of this API-based capture pattern follows the platform list below). ArchiveSocial is limited by the availability of APIs (for instance, Threads does not have an API as of December 2023) and by our account level, which only allows 125 accounts to be captured at a time. (As of December 2023, we have 117 historical accounts and 115 active accounts in ArchiveSocial, comprising 8,686,071 records at an average rate of 47,817 records per month.) The platforms supported as of December 2023 are:
- Flickr
- Facebook Groups
- Facebook Pages
- Instagram Business
- Instagram Personal
- LinkedIn Company
- LinkedIn Personal
- Google+
- TikTok
- X (Twitter)
- Vimeo
- YouTube
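To make the API-based capture pattern above concrete, the sketch below shows the general shape of such a capture: call a platform's API with an authorized token, page through a content type, and store each item as a record. Every name here (the endpoint, token, parameters, and field names) is a hypothetical placeholder; this is not ArchiveSocial's actual code or any platform's real API.

```python
import requests

# Hypothetical endpoint and token; real platform APIs differ in URLs,
# authentication, pagination, and field names.
API_URL = "https://api.example-platform.com/v1/accounts/example-agency/posts"
TOKEN = "example-oauth-token"

def capture_posts():
    """Page through an account's posts and yield each one for storage as a record."""
    params = {"limit": 100}
    while True:
        resp = requests.get(
            API_URL,
            params=params,
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        for post in data.get("items", []):
            yield post                 # each post becomes one archived record
        cursor = data.get("next_cursor")
        if not cursor:                 # no more pages: the capture is complete
            break
        params["cursor"] = cursor
```

The design choice to capture through the platform's API rather than by crawling rendered pages is what allows content to be collected in real time instead of on a bi-monthly crawl schedule.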
The State Government Website and Social Media Archive is a joint venture between the State Archives of North Carolina (SANC) and the State Library of North Carolina (SLNC), meaning that both organizations are involved in the implementation and management of web and social media archiving. We work together to run crawls of the requested websites and social media accounts, capture the data in Archive-It, and create a record of this digital information.
State Archives of North Carolina Members:
- Digital Services Section Head
- Digital Archivist
- Systems Integration Librarian
- Information Management Archivist
- Digital Description Archivist
State Library of North Carolina Members:
- Digital Projects Librarian
- State Publications Clearinghouse Liaison
- Systems Support Librarian
What we do:
- We crawl websites and social media accounts using Archive-It and ArchiveSocial to capture a record of each agency’s online business.
- We add URLs as seeds to Archive-It, and potentially to ArchiveSocial (depending on appraisal value), when an agency requests archiving through its records analyst.
- We inactivate seeds for sites that are no longer in operation and no longer need to be crawled. When a seed is marked inactive, the platform retains the previously captured information as a historical record and stops scheduling crawls of that URL.
- We divide the results of the bi-monthly and yearly crawls among team members to perform quality control on the captured information. If any issues arise during quality control, we discuss possible solutions as a group and submit tickets to Archive-It for further investigation.
What we don’t do:
- We do not appraise websites and social media accounts for archival content. This duty falls to the agency’s coordinating records analyst within the Records Analysis Unit of the State Archives, the Records Description Unit supervisor, and the appraisal archivist.
- We do not approach agencies to archive their sites. Agencies should speak with their coordinating analyst to determine if their websites or social media accounts can and should be archived.
- We do not capture every social media account. As previously mentioned, we are limited to what the tools can capture.
For state agencies, websites and social media records are scheduled in the Functional Schedule for North Carolina State Agencies in records retention schedule “15. Public Relations,” available at https://archives.ncdcr.gov/public-relations.
RC No. 1515 Social Media and Websites identifies three possible retentions for state agency web and social media content:
- “Social media sites and other websites that have historical content” are scheduled as permanent records. An appraisal step is required to determine whether the State Archives will collect the content or the agency will be responsible for maintaining its social media and website records in office.
- Routine social media records have a 5-year retention.
- Records produced during planning and executing social media activities may be destroyed once they are superseded or obsolete.
The Government Records Section has developed, and is in the process of expanding, a set of social media appraisal criteria and an accompanying workflow to identify social media of enduring value. This is particularly important for social media accounts that cannot be captured via Archive-It, such as X (Twitter) and Facebook, because our account with ArchiveSocial, the platform we use for social media capture, has limited storage space. In general, the criteria follow the guidance provided in the “Historical Value” section of the Functional Schedule “Overview” document.
A list of records analysts and the agencies they work with is available on the State Archives of North Carolina website.