BitCurator

Software produced for the BitCurator family of research projects (BitCurator, BitCurator Access, BitCurator NLP).

NLP4ARC Events

The NLP4ARC symposia were held in 2017 and 2018 in conjunction with the BitCurator NLP project. Original event and program details are provided below.

nlp4arc 2017

Event Information

3 February 2017, 9:00am – 5:00pm, Student Union rooms 3206A and 3206B, University of North Carolina, Chapel Hill, North Carolina

Suggested hashtag: #nlp4arc

About the Symposium

The symposium consisted of a number of short talks and unconference-style break-out sessions on the application of natural language processing (NLP) to support use, access, and analysis of digital primary source materials.

A rapidly growing body of materials with significant cultural value is “born digital.” Information professionals must be prepared to extract digital materials from their original environments and media in ways that reflect the rich metadata and ensure the integrity of the materials. They must also support new forms of access: allowing users to make sense of materials and understand their context.

Many types of contextual information can be vital to making sense and meaningful use of digital objects. These can include objects, agents, occurrences, purposes, times, places, forms of expression, concepts/abstractions, and relationships.

There are many existing open-source tools that libraries, archives and museums (LAMs) can use to identify, extract and expose such contextual entities from the wide diversity of born-digital materials that LAMs already hold and continue to receive. NLP tools and methods can help to both (1) facilitate curatorial decision making and description, and (2) generate access points to be presented to end users.

Program

9:00-9:15

Welcome and introduction – Cal Lee

9:15-10:45

Challenges and Opportunities in Applying NLP to Digital Collections

10:45-11:00

Break

11:00-12:30

From Projects to Programs

12:30-1:30

Lunch

1:30-2:00

Kam Woods, University of North Carolina at Chapel Hill – BitCurator NLP Development and Plans

2:00-2:30

Generation of Breakout Topics

2:30-2:45

Break

2:45-3:30

Breakout Sessions

3:30-4:00

Reporting Back from Breakout Sessions

4:00-5:00

Wrap Up and Next Steps

Speaker Biographies

Christopher (Cal) Lee (PI), University of North Carolina at Chapel Hill

Christopher (Cal) Lee is Associate Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill. He teaches archival administration; records management; digital curation; understanding information technology for managing digital collections; and digital forensics. He is a lead organizer and instructor for the DigCCurr Professional Institute, and he teaches professional workshops on the application of digital forensics methods and principles.

Cal’s primary area of research is curation of digital collections. He is particularly interested in the professionalization of this work and the diffusion of existing tools and methods into professional practice. Cal developed “A Framework for Contextual Information in Digital Collections,” and edited and provided several chapters to I, Digital: Personal Collections in the Digital Era published by the Society of American Archivists.

Cal is Principal Investigator of BitCurator Access and was Principal Investigator of BitCurator; both projects have developed and disseminated open-source digital forensics tools for use by archivists and librarians. He was also Principal Investigator of the Digital Acquisition Learning Laboratory (DALL) project and is Senior Personnel on the DataNet Federation Consortium funded by the National Science Foundation. Cal has served as Co-PI on several projects focused on digital curation education: Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum (DigCCurr); DigCCurr II: Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners; Educating Stewards of Public Information for the 21st Century (ESOPI-21); Educating Stewards of the Public Information Infrastructure (ESOPI2); and Closing the Digital Curation Gap (CDCG).

Kam Woods (Co-PI, Technical Lead), University of North Carolina at Chapel Hill

Kam Woods is a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill. His research focuses on long-term preservation of born-digital materials.

Mary Elings, University of California, Berkeley

Mary W. Elings is the Interim Head of Technical Services and Principal Archivist for Digital Collections at The Bancroft Library at the University of California, Berkeley. She leads the acquisitions, cataloging, and processing units and is responsible for all aspects of the digital collections, including digital initiatives and the born digital archives program. Prior to coming to the Bancroft, Ms. Elings worked in museums focusing on art conservation, collection documentation, conservation imaging, information and asset management, and digitization initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York and the School of Library and Information Science, Catholic University, Washington, DC, and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program.

Don Mennerich, New York University

Don Mennerich joined NYU's Digital Library Technology Services (DLTS) in January 2014 as a Digital Archivist, working primarily with forensic tools and their relationship to managing born-digital archives. Prior to working at NYU, he held positions at The New York Public Library, Beinecke Rare Book and Manuscript Library, and Yale University Library. Don holds an MS in Information Systems from Pace University and an MLS from Simmons College. Don is a member of both DLTS and the Archival Collections Management unit.

Daniel Pitti, University of Virginia

Daniel Pitti is Associate Director of the Institute for Advanced Technology in the Humanities at the University of Virginia. Pitti currently serves as the chair/président of the International Council on Archives Experts Group on Archival Description, charged with developing an archival description conceptual model called Records in Contexts (RiC). From 1993 to 2010, Pitti served as the chief technical architect of Encoded Archival Description (EAD), an international standard for encoding archival guides, and Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), an international standard for encoding archival descriptions of persons, organizations, and families. Pitti is project director of the Social Networks and Archival Context (SNAC). At IATH, Pitti collaborates with faculty fellows and other scholars in humanities research projects that employ innovative methods based on computer and network technologies. Among the many humanities projects are the William Blake Archive; the Walt Whitman Archive; Leonardo’s Treatise on Painting; Arapesh Grammar and Digital Language Archive; and Collective Biographies of Women.

Josh Schneider, Stanford University

Josh Schneider is Assistant University Archivist at Stanford University, where he acquires and supports researcher use of Stanford University records, faculty papers, and materials documenting campus and student life. His case study on appraisal of electronic records appeared in the latest volume of the Society of American Archivists’ Trends in Archival Practice series. Josh is also Community Manager for ePADD, an open-source software package that uses named entity recognition and other NLP-driven processes to support the appraisal, processing, discovery, and delivery of email archives. Josh serves on the editorial boards of The American Archivist, Journal of Western Archives, and the blog of SAA’s Electronic Records Section (BloggERS!). He received an MLIS from Simmons College and a BA in Philosophy from Brown University.

Ryan Shaw, University of North Carolina at Chapel Hill

Ryan Shaw received his Ph.D. in 2010 from the University of California, Berkeley School of Information, where he wrote his dissertation on how events and periods function as concepts for organizing historical knowledge. He is also the author of the LODE (Linking Open Descriptions of Events) ontology, recently adopted by the UK Archives Hub for their Linked Data effort. In 2012 he received a three-year Early Career Development grant from the Institute of Museum and Library Services to invent new tools for applying computational text processing techniques to organize large collections of civil rights histories. He is also a co-PI of the Editors’ Notes project, a Mellon Foundation-funded effort to develop open, collaborative notebooks for humanists, and the PeriodO project, an NEH-funded gazetteer of scholarly assertions about the extents of historical, art-historical, and archaeological periods. In the past he has been involved in a number of digital humanities projects through his work with the Electronic Cultural Atlas Initiative. In a previous life, he worked as a software engineer in Tokyo, Japan.

Carl Wilson, Open Preservation Foundation

Carl is the Technical Lead for the Open Preservation Foundation, overseeing all of OPF’s technical activities. He is an experienced software engineer with a focus on software quality through testing. He is an open source enthusiast, both as a user and developer. His professional interest is using virtualisation, automation and continuous delivery techniques to improve the software development process. Carl is leading the development team for veraPDF and is responsible for the software quality, website development and continuous integration / delivery. Before this he was responsible for OPF’s technical contribution to the SCAPE project. Prior to joining OPF Carl worked for The British Library’s Digital Preservation Team on internal and external projects.

Hugh Cayless, Duke University Libraries

Hugh has over a decade of software engineering expertise in both academic and industrial settings. He also holds a Ph.D. in Classics and a Master’s in Information Science. He is one of the founders of the EpiDoc collaborative and currently serves on the Technical Council of the Text Encoding Initiative.

Jeremy Gibson, North Carolina Department of Natural and Cultural Resources

Jeremy Gibson is the Systems Integration Librarian at the State Archives of North Carolina. His responsibilities include supporting the technological initiatives of the State Archives of North Carolina; general systems support; writing purpose-built tools; and managing system-wide projects.

nlp4arc 2018

Event Information

2 February 2018, 9:00am – 5:00pm, Dey Hall, Toy Lounge, University of North Carolina, Chapel Hill, North Carolina

About the Symposium

BitCurator NLP will host “nlp4arc – Enabling New Forms of Access to Primary Sources through Natural Language Processing.” The event will focus on the application of natural language processing (NLP) to support use, access, and analysis of digital primary source materials.

Suggested hashtag: #nlp4arc

Program

9:00-9:15

Welcome and introduction – Cal Lee

9:15-10:30

Foundations and Strategies

10:30-10:45

Break

10:45-12:00

Implementation and Projects

12:00-12:30

Panel on NLP Lessons Learned

12:30-1:30

Lunch

1:30-2:15

Enabling Technologies

2:15-2:45

Generation of Breakout Topics

2:45-3:00

Break

3:00-3:45

Breakout Sessions

3:45-4:15

Reporting Back from Breakout Sessions

4:15-5:00

Wrap Up and Next Steps

Speaker Biographies

Jaime Arguello, University of North Carolina at Chapel Hill

Dr. Jaime Arguello teaches courses and conducts research in the areas of information retrieval, data mining, and machine learning. His main area of research is aggregated search, where the goal is to develop search systems that integrate results from multiple independent sources. Dr. Arguello develops algorithms and evaluation methodologies for deciding which sources to select and how to display them. His most recent research studies how users interact with aggregated search displays and how differences in display affect users’ expectations and behaviors.

Dr. Arguello’s second main area of research focuses on search assistance, where the goal is to develop interactions to help search engine users working on complex tasks. This research aims to understand when and how people employ search assistance. The ultimate goal is to develop systems that automatically provide assistance at the right times and in the appropriate ways.

Dr. Arguello holds a Ph.D. in Computer Science from Carnegie Mellon University. He publishes regularly at information retrieval venues such as SIGIR, ECIR, CIKM, and IIIX. He was a recipient of the SIGIR 2009 Best Paper Award and the ECIR 2011 Best Student Paper Award, and was awarded the NSF CAREER Award in 2015.

Mary Elings, University of California, Berkeley

Mary W. Elings is the Interim Head of Technical Services and Principal Archivist for Digital Collections at The Bancroft Library at the University of California, Berkeley. She leads the acquisitions, cataloging, and processing units and is responsible for all aspects of the digital collections, including digital initiatives and the born digital archives program. Prior to coming to the Bancroft, Ms. Elings worked in museums focusing on art conservation, collection documentation, conservation imaging, information and asset management, and digitization initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York and the School of Library and Information Science, Catholic University, Washington, DC, and is a regular guest-lecturer in the John F. Kennedy University Museum Studies program.

Stephanie Haas, University of North Carolina at Chapel Hill

Stephanie W. Haas is a Professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill, and Program Coordinator for the SILS Master’s of Science in Information Science program. Her research interests focus on the representation of information, and how representations enhance or impede work processes. More specifically, she is interested in natural language processing: what computers do with the language people use. Current and recent projects study these issues in collaboration with researchers from UNC’s Schools of Medicine, Nursing, and Public Health, examining information representation in patient records, and its use for improving patient care. An award-winning teacher, she teaches courses in Applications of Natural Language Processing, Systems Analysis, and Database Design (including an online version).

Christopher (Cal) Lee (PI), University of North Carolina at Chapel Hill

Christopher (Cal) Lee is Associate Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill. He teaches archival administration; records management; digital curation; understanding information technology for managing digital collections; and digital forensics. He is a lead organizer and instructor for the DigCCurr Professional Institute, and he teaches professional workshops on the application of digital forensics methods and principles.

Cal’s primary area of research is curation of digital collections. He is particularly interested in the professionalization of this work and the diffusion of existing tools and methods into professional practice. Cal developed “A Framework for Contextual Information in Digital Collections,” and edited and provided several chapters to I, Digital: Personal Collections in the Digital Era published by the Society of American Archivists.

Cal is Principal Investigator of BitCurator Access and was Principal Investigator of BitCurator; both projects have developed and disseminated open-source digital forensics tools for use by archivists and librarians. He was also Principal Investigator of the Digital Acquisition Learning Laboratory (DALL) project and is Senior Personnel on the DataNet Federation Consortium funded by the National Science Foundation. Cal has served as Co-PI on several projects focused on digital curation education: Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum (DigCCurr); DigCCurr II: Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners; Educating Stewards of Public Information for the 21st Century (ESOPI-21); Educating Stewards of the Public Information Infrastructure (ESOPI2); and Closing the Digital Curation Gap (CDCG).

Mark A. Matienzo, Stanford University

Mark is the Collaboration & Interoperability Architect in Digital Library Systems and Services at the Stanford University Libraries, serving as a technologist, advocate, and facilitator for cross-institutional projects. Prior to joining Stanford, Mark worked as an archivist, technologist, and strategist specializing in born-digital materials and metadata management, at institutions including the Digital Public Library of America, Yale University Library, The New York Public Library, and the American Institute of Physics. Mark received an MSI from the University of Michigan School of Information and a BA in Philosophy from the College of Wooster, and was a recipient of the Emerging Leader Award from the Society of American Archivists in 2012.

Laney McGlohon, ArchivesSpace

Laney is an information scientist, software developer, librarian, and self-described data wrangler with seven years of experience working with special collections and institutional archivists. Most recently, Laney served as the Discovery and Access Engineer at Stanford University Libraries. Before that she served as a Software Engineer at the Getty Research Institute and as Technology Consultant at the Museum of Ventura County. She has also worked at the University of Georgia and Raytheon Systems Corporation. Laney earned a Bachelor’s of Science in Mathematical Sciences from University of North Carolina, Chapel Hill, a Master’s of Science in Applied Mathematics from North Carolina State University and her Master’s in Library Science from University of North Carolina, Chapel Hill.

Michael Piotrowski, Leibniz Institute of European History

Michael Piotrowski is professor of digital humanities at the University of Lausanne in Switzerland. His main research interests are knowledge representation and formal modeling in the humanities, and document engineering. He is the author of the first textbook on NLP for historical texts (published in 2012 by Morgan & Claypool). Before joining the University of Lausanne in February 2017, he set up and headed the Digital Humanities research group at the Leibniz Institute of European History in Mainz, Germany. His PhD in Computer Science is from Otto von Guericke University Magdeburg, Germany (2009); his MA in Computational Linguistics, English Philology, and Applied Linguistics is from Friedrich Alexander University Erlangen-Nuremberg, Germany (1998).

Daniel Pitti, University of Virginia

Daniel Pitti is Associate Director of the Institute for Advanced Technology in the Humanities at the University of Virginia. Pitti currently serves as the chair/président of the International Council on Archives Experts Group on Archival Description, charged with developing an archival description conceptual model called Records in Contexts (RiC). From 1993 to 2010, Pitti served as the chief technical architect of Encoded Archival Description (EAD), an international standard for encoding archival guides, and Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), an international standard for encoding archival descriptions of persons, organizations, and families. Pitti is project director of the Social Networks and Archival Context (SNAC). At IATH, Pitti collaborates with faculty fellows and other scholars in humanities research projects that employ innovative methods based on computer and network technologies. Among the many humanities projects are the William Blake Archive; the Walt Whitman Archive; Leonardo’s Treatise on Painting; Arapesh Grammar and Digital Language Archive; and Collective Biographies of Women.

Ryan Shaw, University of North Carolina at Chapel Hill

Ryan Shaw received his Ph.D. in 2010 from the University of California, Berkeley School of Information, where he wrote his dissertation on how events and periods function as concepts for organizing historical knowledge. He is also the author of the LODE (Linking Open Descriptions of Events) ontology, recently adopted by the UK Archives Hub for their Linked Data effort. In 2012 he received a three-year Early Career Development grant from the Institute of Museum and Library Services to invent new tools for applying computational text processing techniques to organize large collections of civil rights histories. He is also a co-PI of the Editors’ Notes project, a Mellon Foundation-funded effort to develop open, collaborative notebooks for humanists, and the PeriodO project, an NEH-funded gazetteer of scholarly assertions about the extents of historical, art-historical, and archaeological periods. In the past he has been involved in a number of digital humanities projects through his work with the Electronic Cultural Atlas Initiative. In a previous life, he worked as a software engineer in Tokyo, Japan.

Stéfan Sinclair, McGill University

Stéfan Sinclair is an Associate Professor of Digital Humanities at McGill University. His primary area of research is in the design, development, usage and theorization of tools for the digital humanities, especially for text analysis and visualization. He has led or contributed significantly to projects such as Voyant Tools, the Text Analysis Portal for Research (TAPoR), the MONK Project, the Simulated Environment for Theatre, the Mandala Browser, and BonPatron. In addition to his work developing sophisticated scholarly tools, Sinclair has numerous publications related to research and teaching in the Digital Humanities, including Visual Interface Design for Digital Cultural Heritage, co-authored with Stan Ruecker and Milena Radzikowska (Ashgate 2011).

Other professional activities include serving as President of the Association for Computers and the Humanities (ACH), on executive committees of the Canadian Society for Digital Humanities / Société pour l’étude de médias interactifs (CSDH/SCHN), the Alliance of Digital Humanities Associations (ADHO) and centerNET, and as an editor of Digital Humanities Quarterly. Prior to moving to McGill University, Sinclair was Associate Professor in the Department of Communication Studies and Multimedia at McMaster University from 2004 to 2011, where he was also Director of the Sherman Centre for Digital Scholarship. Before joining McMaster University, he was at the University of Alberta, where he was co-responsible for the creation and development of the M.A. in Humanities Computing program from 2001 to 2004. His Ph.D. in French Literature is from Queen’s University (2000), his M.A. in French literature is from the University of Victoria (1995), and his honors B.A. in French is from the University of British Columbia (1994).

Carl Wilson, Open Preservation Foundation

Carl is the Technical Lead for the Open Preservation Foundation, overseeing all of OPF’s technical activities. He is an experienced software engineer with a focus on software quality through testing. He is an open source enthusiast, both as a user and developer. His professional interest is using virtualisation, automation and continuous delivery techniques to improve the software development process. Carl is leading the development team for veraPDF and is responsible for the software quality, website development and continuous integration / delivery. Before this he was responsible for OPF’s technical contribution to the SCAPE project. Prior to joining OPF Carl worked for The British Library’s Digital Preservation Team on internal and external projects.

Kam Woods (Co-PI, Technical Lead), University of North Carolina at Chapel Hill

Kam Woods is a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill. His research focuses on long-term preservation of born-digital materials.