BitCurator NLP Project

The BitCurator NLP project began on October 1, 2016 and ended on December 31, 2018. BitCurator NLP was supported by a grant from The Andrew W. Mellon Foundation (grant 31600663).

BitCurator NLP project personnel developed software for collecting institutions to extract, analyze, and produce reports on features of interest in text extracted from born-digital materials contained in collections.

The software uses existing natural language processing libraries to identify and report on items likely to be relevant to ongoing preservation, information organization, and access activities. These may include entities (e.g. persons, places, and organizations), potential relationships among entities (for example, by describing those entities that appear together within a document or set of documents), and topic models that provide insight into how concepts are naturally clustered within the documents.

Visit the BitCurator NLP wiki page for technical content, documentation, and software downloads.

People

The personnel and advisory group bios below cover involvement between October 2016 and September 2018. Affiliations and positions listed reflect status of the involved personnel and advisors at the beginning of the project.

Project Personnel

Christopher (Cal) Lee (PI), University of North Carolina at Chapel Hill (Additional Info)

Christopher (Cal) Lee is a Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill. He teaches archival administration; records management; digital curation; understanding information technology for managing digital collections; and digital forensics. He is a lead organizer and instructor for the DigCCurr Professional Institute, and he teaches professional workshops on the application of digital forensics methods and principles.

Cal’s primary area of research is curation of digital collections. He is particularly interested in the professionalization of this work and the diffusion of existing tools and methods into professional practice. Cal developed “A Framework for Contextual Information in Digital Collections,” and edited and contributed several chapters to I, Digital: Personal Collections in the Digital Era, published by the Society of American Archivists.

Cal is Principal Investigator of BitCurator Access and was Principal Investigator of BitCurator; both projects have developed and disseminated open-source digital forensics tools for use by archivists and librarians. He was also Principal Investigator of the Digital Acquisition Learning Laboratory (DALL) project and is Senior Personnel on the DataNet Federation Consortium funded by the National Science Foundation. Cal has served as Co-PI on several projects focused on digital curation education: Preserving Access to Our Digital Future: Building an International Digital Curation Curriculum (DigCCurr); DigCCurr II: Extending an International Digital Curation Curriculum to Doctoral Students and Practitioners; Educating Stewards of Public Information for the 21st Century (ESOPI-21); Educating Stewards of the Public Information Infrastructure (ESOPI2); and Closing the Digital Curation Gap (CDCG).

Kam Woods (Co-PI, Technical Lead), University of North Carolina at Chapel Hill (Additional Info)

Kam Woods is a Research Scientist in the School of Information and Library Science at the University of North Carolina at Chapel Hill.

Sunitha Misra (Software Developer), University of North Carolina at Chapel Hill (Additional Info)

Sunitha Misra is a Software Developer for the BitCurator NLP project in the School of Information and Library Science at the University of North Carolina at Chapel Hill. She holds a Master’s in Information Science from UNC SILS and an MS in Computer Science from the University of Alabama in Huntsville. Previously, she worked as a Software Developer for major networking and operating systems companies in the San Francisco Bay Area and in Research Triangle Park.

Jacob Hill (Project Manager), University of North Carolina at Chapel Hill (Additional Info)

Jacob Hill is the Project Manager for the BitCurator NLP project in the School of Information and Library Science at the University of North Carolina at Chapel Hill. He holds a BA in History from the University of Nevada, Reno and an MSIS from North Carolina Central University. His research interests include knowledge organization, digital humanities, Baha’i studies, and Arabic & Persian manuscripts.

Advisory Group

Mary Elings, University of California, Berkeley

Mary W. Elings is the Interim Head of Technical Services and Principal Archivist for Digital Collections at The Bancroft Library at the University of California, Berkeley. She leads the acquisitions, cataloging, and processing units and is responsible for all aspects of the digital collections, including digital initiatives and the born-digital archives program. Prior to coming to the Bancroft, Ms. Elings worked in museums focusing on art conservation, collection documentation, conservation imaging, information and asset management, and digitization initiatives. Her current work concentrates on issues surrounding born-digital materials, supporting digital humanities and digital social sciences, and research data management. Ms. Elings has taught as an adjunct professor in the School of Information Studies at Syracuse University, New York and the School of Library and Information Science, Catholic University, Washington, DC, and is a regular guest lecturer in the John F. Kennedy University Museum Studies program.

Mark A. Matienzo, Stanford University

Mark is the Collaboration & Interoperability Architect in Digital Library Systems and Services at the Stanford University Libraries, serving as a technologist, advocate, and facilitator for cross-institutional projects. Prior to joining Stanford, Mark worked as an archivist, technologist, and strategist specializing in born-digital materials and metadata management at institutions including the Digital Public Library of America, Yale University Library, The New York Public Library, and the American Institute of Physics. Mark received an MSI from the University of Michigan School of Information and a BA in Philosophy from the College of Wooster, and was a recipient of the Emerging Leader Award from the Society of American Archivists in 2012.

Laney McGlohon, ArchivesSpace

Laney is an information scientist, software developer, librarian, and self-described data wrangler with seven years of experience working with special collections and institutional archivists. Most recently, Laney served as the Discovery and Access Engineer at Stanford University Libraries. Before that she served as a Software Engineer at the Getty Research Institute and as Technology Consultant at the Museum of Ventura County. She has also worked at the University of Georgia and Raytheon Systems Corporation. Laney earned a Bachelor of Science in Mathematical Sciences from the University of North Carolina, Chapel Hill, a Master of Science in Applied Mathematics from North Carolina State University, and a Master’s in Library Science from the University of North Carolina, Chapel Hill.

Don Mennerich, New York University

Don Mennerich joined Digital Library Technology Services (DLTS) at New York University in January 2014 as a Digital Archivist, working primarily with forensic tools and their relationship to managing born-digital archives. Prior to working at NYU, he held positions at The New York Public Library, Beinecke Rare Book and Manuscript Library, and Yale University Library. Don holds an MS in Information Systems from Pace University and an MLS from Simmons College. Don is a member of both DLTS and the Archival Collections Management unit.

Michael Piotrowski, Leibniz Institute of European History

Michael Piotrowski is professor of digital humanities at the University of Lausanne in Switzerland. His main research interests are knowledge representation and formal modeling in the humanities, and document engineering. He is the author of the first textbook on NLP for historical texts (published in 2012 by Morgan & Claypool). Before joining the University of Lausanne in February 2017, he set up and headed the Digital Humanities research group at the Leibniz Institute of European History in Mainz, Germany. His PhD in Computer Science is from Otto von Guericke University Magdeburg, Germany (2009); his MA in Computational Linguistics, English Philology, and Applied Linguistics is from Friedrich Alexander University Erlangen-Nuremberg, Germany (1998).

Daniel Pitti, University of Virginia

Daniel Pitti is Associate Director of the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia. Pitti currently serves as the chair/président of the International Council on Archives Experts Group on Archival Description, charged with developing an archival description conceptual model called Records in Contexts (RiC). From 1993 to 2010, Pitti served as the chief technical architect of Encoded Archival Description (EAD), an international standard for encoding archival guides, and Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), an international standard for encoding archival descriptions of persons, organizations, and families. Pitti is project director of the Social Networks and Archival Context (SNAC) project. At IATH, Pitti collaborates with faculty fellows and other scholars in humanities research projects that employ innovative methods based on computer and network technologies. Among the many humanities projects are the William Blake Archive; the Walt Whitman Archive; Leonardo’s Treatise on Painting; Arapesh Grammar and Digital Language Archive; and Collective Biographies of Women.

Josh Schneider, Stanford University

Josh Schneider is Assistant University Archivist at Stanford University, where he acquires and supports researcher use of Stanford University records, faculty papers, and materials documenting campus and student life. His case study on appraisal of electronic records appeared in the latest volume of the Society of American Archivists’ Trends in Archival Practice series. Josh is also Community Manager for ePADD, an open-source software package that uses named entity recognition and other NLP-driven processes to support the appraisal, processing, discovery, and delivery of email archives. Josh serves on the editorial boards of The American Archivist, Journal of Western Archives, and the blog of SAA’s Electronic Records Section (BloggERS!). He received an MLIS from Simmons College and a BA in Philosophy from Brown University.

Ryan Shaw, University of North Carolina at Chapel Hill

Ryan Shaw received his Ph.D. in 2010 from the University of California, Berkeley School of Information, where he wrote his dissertation on how events and periods function as concepts for organizing historical knowledge. He is also the author of the LODE (Linking Open Descriptions of Events) ontology, recently adopted by the UK Archives Hub for their Linked Data effort. In 2012 he received a three-year Early Career Development grant from the Institute of Museum and Library Services to invent new tools for applying computational text processing techniques to organize large collections of civil rights histories. He is also a co-PI of the Editors’ Notes project, a Mellon Foundation-funded effort to develop open, collaborative notebooks for humanists, and the PeriodO project, an NEH-funded gazetteer of scholarly assertions about the extents of historical, art-historical, and archaeological periods. In the past he has been involved in a number of digital humanities projects through his work with the Electronic Cultural Atlas Initiative. In a previous life, he worked as a software engineer in Tokyo, Japan.

Stéfan Sinclair, McGill University

Stéfan Sinclair is an Associate Professor of Digital Humanities at McGill University. His primary area of research is in the design, development, usage and theorization of tools for the digital humanities, especially for text analysis and visualization. He has led or contributed significantly to projects such as Voyant Tools, the Text Analysis Portal for Research (TAPoR), the MONK Project, the Simulated Environment for Theatre, the Mandala Browser, and BonPatron. In addition to his work developing sophisticated scholarly tools, Sinclair has numerous publications related to research and teaching in the Digital Humanities, including Visual Interface Design for Digital Cultural Heritage, co-authored with Stan Ruecker and Milena Radzikowska (Ashgate 2011).

Other professional activities include serving as President of the Association for Computers and the Humanities (ACH), on executive committees of the Canadian Society for Digital Humanities / Société pour l’étude de médias interactifs (CSDH/SCHN), the Alliance of Digital Humanities Associations (ADHO) and centerNET, and as an editor of Digital Humanities Quarterly. Prior to moving to McGill University, Sinclair was Associate Professor in the Department of Communication Studies and Multimedia at McMaster University from 2004 to 2011, where he was also Director of the Sherman Centre for Digital Scholarship. Before joining McMaster University, he was at the University of Alberta, where he was co-responsible for the creation and development of the M.A. in Humanities Computing program from 2001 to 2004. His Ph.D. in French Literature is from Queen’s University (2000), his M.A. in French literature is from the University of Victoria (1995), and his honors B.A. in French is from the University of British Columbia (1994).

Carl Wilson, Open Preservation Foundation

Carl is the Technical Lead for the Open Preservation Foundation (OPF), overseeing all of OPF’s technical activities. He is an experienced software engineer with a focus on software quality through testing. He is an open source enthusiast, both as a user and developer. His professional interest is using virtualisation, automation and continuous delivery techniques to improve the software development process. Carl leads the development team for veraPDF and is responsible for its software quality, website development and continuous integration / delivery. Before this he was responsible for OPF’s technical contribution to the SCAPE project. Prior to joining OPF, Carl worked for The British Library’s Digital Preservation Team on internal and external projects.