Reproducible Research and Software Development Methods for Management tasks
目录
The amount of administrative tasks in today's workplace is steadily 当今工作场所的行政工作量是稳定的 growing, whether you are a manager or just driven by requirements from 成长,无论你是经理还是仅仅被需求驱动 management for more and more reports. The introduction of IT into these 管理越来越多的报告。把它引入这些 processes held the big potential of easing some of the tedious tasks 流程有很大的潜力来简化一些繁琐的任务 through automatization and integration, but it seems that we are still a 通过自动化和集成,但我们似乎仍然是一个 long way away from that goal. Segregated toolchains, lack of compatibility 离那个目标还有很长的路要走。分隔的工具链,缺乏兼容性 of interfaces and different work cultures combined with the increased 界面和不同的工作文化相结合 amount of communication have resulted in an increase of 4 .通讯量的增加导致了 stress and drudgery. Dealing with the fallout often requires substantial manual 压力和乏味。处理放射性沉降物通常需要大量的人工 effort, frequently in a form which is not reusable and sustainable. 努力,通常以不可重复使用和不可持续的形式。
What can we as scientists and software engineers do to improve our 作为科学家和软件工程师,我们能做些什么来提高我们的 situation? I am asking this in the most pragmatic sense and I will 情况?我以最务实的态度提出这个问题,我也会这样做 describe what I found over the last years to work for me. So, this is 描述一下我在过去几年里为我工作的发现。所以,这是 not about demanding that the world suddenly becomes perfect and 不是要求世界突然变得完美 reasonable by magically establishing compatible standards and adopt 合理的通过神奇的建立兼容的标准并采用 harmonized best practices agreed on by everyone - this will not 协调每个人都同意的最佳实践——这是不可能的 happen. I just want to demonstrate that we can just use our knowledge 发生。我只是想证明我们可以利用我们的知识 from science and engineering and combine them smartly with existing 把它们巧妙地与现有的科学和工程结合起来 frameworks. My own background is in physical chemistry with long years 框架。我自己的背景是物理化学,而且学了很多年 working in scientific computing, so management is for me another 在科学计算领域工作,所以管理对我来说是另一个 acquired skill which I had to grow into (and sometimes I still feel 获得的技能,我必须成长(有时我仍然觉得 baffled by some of the concepts...). 被一些概念搞糊涂了……
So let's see what we can do based on the toolchains that we learn in 所以让我们看看我们可以根据我们学到的工具链做些什么 data science and software engineering. I will use GNU Emacs to present 数据科学和软件工程。我将使用GNU Emacs来展示 my current workflow, but the ideas are generic and you may find other solutions. 我目前的工作流程,但想法是通用的,你可以找到其他的解决方案。
If you are just interested into the Emacs workflow using my new package org-listcruncher 如果您对Emacs工作流感兴趣,可以使用我的新包org-listcruncher you can just scroll down to the example section. 您可以向下滚动到示例部分。
1 The problem
1问题
The sad reality is that a lot of management document methodology 可悲的现实是,许多管理文档的方法 comes in the form of Excel and Powerpoint documents. While I think that Excel 以Excel和Powerpoint文档的形式出现。我认为这很好 is a good piece of software for enabling people to obtain results fast and 是一个好的软件,使人们能够获得快速的结果和 in a nicely interactive way, it's work model for complex or 以一种很好的交互方式,它是复杂或的工作模型 long-lived documents is definitely sub-optimal. 长寿命文档肯定不是最优的。
Reasons 原因
- changes to the document are difficult to track in a transparent and comprehensible way.
- *难以以透明和可理解的方式跟踪文件的更改。
Compare this to the established way we track software changes in revision control 将此与我们在修订控制中跟踪软件更改的既定方法进行比较 systems like git. There we not only track each change, but also the motivations for it. 系统,如git。在那里,我们不仅跟踪每个变化,而且跟踪它的动机。
- as the Excel documents gets bigger, it tends to contain many sheets with lots of references.
-当Excel文档变得更大,它往往包含许多表与大量的参考。 Errors easily creep in and are difficult to spot (e.g. numbers outside of formula cell ranges 错误很容易出现,而且很难发现(例如,公式单元格范围之外的数字) which are silently ignored). Compare this to unit tests and integration tests that we use 这些都被忽略了)。将其与我们使用的*单元测试和集成测试*进行比较 continually in software development and scientific programming. 持续从事软件开发和科学编程。
- there is a lot of copy & paste going on, from the Excel sheet into
-有很多的复制和粘贴进行,从Excel表到 a word document used as a report, to a Powerpoint to do a 一个word文档用作报告,到一个Powerpoint做一个 presentation. In data science we also take care to *track the 演示。在数据科学中,我们也注意跟踪 changes to data (data provenance)* and to make sure that a document 更改数据(数据来源)*并确保有文档 references a dataset in a clearly identifyable way. 以明确可识别的方式引用数据集。
2 What we want
我们想要什么
Let's take as an example the planning of a complex budget over several years. 让我们以几年的复杂预算规划为例。
- We would like to have way to put down our budget entries in a form which
-我们希望有办法把我们的预算项目以某种形式记下来 allows adding comments and other meta data to explain the entry. *允许添加注释和其他元数据*来解释条目。
- When we get improved forecast information, we would like to be able and override
-当我们得到改进的预测信息,我们希望能够和覆盖 previous information without deleting it, so that an audit trail is conserved. 以前的信息没有删除它,所以一个*审计跟踪*是守恒的。
- The format in which we write down all of this information should lend itself to
我们写下所有这些信息的格式应该是合适的 automatically generate a data set that can be manipulated with state of the art *自动生成一个数据集*,可以使用最新技术进行操作 tools. 工具。
- The code producing the budget forecasts and plots should be a part of the document,
-编制预算预测和情节的代码应作为文件的一部分, so that we can hand it to somebody else, and she can easily understand how we came 这样我们就可以把它交给别人,这样她就能很容易地理解我们是怎么来的 to the numbers. 的数字。
- The format should be easily manageable by standard revision control systems, so that
-该格式应*易于管理的标准修订控制系统*,以便 we can nicely document the changes over time. 我们可以很好地记录随时间的变化。
- Produce beautiful and readable documents in a mostly automatic way for optimal
- *生成美丽和可读的文件*在大多数自动方式的最佳
readability. 可读性。
- Since our management very often desires Excel or other formats, we
-由于我们的管理层经常要求Excel或其他格式,我们 would also like to *be able and export to these other formats as 还想*能够导出到这些其他格式为 well*... but all from the same source. 好吧*……但都来自同一个源头。
3 Three Musketeers to the Rescue
3三个火枪手来救援
3.1 Reproducible Research
3.1重复性研究
Reproducible research is an important topic in today's IT 可重现性研究是当今IT的一个重要主题 research and it also has a big significance for our society and the 研究它对我们的社会也有很大的意义 ethics of conducting research. It is important to keep the 进行研究的伦理。保持健康是很重要的 underlying data, algorithms, and argumentation, so that others can 基本的数据,算法,和论证,以便其他人可以 follow the analysis of the data and understand the 对数据进行分析并理解 conclusions. Today's interest in SW containers like Docker is apart 结论。今天人们对Docker这样的SW容器的兴趣是不同的 from being a great deployment mechanism also powered by this desire 从一个伟大的部署机制,也由这一愿望 to allow the conservation of computational environments in an 允许计算环境的守恒 easily sharable form. 轻松分享形式。
The concept of Reproducible research is closely related to the 可重复研究的概念与 idea of literate programing that Donald Knuth launched with [[http://www.literateprogramming.com/knuthweb.pdf][his 这是Donald Knuth在http://www.literateprogramming.com/knuthweb.pdf[他的][http://www.literateprogramming.com/knuthweb.pdf]发起的*文字编程*的想法 seminal article]] in 1984. Literate programming environments allow 在1984年。文字编程环境允许 the free mixing of code parts with extensive documentation 代码部分与大量文档的自由混合 parts. The source of a literate programming document contains all 部分。一个有文字的编程文档的来源包含所有 necessary data as well as the code for manipulating the data to 必要的数据以及操作数据的代码 produce results and plots. It also contains all the text and 产生结果和情节。它还包含所有的文本和 argumentation. Out of this source document one then can generate a 论证。从这个源文档中可以生成一个 nice publication, but also other derived information (e.g. a 不错的出版物,但也有其他衍生信息(如a standalone software program and a data set, or an excel sheet to 独立的软件程序和数据集,或excel表格 give to management). 给管理)。
Org mode for Emacs is one such environment, but there are also Org mode for Emacs就是这样一个环境,但是也有这样的环境 others (e.g. look at this list of links maintained by 其他(如查看此链接列表由 reproducibleresearch.net). Another open source environment in rapid reproducibleresearch.net)。另一个快速的开源环境 development is given by Jupyter notebooks. They are very popular 开发是由Jupyter笔记本提供的。它们很受欢迎 among scientists. These frameworks are not only great for 科学家之一。这些框架不仅适用于 publishing research in a reproducible manner, but also are becoming 以一种可复制的方式发表研究,但也正在变得 one of the best teaching aids, since students can easily experiment 最好的教学辅助之一,因为学生可以很容易地进行实验 with self-contained documents. 与独立的文件。
3.2 Software Best Practices
3.2软件最佳实践
Changes of source code should be tracked by *revision control 源代码的更改应该通过*版本控制进行跟踪 systems* (RCS) like git. An optimal tracking is possible, if the 系统* (RCS)像[[https://git-scm.com/] [git]]。最优跟踪是可能的,如果 differences from one version of a document to the next can be 文档的不同版本之间可能存在差异 easily visualized. This usually requires that the document format 轻松可视化。这通常需要文档格式 is in a text based format and not some closed binary format. 是基于文本的格式,而不是一些封闭的二进制格式。 The RCS help us to track changes through time, and we also can RCS帮助我们跟踪时间的变化,我们也可以 add metadata in the form of comments to explain the motivations for the 以注释的形式添加元数据来解释 changes. 的变化。
Humans will make errors, and each time one does something non-trivial, especially *人会犯错误*,特别是每次做一些不平凡的事情 when it is of a repetitive nature, errors easily creep in. Also, doing changes 当它具有重复性时,错误很容易出现。同时,做改变 in one part of a complex thing may cause other parts of our wonderful ivory tower 在一件复杂的事情的一部分可能导致我们美妙的象牙塔的其他部分 computation to collapse. So, in SW engineering the ideas of unit and intergration tests 计算崩溃。因此,在软件工程中,需要考虑单元测试和集成测试 has become an accepted paradigm. 已成为公认的范例。
Versioning. Documents should be versioned in a clearly defined way similar as *版本控制。*文档的版本应该像 to what we do with widely accepted standards like semantic versioning. 到我们使用广泛接受的标准语义版本控制所做的事情。
Templating and Repetition Repetitive parts in a document, e.g. if *模板和重复*文档中重复的部分,例如if for a budget of many services a certain basic text/table/graphics 对于许多服务的预算,一定的基本文本/表格/图形 structure needs to be repeated multiple times, we would like to 结构需要重复多次,我们愿意 generate it by a single code block that we call with different 生成它的一个单一的代码块,我们调用不同的 arguments - and not by copy/paste and manual editing. Senseless 参数-而不是通过复制/粘贴和手动编辑。毫无意义的 repetition or copy/paste is useless human toil and leads to 重复或复制/粘贴是徒劳无功的 errors. later changes to the basic structure should not involve us 错误。以后对基本结构的修改不应该涉及到我们 in making changes in a hundered places. 在混乱的地方做出改变。
3.3 Data Analysis Tools
3.3数据分析工具
There exists a plethora of programming languages and libraries to 存在大量的编程语言和库 deal with data in all of its forms. In particular, Python is a 处理所有形式的数据。特别地,*Python*是一个 language which has found increasing adoption by the scientific 这种语言已被越来越多的科学工作者采用 community over the last decade. It is expressive and provides a 在过去的十年里。它是表达性的,并提供了一个 wealth of libraries and also frameworks for interactive data 丰富的库和交互式数据框架 analysis. Naturally there are other solutions as well, e.g. [[https://www.r-project.org/][the R 分析。当然,也有其他的解决方案,例如https://www.r-project.org/ project]]. 项目]]。
Since I am a Python person, I will recommend the following libraries for tackling 由于我是一个Python人员,所以我将推荐以下库进行处理 the kind of management related problems we are talking about 我们正在讨论的与管理相关的问题
- Pandas: easy-to-use data structures and data analysis tools
- panda:易于使用的数据结构和数据分析工具
- numpy: fundamental package for scientific computing. Used by Pandas
- numpy:科学计算的基本包。使用的熊猫
- matplotlib and seaborn for visualization
- matplotlib和seaborn进行可视化
4 An example workflow using Emacs Org mode and python
4一个使用Emacs Org模式和python的示例工作流
In the following I will demonstrate my own planning workflow that I 在下面我将演示我自己的规划工作流程 built up over the last years using Emacs and its fabulous 在过去的几年里使用Emacs和它的神话 Org-Mode. Org-Mode。
Emacs is a fantastic environment for programmers, since the whole 对于程序员来说,Emacs是一个非常好的环境 editor is an extensible lisp environment with a vibrant community producing editor是一个可扩展的lisp环境,具有活跃的社区生成 new packages. 新包。
4.1 Plan using an outliner
4.1使用outliner进行计划
Outliners are ideal for jotting down thoughts and ideas. Some 外行是记录思想和想法的理想场所。一些 people may prefer mind-maps, but in the end the underlying data 人们可能更喜欢思维导图,但最终还是基础数据 structures are mostly identical. The outliner lists are certainly 结构基本相同。当然,例外的名单 more efficient for a proficient typist and are easily mixed with 更有效率的一个熟练的打字员,很容易混合 other parts of a text. 文本的其他部分。
I have used outlines for years for the initial planning of most tasks. 多年来,我一直使用提纲来初步规划大多数任务。 Actually the approach is now so engrained with me that I use lists as 实际上,这种方法现在对我来说已经根深蒂固了,我使用列表作为 the starting point for almost all documents I write. Their inherent 我写的几乎所有文档的起点。其固有的 simple structure allows to express the main ideas as top level items, and then 简单的结构允许表达的主要思想作为顶层项目,然后 use Sub-items to further refine them or add newer information in a way that 使用子项进一步细化它们或以以下方式添加更新的信息 is transparent. 是透明的。
I also have used this approach for drawing up project and service 我也用这种方法来制定项目和服务 budgets for our scientific computing section over the last years, 过去几年我们科学计算部门的预算, But only now I implemented a better way to automatically derive a 但直到现在我实现了一个更好的方法来自动推出a useful data structure from such a list in a new Emacs package 新Emacs包中此类列表的有用数据结构 org-listcruncher (you can get the package from MELPA). org-listcruncher(您可以从MELPA获取package from MELPA)。
Here is an example of a planning list for a budget of two (non-IT) services. 以下是两个(非it)服务预算的规划列表示例。
My rules for writing such a planning list are 我写这样一个计划清单的规则是
- Each line contains a tag defining wheter the line will become a table row. For this
- 每行包含一个标记,定义该行是否将成为表行。对于这个
example I defined this as the string "item:". Rows without such a tag just serve as 例如,我将其定义为字符串“item:”。没有这样标记的行仅作为 metadata. 元数据。
- A string following the output tag "item:" is taken as the description of the table row.
- 输出标记“item:”后面的字符串作为表行的描述。
- Each line can contain any number of key/value pairs in parentheses in the form
3.每一行都可以在表单的括号中包含任意数量的键/值对
(key1: val1, key2: val2, ...)
(key1: val1, key2: val2,…)
- Lines of lower hierarchical order in the list inherit their default settings for key/values
- 列表中较低层次结构的行继承键/值的默认设置
from the upper items. 从上面的项目。
- The key value of a higher order item can be overwritten by a new new value for the same key
- 高阶项目的键值可以被相同键的新值覆盖
in a lower order line. 在低阶直线上。
Note that org-listcruncher allows the user to supply an own parsing function, so the 注意,org-listcruncher允许用户提供自己的解析函数,因此 rules for how the key/values, the description, and the tagging for a row can get 关于如何获取键/值、描述和行的标记的规则 arbitrarily defined by the end user. The present default function is just an easy example. 由最终用户任意定义。当前的默认函数就是一个简单的例子。
One can easily imagine using the same kind of approach to derive a table of ingredients and their amounts 人们可以很容易地想象用同样的方法得出一个成分表和它们的数量 from the instructions of a cooking recipe that is given as a sequential list of steps. Just needs a bit 从一个烹饪食谱的说明,作为一个连续的步骤列表。只是需要一点 a smarter parsing function. 更智能的解析功能。
4.2 Generate a table from the outline
4.2根据大纲生成表格
We now would like to derive from the planning list a more suitable 现在我们想从中推导出一个更合适的计划表 data format for doing computations. A table is an optimal 用于计算的数据格式。表是最优的 format. The main function of org-listcruncher provides just this 格式。https://github.com/dfeich/org-listcruncher的主函数提供了这个功能 conversion. Based on the semantics defined in the previous section. 转换。基于上一节定义的语义。
The following Org mode code block contains a single line of Emacs
下面的Org模式代码块包含一行Emacs
lisp for calling org-listcruncher's main function and uses the
lisp调用org-listcruncher的主要函数,并使用
above list (which I named lstShipGalante
) to create a table based
上面的列表(我将其命名为=lstShipGalante=)用于创建一个基于表的表
on an "outer join" of all the keys found with the respective values
使用各自的值找到所有键的“外连接”
arranged in columns.
安排在列。
The list from which we generate the table can later be further 稍后,我们可以进一步生成用于生成表的列表 edited, and all derived results will change accordingly when 编辑后,所有导出的结果将随之更改 recalculating the whole document. The audit trail is conserved in 重新计算整个文档。审计跟踪是保守的 the list itself through the comments, and naturally it will also be 名单本身通过评论,自然它也会 conserved in the commits into the revision control system. 保存提交到修订控制系统。
4.3 Data analysis and visualization
4.3数据分析与可视化
Now, as the data is in an easily accessible form, we can use Org mode's Babel feature to 现在,由于数据是易于访问的形式,我们可以使用Org模式的Babel特性 read in the table into a python code block and perform some Pandas data 将表读入python代码块并执行一些panda数据 manipulation magic on it. 上面有操纵魔法。
And another little code block lets us visualize cuts of the data with matplotlib and seaborn. 另一个小代码块让我们可以用matplotlib和seaborn可视化数据切割。
And we can generate the plots for our report. 我们可以为我们的报告生成图表。
4.4 Versioning, templates, testing
4.4版本控制、模板、测试
I will only refer cursorily how to address the other points 我将只简略地提及如何处理其他问题
- Org mode is part of Emacs which offers a number of powerful templating systems. Org itself supports
- Org模式是Emacs的一部分,它提供了许多强大的模板系统。Org本身支持
Macros which can generate text parts. 可以生成文本部分的宏。
- Versioning can easily be combined with org mode documents and integrated in the common ways with
-版本控制可以很容易地与org模式文档相结合,并以通用的方式与之集成 revision control systems like git (e.g. via tags) 像git这样的修订控制系统(例如,通过标签)
- testing: Since org offers native inclusion of code snippets in different programming languages, tests
-测试:由于org提供了不同编程语言的代码片段的本地包含,所以进行测试 can easily be formulated in the avaible test frameworks of these languages. One can also 可以很容易地在这些语言的可用测试框架中表示。一个也可以 define simple tests in form of document code blocks putting out warnings into the document. 以文档代码块的形式定义简单的测试,将警告放入文档中。
6 Publishing
* 6出版
- I submitted to reddit under https://redd.it/8kvvz8
-我在https://redd.it/8kvvz8下提交到reddit
- thanks to jcs from the Irreal blog for reviewing and pointing to my
-感谢来自Irreal博客的jcs的评论和指向我的 article: http://irreal.org/blog/?p=7216 文章:http://irreal.org/blog/?p=7216
- thanks to Sacha Chua for including it in her Emacs news
-感谢Sacha Chua在Emacs新闻中提到这一点 http://sachachua.com/blog/2018/05/2018-05-21-emacs-news/ http://sachachua.com/blog/2018/05/2018-05-21-emacs-news/