Getting Your Data Ready for AI
Editor’s Note: Preparing data is a crucial and unavoidable part of any data scientist’s job. In this post, writer Kate Shoup takes a closer look at the data bottleneck that affects so many projects and at how to address it.
Most people enter the field of data science because “they love the challenge of developing algorithms and building machine learning models that turn previously unusable data into valuable insight,” writes IBM’s Sonali Surange in a 2018 blog post. But these days, Surange notes, “most data scientists are spending up to 80 percent of their time sourcing and preparing data, leaving them very little time to focus on the more complex, interesting and valuable parts of their job.” (There’s that 80% figure again!)
This bottleneck in the data-wrangling phase exists for various reasons. One is the sheer volume of data that companies collect, compounded by the limited means of locating that data later. As organizations “focus on data capture, storage, and processing,” write Limburn and Taylor, they “have too often overlooked concerns such as data findability, classification and governance.” In this scenario, “data goes in, but there’s no safe, reliable or easy way to find out what you’re looking for and get it out again.” Unfortunately, observes Jarmul, the burden of sifting through this so-called data lake often falls on the data science team.
Another reason for the data-wrangling bottleneck is the persistence of data silos. Data silos, writes AI expert Edd Wilder-James in a 2016 article for Harvard Business Review, are “isolated islands of data” that make it “prohibitively costly to extract data and put it to other uses.” Some data silos are the result of software incompatibilities — for example, when data for one department is stored on one system, and data for another department is stored on a different and incompatible system. Reconciling and integrating this data can be costly. Other data silos exist for political reasons. “Knowledge is power,” Wilder-James explains, “and groups within an organization become suspicious of others wanting to use their data.” This sense of proprietorship can undermine the interests of the organization as a whole. Finally, silos might develop because of concerns about data governance. For example, suppose that you have a dataset that might be of value to others in your organization but is sensitive in nature. Unless you know exactly who will use that data and for what, you’re more likely to cordon it off than to open it up to potential misuse.
In addition to prolonging the data-wrangling phase, the existence of data lakes and data silos can severely hamper your ability to locate the best possible data for an AI project. This will likely affect the quality of your model and, by extension, the quality of the broader organizational effort that your project is meant to support. For example, suppose that your company’s broader organizational effort is to improve customer engagement, and as part of that effort it has enlisted you to design a chatbot. “If you’ve built a model to power a chatbot and it’s working against data that’s not as good as the data your competitor is able to use in their chatbot,” says Limburn, “then their chatbot — and their customer engagement — is going to be better.”
Solutions
One way to ease the data-wrangling bottleneck is to try to address it up front. Katharine Jarmul champions this approach. “Suppose you have an application,” she explains, “and you’ve decided that you want to use activity on your application to figure out how to build a useful predictive model later on to predict what the user wants to do next. If you already know you’re going to collect this data, and you already know what you might use it for, you could work with your developers to figure out how you can create transformations as you ingest the data.” Jarmul calls this prescriptive data science, which stands in contrast to the much more common approach: reactionary data science.
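As a rough illustration of the "transform as you ingest" idea Jarmul describes, the sketch below cleans and enriches each activity event at collection time, so the stored record is already close to model-ready. The event fields (`user_id`, `action`, `timestamp`), the `ActivityRecord` schema, and the `transform_on_ingest` function are all hypothetical names chosen for this example; they do not come from the article or from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ActivityRecord:
    """Model-ready form of one user-activity event (hypothetical schema)."""
    user_id: str
    action: str
    hour_of_day: int
    is_weekend: bool


def transform_on_ingest(raw_event: dict) -> ActivityRecord:
    """Clean and enrich a raw event at the moment it is collected.

    Assumes `timestamp` is a Unix epoch in seconds; this is an
    assumption made for the sketch, not something the article states.
    """
    ts = datetime.fromtimestamp(raw_event["timestamp"], tz=timezone.utc)
    return ActivityRecord(
        user_id=str(raw_event["user_id"]).strip(),
        action=raw_event["action"].strip().lower(),
        hour_of_day=ts.hour,
        is_weekend=ts.weekday() >= 5,  # Saturday is 5, Sunday is 6
    )


# A made-up raw event as the application might emit it.
raw = {"user_id": " 42 ", "action": "Add_To_Cart ", "timestamp": 0}
record = transform_on_ingest(raw)
print(record)
```

The design choice here is simply to pay the cleaning cost once, at write time, rather than every time a data scientist later pulls the data for a model.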
Maybe it’s too late in the game for that. In that case, there are any number of data catalogs to help data scientists access and prepare data. A data catalog centralizes information about available data in one location, enabling users to access it in a self-service manner. “A good data catalog,” writes analytics expert Jen Underwood in a 2017 blog post, “serves as a searchable business glossary of data sources and common data definitions gathered from automated data discovery, classification, and cross-data source entity mapping.” According to a 2017 article by Gartner, “demand for data catalogs is soaring as organizations struggle to inventory distributed data assets to facilitate data monetization and conform to regulations.” Examples of data catalogs include the following:
- Microsoft Azure Data Catalog
- Alation Catalog
- Collibra Catalog
- Smart Data Catalog by Waterline
- Watson Knowledge Catalog
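To make the idea of a searchable catalog concrete, here is a toy sketch of one: metadata about each dataset is registered in a single place and can be searched by keyword or tag. It is not how any of the products listed above work; real catalogs populate their entries through automated discovery and classification, whereas this one is filled in by hand. The class names, dataset names, and locations are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset (hypothetical fields)."""
    name: str
    location: str
    description: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """A minimal in-memory catalog keyed by dataset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        """Return entries whose name or description contains the term,
        or that carry a tag equal to it (case-insensitive)."""
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "crm_contacts", "warehouse.sales.contacts",
    "Customer contact details", {"customer", "pii"}))
catalog.register(CatalogEntry(
    "web_clicks", "lake/raw/clicks/",
    "Clickstream events from the website", {"behavior"}))
catalog.register(CatalogEntry(
    "support_tickets", "warehouse.support.tickets",
    "Customer support conversations", {"customer", "text"}))

customer_sets = [e.name for e in catalog.search("customer")]
sensitive_sets = [e.name for e in catalog.search("pii")]
print(customer_sets, sensitive_sets)
```

Tagging sensitive datasets (the `pii` tag here) also hints at how a catalog can address the governance concern raised earlier: the data stays discoverable without being opened up blindly.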
In addition to data catalogs to surface data for AI projects, there are several tools to facilitate other data-science tasks, including connecting to data sources to access data, labeling data, and transforming data. These include the following:
Database query tools: Data scientists use tools such as SQL, Apache Hive, Apache Pig, Apache Drill, and Presto to access and, in some cases, transform data.
Programming languages and software libraries: To access, label, and transform data, data scientists employ tools like R, Python, Spark, Scala, and Pandas.
Notebooks: These programming environments, which include Jupyter, IPython, knitr, RStudio, and R Markdown, also aid data scientists in accessing, labeling, and transforming data.
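The access-then-transform workflow these tools support can be sketched in a few lines. The example below uses SQL through Python's built-in `sqlite3` module as a stand-in for a query engine such as Hive or Presto, then reshapes the result in plain Python where a library such as Pandas would typically be used. The `events` table and its rows are invented for the example.

```python
import sqlite3

# An in-memory database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "view"), ("u1", "purchase"), ("u2", "view"), ("u2", "view")],
)

# Access: aggregate raw events per user with SQL.
# In SQLite, a comparison evaluates to 0 or 1, so SUM counts the matches.
rows = conn.execute(
    "SELECT user_id, COUNT(*) AS n_events, "
    "SUM(action = 'purchase') AS n_purchases "
    "FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

# Transform: turn the query result into per-user features for a model.
features = {
    user: {"n_events": n, "purchase_rate": purchases / n}
    for user, n, purchases in rows
}
print(features)
```

In practice the same two steps would often run inside a notebook, with the SQL pointed at a shared warehouse rather than a local in-memory table.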
Kate Shoup grew up reading under the covers with a flashlight well past her bedtime. Now, Kate does more than just read books — she edits and writes them, too. For more than 20 years Kate has worked as an independent publishing professional. She has written more than 50 books on a mish-mash of topics and edited hundreds more.
Translated from: https://medium.com/oreillymedia/getting-your-data-ready-for-ai-efdbdba6d0cf