METHOD AND SYSTEM FOR MANAGING WORKFLOWS FOR AUTHORING DATA DOCUMENTS
20230342383 · 2023-10-26
Abstract
A method and system for managing workflows receives a text string being typed within a data document and executes a connection engine that performs natural language processing (NLP) to extract words and phrases having keywords corresponding to data operations and to parse the text string into nested nodes comprising sub-phrases of arguments and keywords. The arguments and keywords are assembled into one or more complete data operations, which are executed to return matching results from within a dataset as dependent phrase candidates to complete the text string. The writer selects a candidate from the dependent phrase candidates, in response to which the connection engine creates a persistent text-data connection between the selected candidate and the dataset. This persistent text-data connection automatically updates the selected candidate when one or more of the dataset, the arguments, and the keywords are modified.
Claims
1. A method for managing workflows for authoring data documents, wherein one or more dataset is retrieved from a data source, the method comprising: using a computing device to: receive a text string within a data document being generated by at least one writer; execute a connection engine configured to perform natural language processing (NLP) to: extract from within the text string words and phrases having keywords corresponding to data operations within a predefined operation dictionary; parse the text string into a plurality of nested nodes comprising sub-phrases comprising independent data phrases and keywords; assemble the independent data phrases and data operations in one or more node of the plurality of nested nodes into one or more complete data operation; and execute the one or more complete data operation and return matching results from the one or more dataset as one or more dependent phrase candidate to complete the text string; prompt the at least one writer to select a selected candidate from the one or more dependent phrase candidates; and create a persistent text-data connection between the selected candidate and the one or more dataset; wherein the persistent text-data connection is configured to automatically update the selected candidate when one or a combination of the one or more dataset, the independent data phrases, and the keywords is modified by the writer.
2. The method of claim 1, wherein the data operations comprise one or a combination of Retrieve Value, Filter, Find Extremum, Compute Derived Value, Determine Range, Find Anomalies, and Compare.
3. The method of claim 1, wherein the data operations have arguments comprising one or more independent data phrases or an output of another data operation.
4. The method of claim 1, wherein the one or more dataset comprises a table, wherein the independent data phrases and the output are a row, a column, or a value in the table.
5. The method of claim 4, wherein the connection engine is further configured to update the table to add a new row or a new column in response to computation of a dependent phrase.
6. The method of claim 4, wherein the table is embedded within the data document.
7. The method of claim 1, wherein a dependent data phrase comprises an output of one or more computation by the data operations, the output comprising a derived value that does not exist in the dataset.
8. The method of claim 1, wherein the one or more dataset comprises a chart embedded within the data document.
9. The method of claim 1, wherein the step of parsing the text string uses a context-free grammar, wherein a structure of the plurality of nested nodes is independent of a context of the text string.
10. The method of claim 1, wherein the connection engine is further configured to generate potential independent phrases within an incomplete text string by performing string matching with all strings in the dataset and synonym matching with all attribute names in the dataset.
11. A computer system, comprising: a computing device; memory configured to store program instructions, wherein, when executed by the computing device, the program instructions cause the computer system to perform one or more operations comprising: receiving a text string within a data document being generated by at least one writer; executing a connection engine configured to perform natural language processing (NLP) to: extract from within the text string words and phrases having keywords corresponding to data operations within a predefined operation dictionary; parse the text string into a plurality of nested nodes comprising sub-phrases comprising independent data phrases and keywords; assemble the independent data phrases and data operations in one or more node of the plurality of nested nodes into one or more complete data operation; and execute the one or more complete data operation and return matching results from the one or more dataset as one or more dependent phrase candidate to complete the text string; prompting the at least one writer to select a selected candidate from the one or more dependent phrase candidates; and creating a persistent text-data connection between the selected candidate and the one or more dataset; wherein the persistent text-data connection is configured to automatically update the selected candidate when one or a combination of the one or more dataset, the independent data phrases, and the keywords is modified by the writer.
12. The computer system of claim 11, wherein the data operations comprise one or a combination of Retrieve Value, Filter, Find Extremum, Compute Derived Value, Determine Range, Find Anomalies, and Compare.
13. The computer system of claim 11, wherein the data operations have arguments comprising one or more independent data phrases or an output of another data operation.
14. The computer system of claim 11, wherein the one or more dataset comprises a table, wherein the independent data phrases and the output are a row, a column, or a value in the table.
15. The computer system of claim 14, wherein the connection engine is further configured to update the table to add a new row or a new column in response to computation of a dependent phrase.
16. The computer system of claim 14, wherein the table is embedded within the data document.
17. The computer system of claim 11, wherein a dependent data phrase comprises an output of one or more computation by the data operations, the output comprising a derived value that does not exist in the dataset.
18. The computer system of claim 11, wherein the one or more dataset comprises a chart embedded within the data document.
19. The computer system of claim 11, wherein the step of parsing the text string uses a context-free grammar, wherein a structure of the plurality of nested nodes is independent of a context of the text string.
20. The computer system of claim 11, wherein the connection engine is further configured to generate potential independent phrases within an incomplete text string by performing string matching with all strings in the dataset and synonym matching with all attribute names in the dataset.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
DETAILED DESCRIPTION OF EMBODIMENTS
[0035] As used herein, “document” means a text-containing work of authorship that is generated by a person (a “writer”) using a word processing or writing application. “Document” includes data documents which employ text, tables, and visualizations to report findings from data analyses and present data-rich narratives within the document. The document may be a report, manuscript, thesis, presentation materials, and other text-containing writings. By way of example but not limitation, the documents may be created by programs such as Microsoft® Word®, Microsoft® PowerPoint®, Apple® Pages®, Corel® WordPerfect®, Google® Docs®, and others.
[0036] As used herein, “writer” means one or more person who uses a software-based document creation tool to create or generate a document, i.e., a work of authorship. The terms “writer”, “author”, and “user” may be used interchangeably for the person. More than one person may be the writer of a given document in a collaboration. A “writer” may also include a person who is reviewing, editing, and/or revising a document.
[0037] The inventive approach identifies and leverages connections that exist between highly descriptive text and data to facilitate creation of data documents. Instead of requiring users to manually specify data-driven bindings using programming languages, the CrossData™ system infers and recommends connections that implicitly exist between text and data to the user during the writing process. These bindings, when coupled with a set of novel interaction techniques, enable users to easily select and update text-data connections. The CrossData™ system not only significantly reduces the manual effort needed to create data documents, but also enables an interactive reading experience for readers without any additional effort.
[0038] Perhaps the most closely related work to the inventive approach is Crosspower™, disclosed in International Patent Application PCT/US21/55058 (International Publication No. WO 2022/081891), which is incorporated herein by reference. Crosspower™ leverages desired correspondences between linguistic structures and graphical structures to allow users to flexibly and quickly create and manipulate graphical elements, as well as their layouts and animations. While the inventive CrossData™ scheme also supports content creation, it focuses on the domain of data documents, which involves a different set of interaction techniques to coherently address challenges that are often encountered when authoring data documents.
[0039] The inventive CrossData™ approach is based upon natural language processing (NLP) techniques but differs significantly from prior art NLP-based approaches. Highly descriptive text is viewed as another representation of the underlying data, so it is important to preserve the connections that exist between the text and data. These persistent connections are then leveraged to provide rich interactions that can be used during the writing process.
[0040] To better understand the general workflow, pain points, and best practices involved in creating data documents, a formative interview study was conducted. Eight professionals from various domains, including business services, e-commerce, accounting, banking, biomedical science, retail, and internet services, were interviewed (four female, aged 27-30). Each participant had three to seven years’ experience working in their current role. Their responsibilities included exploring, analyzing, and reporting data. Interviews were conducted remotely using videotelephony and lasted between 45 and 60 minutes.
[0041] During the interviews, the participants were asked to describe a recent memorable experience while writing data documents, common pain points, and how they resolved the situation. They were also asked to share their documents and tools through screen sharing, if possible. The interview ended with a questionnaire to collect demographic information. Four pilot interviews with another four professionals were conducted beforehand to develop the study protocol.
[0042] Interviews were audio-recorded, transcribed, and analyzed using a reflexive thematic analysis. The codes and themes were generated both inductively (i.e., bottom-up) and deductively (i.e., top-down), focusing on the workflow breakdowns, repetitive operations, and workarounds that occurred while writing data documents.
[0043] The general process of producing data documents involved data exploration and writing. During the exploration stage, participants cleaned, processed, and explored their data with a concrete goal or question assigned to them by their supervisor. Microsoft® Excel®, the widely-available spreadsheet, was the most common data tool used for this process. All participants said that when insights and findings were discovered within the data, they would “create or screenshot the table or chart (of the insights), insert it to a Microsoft® Word® document, and write a short description for it”. After accumulating enough insights, participants moved to the writing stage. All participants indicated that they frequently revisited the data during the writing process, as their original insights could be unclear, complicated, incorrect, obsolete, or unappealing to present. The document would often be reviewed, edited, and/or modified by collaborators, leading to additional data exploration. Thus, the writing processes were highly intermixed with data exploration. Finally, the document would be carefully reviewed alongside the data to ensure that there were no inconsistencies between the document and data before the final version was delivered.
[0044] During the process of generating the data documents, participants needed to retrieve data from the data analysis applications (e.g., Excel®) to incorporate into the authoring, i.e., word processing, applications they were using (e.g., Word®). All participants reported that the need for “frequent application switching and navigation to the data” led to significant problems within the retrieval process. For example, with Excel®, participants needed to first identify the correct datasheet, and then navigate within the sheet to locate the data they wanted to access. Participants would often use the “search” or “find” function to accelerate their navigation, which required them to remember specific data properties and navigation pathways when multiple matches were found. Once data was located, participants needed to transfer it to a text editor. While participants frequently relied on copy-and-paste operations to avoid transcription errors, they often needed to change the data format. For example, the process may involve converting large absolute values to abbreviated forms or performing simple calculations such as a ratio of change. This typically forced them to manually type the data into the document after performing the conversion or calculation which could require opening a third application to perform the calculation. Each of these steps was tedious and often had to be repeated several times during authoring, resulting in time-consuming and error-prone workflows.
[0045] To create an accurate finished document, it is of critical importance to ensure consistency between the document and its underlying data. Erroneous data reporting can insert delays into the finalization of an important document due to the need for additional review and revisions of the document by others. It can lead to negative performance evaluations for the person originally assigned to handle the project, and in a worst case scenario, inaccurate data can cause financial and reputational losses for a company. Professionals reported that the inconsistencies were usually caused by data updates. For example, one participant, a marketing manager, often started to draft a document before all data became available in order to meet deadlines. This required them to update their analysis and document as soon as new data became available. Another participant who worked in a financial services company frequently was required to update her documents when there were adjustments in model parameters. Whenever the underlying data was updated, all participants reported that they needed to “read through [their] documents carefully and fix the inconsistent content manually”, which was “inefficient and prone to error”. One commenter noted that the IT team in his company developed a plugin that synchronized the data between Excel® and Word® automatically, however, it required the user to manually connect cells in the spreadsheet to text in the document. Another commenter mentioned that a professional review team in her company would proofread her documents to highlight any inconsistencies. Overall, these methods were considered to be cumbersome, expensive, and time-consuming.
[0046] Participants reported that exploring different ways to present data was a common but time-consuming task. They needed to perform additional data exploration during the writing stage, because “only when I write down the data in the document, I know what’s the best way to present it”. One participant who worked as an operating officer in an IT company reported that she frequently needed to switch growth period data covered by presentations between yearly, quarterly, and monthly.
[0047] Exploring alternative data presentations was reported as being time-consuming, because participants often needed to repeat their analysis steps, create new tables and charts, and update the relevant text with new data. One commenter mentioned she always used tables or charts to show evidence for the insights reported in the text: “if I want to report a new metric, I will add one more column to the table.” Another commenter noted that to “add one more sentence” to introduce “the ratio of a group of users to all users”, he needed to go back to Excel®, perform multiple operations to re-create tables and charts, and then insert them into the document.
[0048] Participants reported that during the writing stage, they frequently had to go through multiple iterations on the presentation of data. Even the smallest changes could trigger significant ripple effects in the data reported in the text, as well as in the corresponding tables and charts. With such significant overhead, participants and their collaborators had to iterate on the document offline even when changes were suggested in real time, requiring additional meetings and discussions and thus hindering their collaborative process.
[0049] In summary, the formative study found that professionals encountered numerous issues during the process of writing data documents with mainstream tools and that they were forced to address these issues manually. They struggled while inputting the data into their documents, maintaining the consistency between their documents and data, and handling the numerous interconnected components during iterations. The findings indicate that the key reason for their tedious and ineffective workflows was the lack of connection between the text in data documents and the data in datasets. The solution is, thus, to create connections that could be maintained with minimal effort by the users.
[0050] When using text to describe data from a dataset in a document, a user establishes an abstract connection between the text and the data elements in their mind. A key insight from the formative study was that current tools require the user to mentally maintain these connections, leading to tedious, repetitive, and error-prone operations. The inventive solution is to reify these connections as persistent, first-class objects and leverage them to address the issues that occur during the writing process. To this end, two steps were undertaken: Step 1) a Connection Engine was developed to automatically establish and maintain these connections during writing processes, and Step 2) a set of interactions was designed based on these connections to tackle the issues identified in the formative study. The implementation presented in the following description focuses on tabular data, which is one of the more common data formats. Application of the inventive CrossData™ approach to other data formats will become apparent to those of skill in the art based upon this example.
[0052] In step 1 of the CrossData™ process, the Connection Engine establishes text-data connections. Given the text in a data document and an underlying dataset, the goal is to infer, establish, and maintain connections between the text in the document and the corresponding data in the data analysis application, e.g., Excel® worksheet or similar.
[0053] Referring to
[0054] Independent data phrases 202 directly report items (rows), attributes (columns), and values (cells) in the dataset. For example, in
[0055] Dependent data phrases 210 (item (c)) present the output of data operations that take other data phrases as arguments. A dependent data phrase can report data in the dataset or derived values that do not exist in the dataset. For example, the last term “1.0” (214) is calculated based on the other phrases and connects to the data dependently. The data operations to compute a dependent data phrase are described by keywords 212 (in blue text) such as “from”, “to”, “of”, and “increased”.
[0056] Referring to
[0057] To establish connections for independent data phrases, Connection Engine 302 generates potential independent phrases for P.sub.cur 306 by performing string matching of P.sub.cur with all strings in the dataset and synonym matching with all attribute names in the dataset. The synonym matching is achieved by calculating the similarity of the word embeddings provided by spaCy, an open-source industrial-strength NLP toolkit with built-in support for trainable pipeline components such as named entity recognition, part-of-speech tagging, dependency parsing, text classification, entity linking, and more. (spaCy is published under the MIT License.) All matches will then be returned as suggestions, ordered by their matching scores. When the writer selects a suggestion, an independent phrase will be inserted, creating a connection between the independent phrase and the underlying dataset. For example, if the writer selects “Jack” as their choice for “user”, the dataset for Jack will be connected.
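The matching step can be illustrated with a minimal sketch. This is not the engine's actual implementation: it substitutes stdlib `difflib` edit-distance scoring for both the exact string matching and the embedding-based synonym matching, and assumes a toy tabular dataset; all names here are illustrative.

```python
from difflib import SequenceMatcher

def suggest_independent_phrases(p_cur, dataset, threshold=0.6):
    """Return candidate independent phrases for the current phrase P_cur,
    ordered by matching score. Illustrative stand-in: a real engine would
    combine string matching with embedding-based synonym matching."""
    candidates = []
    # Gather every attribute name and cell value as a candidate string.
    strings = set(dataset["columns"])
    for row in dataset["rows"]:
        strings.update(str(v) for v in row)
    for s in strings:
        score = SequenceMatcher(None, p_cur.lower(), s.lower()).ratio()
        if score >= threshold:
            candidates.append((s, score))
    # Highest-scoring suggestions first.
    return [s for s, _ in sorted(candidates, key=lambda c: -c[1])]

dataset = {
    "columns": ["user", "score", "time"],
    "rows": [["Jack", 3.5, "2021-01"], ["Mary", 4.5, "2021-02"]],
}
print(suggest_independent_phrases("Jak", dataset))   # → ['Jack']
```

Selecting the suggestion "Jack" would then bind the inserted phrase to that row of the dataset.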
[0058] Since dependent data phrases are the result of data operations that take other phrases as arguments, Connection Engine 302 takes three steps to identify, assemble, and execute the data operations, and then returns the results of the data operations as suggestions to the writer. Selection of a suggestion by the writer will insert a dependent data phrase and establish a connection with the underlying data operation.
[0059] 1. (Step 322) Identifying data operations: To detect data operations, Connection Engine 302 matches words and phrases with keywords within a predefined operation dictionary. The dictionary is derived from Amar et al.’s work (“Low-level Components of Analytic Activity in Information Visualization”, in Proc. of InfoVis. IEEE, 2005, pp.111-117, incorporated herein by reference) which summarizes ten low-level analytical operations for data analysis. Table 1 below lists the ten operations defined by Amar et al.:
TABLE 1
  Retrieve Value           Determine Range
  Filter                   Characterize Distribution
  Compute Derived Value    Find Anomalies
  Find Extremum            Cluster
  Sort                     Correlate
[0060] The summarization by Amar et al. has been widely used in NLI systems to extract desired data operations from users’ input queries. An operation takes a few arguments as input and outputs either an item (row), an attribute (column), a value (cell), or a derived value of the underlying dataset. Table 2 lists the arguments, outputs, and keywords for seven operations implemented in the prototype system.
TABLE 2
Operation: Retrieve Value
  Arguments: row, column
  Output: value
  Keywords: be, report, at, from, rise, drop, increase, decrease, decline, fall, compare with, etc.
Operation: Filter
  Arguments: value, column (optional, defaults to the value's column)
  Output: rows
  Keywords: after, before, since, in, until, more, high, over, higher, greater, larger, bigger, under, less, lower, lesser, smaller, between, etc.
Operation: Find Extremum
  Arguments: rows, column (optional, defaults to all)
  Output: value
  Keywords: rank, max, maximum, highest, greatest, largest, biggest, most, min, minimum, smallest, lowest, least, heaviest, lightest, best, worst, etc.
Operation: Compute Derived Value
  Arguments: rows, column
  Output: value
  Keywords: median, average, mean, sum, total, etc.
Operation: Determine Range
  Arguments: rows, column
  Output: value
  Keywords: range, extent, from ... to ..., etc.
Operation: Find Anomalies
  Arguments: rows, column
  Output: value
  Keywords: outlier, except, apart from, etc.
Operation: Compare
  Arguments: row1, row2, column
  Output: value
  Keywords: compare, down, different from, etc.
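The keyword lookup implied by Table 2 can be sketched as a small dictionary scan. The dictionary below is a hand-built subset of the keywords, and prefix matching is a crude stand-in for the lemmatization a real NLP pipeline would perform; it is illustrative only.

```python
# Minimal operation dictionary keyed by operation name (subset of Table 2).
OPERATION_DICTIONARY = {
    "Retrieve Value": ["be", "report", "at", "from", "rise", "drop",
                       "increase", "decrease"],
    "Filter": ["after", "before", "since", "over", "under", "between"],
    "Find Extremum": ["max", "maximum", "highest", "min", "minimum", "lowest"],
    "Compute Derived Value": ["median", "average", "mean", "sum", "total"],
}

def identify_operations(text):
    """Return (keyword, operation) pairs detected in the text. A keyword may
    match several operations, so every match is kept; prefix matching is a
    crude stand-in for lemmatization ("increased" -> "increase")."""
    tokens = text.lower().split()
    matches = []
    for operation, keywords in OPERATION_DICTIONARY.items():
        for keyword in keywords:
            if any(token.startswith(keyword) for token in tokens):
                matches.append((keyword, operation))
    return matches

# Detects "from"/"increase" (Retrieve Value) and "average" (Compute Derived Value).
print(identify_operations("the average score increased from 3.5 to 4.5"))
```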
[0061] In the examples illustrated in the figures, keywords are shown with blue letters. In the example shown in
[0062] 2. (Step 324) Assembling data operations with arguments: As an operation needs arguments to compute output, the arguments of an operation can either be independent data phrases or the output of other operations. To infer the arguments for each operation, the input text is parsed as a constituency tree using the Berkeley Neural Parser through its integration with spaCy. (N. Kitaev, et al., “Multilingual Constituency Parsing with Self-Attention and Pre-Training”, In Proc. of ACL. ACM, 2019, pp. 3499-3505, incorporated herein by reference.) The Berkeley Neural Parser annotates a sentence with its syntactic structure by decomposing it into nested sub-phrases. Within a constituency tree, each node represents a text phrase in the sentence (e.g., a noun phrase (“NP”), verb phrase (“VP”), or prepositional phrase (“PP”)), with smaller phrases being deeper in the tree, i.e., the leaf nodes are words. Therefore, Connection Engine 302 uses a bottom-up order to recursively examine whether the independent data phrases and operations in a node can be assembled as a complete data operation, as well as whether data operations should be assembled as compounded data operations. Connection Engine 302 employs a rule-based method to achieve the examination, as explored in earlier NLI research. Specifically, Connection Engine 302 matches the set of phrases and their grammatical relationships (also provided by spaCy) of a node with pre-constructed rules, each of which describes the necessary arguments for a data operation and the required data types (i.e., item, attribute, or value) for the arguments.
[0063] 3. (Step 326) Executing data operations: Finally, Connection Engine 302 executes the data operation in the root node of the sentence to obtain the result. Since a keyword may match different operations, Connection Engine 302 employs a greedy strategy to enumerate all possible matched operations for a keyword and assemble them into complete operations. In Step 328, the engine returns all the results as dependent phrase candidates for the writer who, in Step 330, selects the appropriate or desired suggestion(s). In Step 332, the writer’s selection of a suggestion creates a persistent text-data connection between the document that is being created and the data record that supports the text within the document to which it relates, thus creating an interactive document (Step 334).
[0064] The pseudocode for assembling data operations to compute dependent phrases is provided below:
Input: the root node of the constituency tree of S.sub.former
Output: the operations to compute the dependent phrases

Function InferDepPhrase(node):
    // A leaf node represents a word in the sentence.
    // Return it if it is an operation or a data phrase.
    if node is a leaf then
        if node is an operation then return {node}, {}
        if node is a data phrase then return {}, {node}
        return {}, {}
    // Collect the output from the child nodes.
    Ops = {}
    DPs = {}
    foreach child_node in node do
        child_Ops, child_DPs = InferDepPhrase(child_node)
        Ops = Ops ∪ child_Ops
        DPs = DPs ∪ child_DPs
    // Assemble incomplete operations with arguments.
    complete_Ops = {}
    foreach incomplete_Op in Ops do
        // Check whether the incomplete operation and other operations
        // or data phrases can be assembled into a complete one.
        argument_Ops, argument_DPs = CanAssembleWith(incomplete_Op, Ops \ {incomplete_Op}, DPs)
        // If so, assemble them and update the variables.
        if argument_Ops or argument_DPs is not None then
            Ops = Ops \ (argument_Ops ∪ {incomplete_Op})
            DPs = DPs \ argument_DPs
            complete_Ops = complete_Ops ∪ {Assemble(incomplete_Op, argument_Ops, argument_DPs)}
    Ops = Ops ∪ complete_Ops
    if node is the root then
        return Ops
    else
        return Ops, DPs
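The bottom-up assembly pass can be rendered as a runnable toy. This sketch assumes a nested-list stand-in for the constituency tree, a hypothetical one-column dataset, and a single assembly rule (an operation keyword plus a column-typed data phrase form a complete operation); the names `DATA`, `OPERATIONS`, and `infer_dep_phrase` are illustrative, not from the actual system.

```python
# Toy dataset and operation table (hypothetical names and values).
DATA = {"score": [3.5, 4.5, 4.0]}
OPERATIONS = {"average": lambda col: sum(DATA[col]) / len(DATA[col])}

def infer_dep_phrase(node):
    """Bottom-up pass over a nested (constituency-like) tree.
    Leaves are words; inner nodes are lists of children.
    Returns (results, data_phrases, pending_operations) for the subtree."""
    if isinstance(node, str):              # leaf = a single word
        if node in OPERATIONS:
            return [], [], [node]          # pending operation keyword
        if node in DATA:
            return [], [node], []          # independent data phrase (column)
        return [], [], []
    results, dps, ops = [], [], []
    for child in node:                     # collect output of child nodes
        r, d, o = infer_dep_phrase(child)
        results += r; dps += d; ops += o
    # Assemble: a pending operation plus a column-typed data phrase
    # forms a complete operation whose output is a derived value.
    for op in list(ops):
        for dp in list(dps):
            results.append(OPERATIONS[op](dp))
            ops.remove(op)
            dps.remove(dp)
            break
    return results, dps, ops

tree = [["the", "average"], ["score"]]     # "the average score"
print(infer_dep_phrase(tree)[0])           # → [4.0], a derived value
```

The derived value 4.0 does not exist in the dataset, illustrating how a dependent phrase can report computed output.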
[0065] Referring to
[0066] Parsing the sentence as a constituency tree is a core step to generate dependent phrase candidates. However, a review of constituency trees for successful cases revealed that even if the constituency trees were parsed from incomplete sentences or parsed incorrectly, the connection engine could still output the correct candidates.
[0067] First, the constituency parsing is built based on a context-free grammar, which means the tree structure parsed from a segment of text is not dependent on its context. Thus, even if the sentence is incomplete, the engine can still leverage the constituency tree, the local structure of which will not change when new text is appended.
[0068] Second, the connection engine is sufficiently robust to handle incorrect constituency trees as it leverages: 1) existing independent data phrases selected by the user, and 2) redundant information in the constituency tree. For example,
[0069] Each operation needs arguments to compute the output. The arguments of an operation can either be independent data phrases or the output of other operations. (See, e.g., Table 2.) In the present embodiment using data in tabular format, the types of independent data phrases and output of operations can be row, column, or value. An incomplete operation will be assembled with the data phrases that match its argument types. The actual implementation of the operation detection and assembling was partially inspired by NL4DV, the natural language toolkit for data visualization available from the Georgia Institute of Technology. NL4DV is a Python package that takes as input a tabular dataset and a natural language query about that dataset. In response, the toolkit returns an analytic specification modeled as a JSON object containing data attributes, analytic tasks, and a list of Vega-Lite specifications relevant to the input query.
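The pre-constructed rules described above can be sketched as a simple type-checking table: each rule lists the argument types an operation requires, and an incomplete operation is assembled only when matching phrase types are available. The rule contents below are hypothetical examples, not the system's actual rule set.

```python
# Hypothetical rule table: required argument types per operation.
RULES = {
    "Retrieve Value": ("row", "column"),
    "Find Extremum": ("rows", "column"),
    "Compare": ("row", "row", "column"),
}

def can_assemble(operation, available_types):
    """True if the available phrase types satisfy the operation's rule.
    Each required type consumes one matching phrase from the pool."""
    pool = list(available_types)
    for required in RULES[operation]:
        if required in pool:
            pool.remove(required)
        else:
            return False
    return True

assert can_assemble("Retrieve Value", ["row", "column"])
assert not can_assemble("Compare", ["row", "column"])   # needs two rows
```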
[0070] CrossData™ leverages the text-data connections found by the Connection Engine to provide novel interactions that address the issues identified in the formative study, thus enabling users to efficiently retrieve, compute, explore data, and adjust tables and charts during the writing of data documents, while automatically maintaining data consistency between the text, data, tables, and charts.
[0071] Connections for Inputting Data: The formative study found that data retrieval is tedious and must be repeated several times when authoring data documents. Professionals manually retrieved data from data analysis tools (e.g., Excel®), leading to issues with application switching, navigating data, and transferring data into writing tools (e.g., Microsoft® Word®). To address these issues, several interactions were designed that enable users to leverage the output of the Connection Engine.
[0072] Retrieving Data: As a user types in the text editor, CrossData™ automatically runs the Connection Engine 302 to detect the connections. Referring to
[0073] Computing Values: Occasionally, the user needs to compute and input values that do not exist in the dataset. CrossData™ detects these dependent connections and calculates their derived value using the Connection Engine 302. As shown in
[0074] Using Placeholders: An issue when retrieving or computing data in a written sentence, which differs from the command-like sentences in other NLI systems, is that the data one may want to retrieve or compute could need to be input before its dependencies are retrieved or computed. CrossData™ thus provides a set of placeholders, such as “Diff” (difference), “Ratio”, and “Count”, which the writer can employ to indicate expected data types. For example, in
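The placeholder mechanism can be sketched as follows. The placeholder names ("Diff", "Ratio", "Count") come from the text, but the resolution logic shown is an assumption about how such placeholders might be computed once their dependencies are available.

```python
# Illustrative placeholder table; the computations are assumed semantics.
PLACEHOLDERS = {
    "Diff":  lambda a, b: b - a,        # difference between two values
    "Ratio": lambda a, b: b / a,        # ratio of change
    "Count": lambda *values: len(values),
}

def resolve(placeholder, *args):
    """Compute a placeholder's value once its dependencies (the other data
    phrases it refers to) have been retrieved; until then the document
    keeps showing the literal placeholder text."""
    return PLACEHOLDERS[placeholder](*args)

print(resolve("Diff", 3.5, 4.5))   # → 1.0, matching the example in the text
```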
[0075] Fixing Misdetections: In some situations, it is possible that CrossData™ may retrieve or calculate incorrect data for dependent data phrases. The incorrectness might be the result of mis-detected dependencies (i.e., wrong input) or operation keywords (i.e., wrong tasks). Referring to
[0076] Connections to Maintain Consistency: The formative interviews demonstrated that most of the professionals manually maintained consistency between their text and data and considered this process to be time-consuming and error-prone. With the help of preserved connections, CrossData™ can update data phrases and highlight problematic operation keywords to help users maintain consistency.
[0077] Data-driven Updates: Whenever a data element within the underlying dataset is updated, CrossData™ automatically updates all independent and dependent phrases that connect to the data element. In the example shown in
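The data-driven update behavior described above can be sketched as a connection object that stores the operation and its binding into the dataset, so the dependent phrase can be recomputed whenever the underlying data changes. Class and attribute names here are assumptions for illustration.

```python
# Sketch of a persistent text-data connection: the phrase keeps a live
# reference to the dataset plus the operation that produced its value.
class Connection:
    def __init__(self, dataset, column, operation):
        self.dataset = dataset          # list of row dicts (the spreadsheet)
        self.column = column            # bound attribute (column)
        self.operation = operation      # callable over the column values

    def value(self):
        # Recompute from the current dataset state on every read.
        return self.operation([row[self.column] for row in self.dataset])

dataset = [{"country": "A", "cases": 100}, {"country": "B", "cases": 300}]
phrase = Connection(dataset, "cases", sum)  # dependent phrase "total cases"
phrase.value()                              # -> 400

dataset[0]["cases"] = 150                   # the writer edits the dataset ...
phrase.value()                              # -> 450, the phrase updates itself
```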
[0078] Operation Keywords Checker: Inconsistencies can also occur between the operation keywords and the data. For example, when changing the score of the first row from “3.5” in table 802a (item (a)) to “4.5” in table 802b (item (d)), the operation keyword “increase” is inconsistent with the data. However, unlike data phrases, updating operations can be challenging because operation phrases are usually text descriptions. In such cases, CrossData™ may highlight the problematic operation keyword 812 to alert the writer. In the illustrated example, a red wavy underline (item (g)) is shown.
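The keyword check described above can be sketched as a predicate that compares a trend keyword against the sign of the actual change; when the predicate fails, the system would highlight the keyword rather than rewrite the text. The function name and keyword set are assumptions.

```python
def check_trend_keyword(keyword, old_value, new_value):
    """Return False when a trend keyword contradicts the data it describes."""
    delta = new_value - old_value
    expected = {"increase": delta > 0, "decrease": delta < 0}
    return expected.get(keyword, True)  # unrecognized keywords pass unchecked

# The score changed from 3.5 to 4.5, so "decrease" should be flagged
# (e.g., with a red wavy underline), while "increase" remains consistent.
check_trend_keyword("decrease", 3.5, 4.5)  # -> False: highlight the keyword
check_trend_keyword("increase", 3.5, 4.5)  # -> True
```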
[0079] When iterating on a data document, writers frequently change various elements in their document. While the interaction techniques introduced above can alleviate the overhead of retrieving values and maintaining consistency during iteration, a pressing and unaddressed challenge is the cascading effects that occur when changes are made to text, tables, and charts.
[0080] The inventive CrossData™ approach addresses this challenge by reifying text-data connections as interactive objects, which enable users to manipulate them to iterate on data documents and explore new insights directly in a document. Because the data phrases, tables, and charts are all connected with the underlying data, the necessary changes can be automatically performed without additional user effort.
[0081] Interacting with Data-Driven Text: Text phrases that are connected with underlying data can be interactively manipulated. Independent phrases represent an item (row), attribute (column), or value (cell) within the spreadsheet. Referring to
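The binding of independent phrases to spreadsheet elements described above (an item as a row, an attribute as a column, a value as a cell) can be sketched with simple string matching. The return format and the choice of "country" as the item-identifying column are assumptions for illustration; the actual engine also uses synonym matching.

```python
# Sketch: resolve an independent phrase to a row, column, or cell.
def match_phrase(phrase, table, item_column="country"):
    """table: dict mapping column name -> list of cell values."""
    p = phrase.strip().lower()
    for attr in table:
        if p == attr.lower():
            return ("attribute", attr)            # column match
        for i, cell in enumerate(table[attr]):
            if p == str(cell).lower():
                if attr == item_column:
                    return ("item", i)            # row match via item column
                return ("cell", attr, i)          # individual cell match
    return None

table = {"country": ["Brazil", "Peru"], "cases": [300, 120]}
match_phrase("Peru", table)    # -> ("item", 1)
match_phrase("cases", table)   # -> ("attribute", "cases")
match_phrase("120", table)     # -> ("cell", "cases", 1)
```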
[0082] Writers often need to iterate on the metrics they use to report on their data, such as changing the average value to the median value or from a daily basis to a weekly basis. CrossData™ allows writers to interactively alter operation keywords to achieve such goals. For example, by hovering the pointer over keyword 906 (item (a)), the writer can click and change the “mean” to another computation such as “total”, “maximum”, or “median”. The available operation keyword alternatives may be predefined within a curated dictionary. (See, e.g., Table 2.)
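The interactive swapping of operation keywords described above amounts to re-dispatching the same bound arguments through a different entry of the alternatives dictionary. The table below is a hypothetical stand-in for the curated dictionary of Table 2.

```python
import statistics

# Hypothetical keyword-alternatives table; the real curated dictionary
# (see Table 2) defines which computations are interchangeable.
ALTERNATIVES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "total": sum,
    "maximum": max,
}

scores = [3.0, 4.0, 8.0]
ALTERNATIVES["mean"](scores)    # -> 5.0, the value originally in the text
ALTERNATIVES["median"](scores)  # -> 4.0, after the writer clicks "median"
```

Because the connection stores the bound arguments rather than the computed value, only the dictionary lookup changes when the writer picks a different keyword.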
[0083] Automatic Adjustments of Tables and Charts: Because the text, tables, and charts embedded in a document are all connected to their underlying data, CrossData™ automatically updates tables and charts with the text to ensure the textual descriptions and data visualizations are consistent. Referring to
[0084] Similarly, embedded charts may also be synchronized with textual descriptions. CrossData™ automatically updates the charts if different data properties are reported in the text. For example, when the writer switches the reporting of new infection cases from daily, as shown in
[0085] Connection Engine Evaluation: The effectiveness of the CrossData™ approach depends on whether the Connection Engine can suggest the correct data phrases to the user. A technical evaluation was conducted to assess the accuracy and robustness of the Connection Engine.
[0086] Methodology: The goal of the evaluation is to assess whether the Connection Engine can suggest the correct data phrases based on the text in the writing process. Because independent data phrases are suggested based on string matching, which is usually highly accurate, we focused on evaluating the generation of dependent data phrases. Specifically, we gathered a corpus of sentences together with their corresponding datasets. For each sentence, we manually labeled all independent data phrases with their connections to the datasets as part of the input, and all dependent data phrases as ground truth. We then input each sentence word by word into the Connection Engine to simulate a realistic writing experience and compared the suggested dependent phrases against the ground truth. The experiment was run on an Apple® MacBook® Pro with a 2.2 GHz Intel® i7 CPU.
[0087] Dataset: We collected sentences from 10 data documents from reputable public sources that cover multiple domains, such as World Health Organization, Bureau of Labor Statistics, Pew Research Center, National Center for Education Statistics, National Institutes of Health, California Department of Public Health, and a private company, as well as their corresponding datasets. We sampled the sentences by: 1) manually filtering all sentences that reported data in the documents, and 2) randomly sampling no more than 30 sentences from each document. For each sentence, we manually labeled the independent and dependent phrases. In total, the corpus contained 206 sentences (5398 words), with 807 independent phrases and 529 dependent phrases.
[0088] Metrics: We measured the ratio of correct dependent data phrases recommended by the Connection Engine to the total number of dependent data phrases. When the engine returned multiple candidates for a dependent phrase, we counted it as correct if the top 5 candidates contained the correct one. We also measured the time to compute the candidates.
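The top-5 accuracy metric described above can be sketched directly; a dependent phrase counts as correct when its ground-truth value appears among the first k candidates the engine returns. The example lists below are illustrative, not data from the evaluation.

```python
def top_k_accuracy(suggestions, ground_truth, k=5):
    """Fraction of dependent phrases whose correct value appears among
    the top-k candidates returned by the engine."""
    hits = sum(1 for cands, truth in zip(suggestions, ground_truth)
               if truth in cands[:k])
    return hits / len(ground_truth)

# Two of three phrases have their ground truth in the candidate list.
suggestions = [["43%", "89%"], ["Brazil", "Peru", "Chile"], ["12"]]
ground_truth = ["89%", "Peru", "15"]
top_k_accuracy(suggestions, ground_truth)  # -> 0.666...
```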
[0089] Results: The accuracy of the dependent phrases was 88.8% (i.e., 470 correct cases), which demonstrates the robustness and accuracy of the Connection Engine. Among these correct cases, the majority were computed by the compounded operation of filtering and retrieving values (i.e., 262 cases, 55.7%), the finding-extremes operation (i.e., 62 cases, 13.2%), the compounded operation of finding extremes and retrieving values (i.e., 61 cases, 13.0%), and the compounded operation of finding extremes and comparing values (i.e., 48 cases, 10.2%). This echoes the findings from the formative study discussed above, reflecting that the data retrieval operation was prevalent in real-world data documents. The average time to generate candidates was 0.3 seconds, which is sufficient for interactive use and could be further reduced with better implementations.
[0090] We further investigated the failure cases and identified three major reasons for these failures. Note that a failure may be caused by multiple factors.
[0091] Error Type 1: Lack of Context (i.e., 50.8% of cases): Among the failure cases, most cases (i.e., 31) failed because certain expressions, e.g., “it”, “these”, “previous years”, referred to other data phrases. For example, with the sentence “These three countries comprised 89% of all cases reported in the region”, to compute the “89%”, the Connection Engine needed to know which countries “These three countries” referred to. In this example, the three countries were mentioned in previous sentences as independent phrases. This problem, however, can be addressed by employing co-reference resolution, i.e., finding expressions that refer to the same entity within or between sentences, a technique that has advanced considerably in recent years. The Connection Engine can integrate co-reference resolution models to connect data phrases in previous sentences to the present one, thereby maintaining the context needed to infer text-data connections. (See, e.g., K. Lee, et al., “End-to-end Neural Coreference Resolution”, In Proc. of ACL. ACL, 2017, pp. 188-197, incorporated herein by reference.)
[0092] Error Type 2: Expected Textual instead of Numerical Outputs (i.e., 27.9% of cases): Seventeen cases failed because the expected output was a text description rather than a number. For example, in “Two in five e-cigarette users reported usually paying for their own e-cigarettes”, the expected output was “Two in five” while the engine returned “43%”. To address this issue, the Connection Engine could generate more candidates with different formats, or adopt more advanced generative language models, such as GPT-3, described by T. B. Brown, et al., in “Language Models are Few-Shot Learners”, In Proc. of 34th Conf. on Neural Information Processing Systems (NeurIPS 2020), incorporated herein by reference. Note that while the data formats of the suggested phrases did not match the ground truth, the underlying data operations inferred by the Connection Engine were correct. This means that the Connection Engine could accurately infer 91.9% of all data operations.
[0093] Error Type 3: Uncovered Operations (i.e., 21.3% of cases): Thirteen cases failed because the required data operations in the sentences were not covered by the 10 low-level data operations summarized by Amar et al. In the example “Cases have decreased steeply for the past four weeks”, computing the “four weeks” is a high-level analytical task (i.e., given a column and a text description of the trend, report the range of rows that fulfill the trend), which was not supported by the prototype system used in the evaluation. Considering the rule-based nature of the Connection Engine, these cases can be addressed by extending the predefined operation dictionary and corresponding rules.
[0094] To summarize, the performance evaluation showed that the Connection Engine was robust enough to achieve a high accuracy when generating dependent phrases about a set of real-world sentences collected from multiple domains. The in-depth analysis indicated that most of the failure cases could be corrected by extensions to the prototype engine used in the evaluation.
[0095] The CrossData™ system was developed as a technology tool to exploit the notion of language-oriented data bindings. It was recognized that the system might initially create usability problems for writers who are familiar with existing tools. To gain feedback about the effectiveness of our approach without being bogged down by the initial usability challenges some writers may encounter, we conducted an expert evaluation study that focused on collecting experts’ feedback about the usefulness of each interaction technique and how language-oriented authoring could facilitate the overall workflow of authoring data documents.
[0096] Participants and Apparatus: Eight participants were recruited to participate in the study (E1 - E8, 5 female, age 28 - 31). The group included 1 auditor (accounting), 1 operation officer (internet services), 1 investment banking associate (financial services), 1 due diligence consultant (business services), 2 marketing managers (internet services and retail) and 2 researchers (data science and public health). E1-E5 participated in the formative study. All participants had more than 5 years of experience analyzing data and writing data documents as part of their daily work. The most used data processing and writing tools included Microsoft® Excel®, Google® Sheets®, Microsoft® Word®, Google® Docs®, and Tableau® (Tableau Software, LLC). The study was conducted remotely with CrossData™ implemented as a responsive Web application that participants could directly access from their personal computers. Video conferencing was used to communicate with participants, share screens, and record the study. Participants received $60 (USD) for the approximately 90-minute session.
[0097] Each evaluation session included four phases:
[0098] Introduction and Training (30 mins): The experimenter first introduced the study protocol, research motivation, and concepts of CrossData™. Then, the experimenter walked the participants through the system with an example that contained two datasets that were presented as a table and a bar chart, and five insights to report. Participants were encouraged to ask questions anytime during the process. Participants were then asked to replicate the example to become familiar with the system.
[0099] Reproduction Task (15 mins): Participants were asked to reproduce a given data document, which presented a USA COVID-19 dataset with a multiple line chart and six sentences, each of which reported an insight. The original datasets, a multiple line chart, and a choropleth map were provided as the context for the insights.
[0100] Creation Task (20 mins): Participants were asked to write a short document to report on three datasets about Global COVID-19 cases. Each dataset included one data representation (i.e., a chart or a table) and three insights. The short document needed to contain at least one insight from each dataset, and one data representation. To simulate realistic iterative processes, after the participants finished the document, the experimenter asked them to iterate on the document by 1) reporting two more insights, 2) inserting one more chart or table, and 3) changing the data phrases or operators in the documents. The changes to the data phrases or operators were selected to ensure that the participants experienced all of the proposed interaction techniques.
[0101] Semi-structured Interview and Questionnaire (25 mins): After the creation task, participants completed a questionnaire that probed the usefulness and usability of the techniques using a 5-point Likert scale (i.e., 1 - Strongly Disagree, 5 - Strongly Agree). Then, the experimenter conducted a semi-structured interview to further collect feedback about the utility of each interaction technique, CrossData™’s effectiveness in supporting realistic workflows, limitations of the proposed techniques, and potential improvements.
[0102] Results: All participants successfully finished the reproduction and creation tasks. On average, each participant wrote 12.6 sentences and 123.3 words, which contained 22.1 independent and 13.6 dependent data phrases. All participants experienced all the proposed interaction techniques.
[0103] The following discusses how the proposed interactions: 1) addressed the issues identified in the formative study; 2) could improve participants’ current authoring workflows; and 3) could be extended for data exploration and to enable new workflows that bridge the gap between the writing and data exploration stages. Also discussed are observed behaviors that suggested future improvements for real-world usage.
[0104] Utility of Text-Data Connections: Referring to
[0105] Participants also responded positively (4/8 strongly agree, 4/8 agree) to the techniques designed to maintain consistency between data and text. These techniques helped users “ensure consistency” with “fewer manual efforts”. One commenter offered that these techniques could help her company “reduce human resource costs on the review team”.
[0106] The interactive techniques that facilitated iteration via interaction with data-driven text (5/8 strongly agree, 3/8 agree) and the automatic adjustments of tables (5/8 strongly agree, 3/8 agree) and charts (5/8 strongly agree, 3/8 agree) were also praised by participants because these techniques could “significantly reduce working back-and-forth” and enabled participants to “rapidly refine the charts [and tables].” Several participants remarked that the interactivity of the text, as well as the real-time synchronization between text, table, and charts, made the authoring process “fun and engaging”, but also could assist in thought processes and inspire more ideas during writing as the user can “see what he is writing”.
[0107] Authoring Workflow vs Traditional Tools: All participants agreed that the interactions provided by CrossData™ would mesh well with their current workflows (4/8 strongly agree, 4/8 agree), e.g., “you just need to write as usual.”. They further commented that these interaction techniques did not require installing another application and could be easily integrated within existing tools by “installing [them as] a plugin to my Word”.
[0108] All participants found that the interaction techniques could streamline their workflows due to “less context switching” and allow for efficient iterations of a document. A commenter noted that she used to frequently switch between “Excel, Word, and sometimes the calculator” during the writing process, which was “stressful and distracting.” By integrating CrossData™ with the existing tools, the participant could “concentrate on her writing”, and “focus on the current writing without worrying about refining or updating other sentences.”
[0109] Another improvement to participants’ workflows that was mentioned was “facilitating the process of getting feedback from others.” Mainstream tools such as Word® and PowerPoint® present reports in a static manner and thus hinder authors from addressing or responding to others’ feedback immediately, whereas the features provided by CrossData™ “make it very useful to answer ad-hoc questions during the discussions that would normally require some follow up work, e.g., swap out regions, look at percentage changes between different time periods, etc.”
[0110] In terms of the negative impacts these techniques may have on their workflows, one person noted that “perhaps the only cost is to learn how to use [them]”. Specifically, “you need to understand the concepts and get familiar with, for example, placeholders”. Nevertheless, as reflected in the results shown in
[0111] Enabling New Workflows to Bridge the Gap between Data Exploration and Writing: While CrossData™ was designed to support the writing stage, the intertwined nature of exploration and writing inspired participants to imagine CrossData™ beyond the presented tasks. Several additional benefits were suggested that could be enabled by the language-oriented techniques to facilitate data analysis and exploration.
[0112] First, natural language allows expression of reusable high-level goals instead of performing transient low-level operations, thereby improving the efficiency of data exploration. One commenter noted that with the compute value technique provided by CrossData™, he could efficiently calculate a value by typing a sentence instead of having to “scroll up and down in a sheet and brush and re-brush the cells.” Moreover, he suggested that the exploration process could be easily reused for different data by copying and pasting the text, i.e., “I can write text to retrieve and calculate values, and then copy the text to another sheet to get new values ... this is impossible in Excel since I cannot copy my interactions on one sheet to another.”
[0113] Second, CrossData™ could facilitate active thinking during the exploration process. One participant found that the suggestion list and interactive operators inspired them to explore the data from new perspectives that had not been recognized previously. They remarked that the suggested text was similar to the query recommendations in search engines. Another commenter explained that sometimes they stopped data exploration because it required too many tedious operations with Excel, i.e., “exploration is a process of thinking rather [than] operating the Excel. . . I will definitely explore more if only a few clicks or types are required.”
[0114] Third, language-oriented data exploration enabled users to “record their exploration process as [a] draft” and naturally “shift from data exploration to writing.” All participants confirmed that there was a gap between data exploration and presentation in their current workflow, which has been recognized in prior work as an important research direction to improve the workflow of data analysts. One commenter noted that these “two interconnected stages [i.e., data exploration and communication,] were usually separated in two disconnected applications.” With language-oriented interaction techniques, however, data exploration and data document authoring can be tightly integrated such that “exploring [the data] is drafting [the document] and vice versa.”
[0115] Several interesting behaviors were observed that reflected participants’ real-world writing practices that were not supported by the prototype system.
[0116] First, when the data operations were simple, participants tended to directly type the result, which could result in untracked connections. For example, when writing “The U.S. reports the most new cases in America”, one participant manually typed “The U.S.” instead of using the placeholder feature. This was because the participant already knew the desired data, and inserting a placeholder required more effort. The result, however, was that the “The U.S.” text would not be updated when the participant was asked to modify “America” to “Africa”, resulting in data inconsistency due to the missing connection. While the Connection Engine is currently designed to interactively recommend data phrases, it could be extended to detect and connect manually typed dependent phrases, ensuring that all data phrases are connected with the underlying dataset.
[0117] We also observed that some participants reported approximate numbers instead of exact data values, which caused undesired suggestions from the engine. For example, one participant wrote that “[Placeholder] countries in America report more than 10,000 ...” He wanted to connect “10,000” with the new cases column. However, because “10,000” is an approximate number that did not exist in the new cases column, the Connection Engine could not return suggestions, as it relies on string and synonym matching to suggest independent phrases. The writer then struggled to connect the “10,000” with the new cases column. Such behavior was also observed in other participants. While participants altered the approximate numbers to exact values to create connections, this issue could be common in real-world scenarios. To address this, CrossData™ could be extended to allow users to manually insert their desired connections or to support fuzzy data value matching when certain keywords are present, such as “almost” and “more than”.
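The fuzzy data value matching suggested above can be sketched as a set of modifier-keyword predicates applied against a target number. The modifier list and numeric thresholds below are assumptions for illustration, not part of the described system.

```python
# Hypothetical modifier predicates gating fuzzy matches against a target.
MODIFIERS = {
    "more than": lambda cell, target: cell > target,
    "almost":    lambda cell, target: 0.9 * target <= cell < target,
    "about":     lambda cell, target: abs(cell - target) <= 0.1 * target,
}

def fuzzy_match(modifier, target, column):
    """Return the column values that satisfy the modifier w.r.t. target."""
    predicate = MODIFIERS[modifier]
    return [v for v in column if predicate(v, target)]

new_cases = [9500, 12000, 80000, 4000]
fuzzy_match("more than", 10000, new_cases)  # -> [12000, 80000]
fuzzy_match("almost", 10000, new_cases)     # -> [9500]
```

With such predicates, "more than 10,000" could bind to the new cases column even though the literal value 10,000 never appears in it.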
[0118] Third, the participants tended to write safe, simple sentences to ensure the connections would be created successfully during writing. Overall, the sentences were relatively simple and had similar structures to the sentences in the training and reproduction tasks. While this could be attributed to the limited time frame of the task, it is possible that participants faced a dilemma when guessing which written text the system could understand and use to establish connections. Such an issue has been recognized as a long-standing challenge for users of NLI systems. To address this issue, the system could provide alternative methods (e.g., interface actions) to allow users to manually create text-data connections instead of fully relying on the auto-extraction of connections from the text. Several participants confirmed this improvement would be useful and necessary in their interviews, indicating that “the system should enable users to create or modify the connections after the writing.”
[0119] Participants noted some limitations of the CrossData™ system and suggested some improvements. Similar to other interactive systems that employ NLP, CrossData™ can misinterpret users’ intentions for the reasons discussed above relating to failure case analyses and observed behavior (e.g., lack of context, unrecognized approximate numbers). While CrossData™ allows users to correct misdetections caused by predefined rules, it does not support the correction of errors caused by NLP techniques. All participants expressed concern regarding this limitation but understood that such errors could be mitigated by further advancements in NLP techniques, more intelligent connection recognition algorithms, and the ability to flexibly modify the suggested connections.
[0120] Participants also proposed improvements relating to extensibility and customizability. For example, CrossData™ could support customized operators and calculations or enable users to import domain-specific operators from online libraries. Also, CrossData™ should enable users to share their customized operators with others to facilitate collaborative editing. In addition, the system should enable users to “freeze” connections so that they could rephrase sentences without worrying about losing any connections.
[0121] Several participants also raised concerns about scalability. For instance, an auditor, who often needed to write data documents to synthesize findings from more than 50 datasets, noted that connecting a phrase to all underlying datasets could lead to too many possible connections. A potential solution to this could be to add a context-awareness mechanism to CrossData™ so that it could prune the search space based on one’s writing context, e.g., the surrounding sentences, tables, charts, and section titles.
[0122] The examples described herein are directed to the connection of text to tabular data, wherein each data item is represented as a row and its attributes are represented as columns. While tabular data is common in practice, it does not naturally contain information about the rich relationships that exist among data items arranged within graph-based or tree-based data structures. Using a similar approach to the table formats, connections can be formed between text and rich data structures. The data visualizations currently supported within CrossData™ are basic charts (e.g., line and bar charts), however, a similar approach can be extended to support customized, complex data visualizations. This requires the identification of mappings between the natural human language used in data documents and the domain-specific terms used during data analysis and visualization processes. To develop such mappings, existing data documents can be annotated to describe or contain various data structures and visualizations.
[0123] Expanding the scope of a “document” beyond its conventional definition, the act of creating a work of authorship can be extended to programming for data analysis and visualization. Beyond graphical user interface applications, programming is another commonly used modality for data analysis and visualization. For example, computational notebook applications, which enable users to write programs to analyze and visualize data, are becoming increasingly popular. A common practice when using computational notebooks is to write explanatory textual descriptions alongside a program’s code to facilitate documentation and collaboration. This presents an opportunity to extend the use of written text for data analysis and visualization. Thus, one future direction could be to integrate CrossData™ into computational notebooks, so that users can analyze and visualize data by writing descriptive and self-explanatory text without requiring programming skills.
[0124] While CrossData™ leverages text-data connections to support the authoring of static data documents, the resulting data documents were interactive, suggesting opportunities to create interactive documents without any programming. The CrossData™ system can be expanded to support the creation of data-driven diagrams and simulations. Similarly, other forms of dynamic and interactive presentations of data can be created with text-data connections, such as data videos and data animations. For example, the connections between text with tables and charts can be directly employed to insert animated changes in tables and charts that correspond with the narration of animation, videos, or slideshows.
[0125]
[0126] Memory subsystem 1212 includes one or more devices for storing data and/or instructions for processing subsystem 1210 and networking subsystem 1214. For example, memory subsystem 1212 can include dynamic random access memory (DRAM), static random access memory (SRAM), and/or other types of memory. In some embodiments, instructions for processing subsystem 1210 in memory subsystem 1212 include: program instructions or sets of instructions (such as program instructions 1222 or operating system 1224), which may be executed by processing subsystem 1210. Note that one or more computer programs or program instructions may constitute a computer-program mechanism. Instructions in the various program instructions in memory subsystem 1212 may be implemented in: a high-level procedural language, an object-oriented programming language, and/or in an assembly or machine language. Furthermore, the programming language may be compiled or interpreted, e.g., configurable or configured (which may be used interchangeably in this discussion), to be executed by processing subsystem 1210.
[0127] In addition, memory subsystem 1212 can include mechanisms for controlling access to the memory. In some embodiments, memory subsystem 1212 includes a memory hierarchy that comprises one or more caches coupled to a memory in computer 1200. In some of these embodiments, one or more of the caches is located in processing subsystem 1210.
[0128] In some embodiments, memory subsystem 1212 is coupled to one or more high-capacity mass-storage devices (not shown). For example, memory subsystem 1212 can be coupled to a magnetic or optical drive, a solid-state drive, or another type of mass-storage device. In these embodiments, memory subsystem 1212 can be used by computer 1200 as fast-access storage for often-used data, while the mass-storage device is used to store less frequently used data.
[0129] Networking subsystem 1214 includes one or more devices configured to couple to and communicate on a wired and/or wireless network (i.e., to perform network operations), including: control logic 1216, an interface circuit 1218 and one or more antennas 1220 (or antenna elements). (While
[0130] Networking subsystem 1214 includes processors, controllers, radios/antennas, sockets/plugs, and/or other devices used for coupling to, communicating on, and handling data and events for each supported networking system. Note that mechanisms used for coupling to, communicating on, and handling data and events on the network for each network system are sometimes collectively referred to as a ‘network interface’ for the network system. Computer 1200 may use the mechanisms in networking subsystem 1214 for performing simple wireless communication between electronic devices, e.g., transmitting advertising or beacon frames and/or scanning for advertising frames transmitted by other electronic devices.
[0131] Within computer 1200, processing subsystem 1210, memory subsystem 1212, and networking subsystem 1214 are coupled together using bus 1228. Bus 1228 may include an electrical, optical, and/or electro-optical connection that the subsystems can use to communicate commands and data among one another. Although only one bus 1228 is shown for clarity, different embodiments can include a different number or configuration of electrical, optical, and/or electro-optical connections among the subsystems.
[0132] In some embodiments, computer 1200 includes a display subsystem 1226 for displaying information on a display, which may include a display driver and the display, such as a liquid-crystal display, a multi-touch touchscreen, etc. Further, computer 1200 may include a user-interface subsystem 1230, such as: a mouse, a keyboard, a trackpad, a stylus, a voice-recognition interface, and/or another human-machine interface.
[0133] Computer 1200 can be (or can be included in) any electronic device with at least one network interface. For example, computer 1200 can be (or can be included in): a desktop computer, a laptop computer, a subnotebook/netbook, a server, a supercomputer, a tablet computer, a smartphone, a cellular telephone, a consumer-electronic device, a portable computing device, communication equipment, and/or another electronic device.
[0134] Although specific components are used to describe computer 1200, in alternative embodiments, different components and/or subsystems may be present in computer 1200. For example, computer 1200 may include one or more additional processing subsystems, memory subsystems, networking subsystems, and/or display subsystems. Additionally, one or more of the subsystems may not be present in computer 1200. In some embodiments, computer 1200 may include one or more additional subsystems that are not shown in
[0135] The foregoing description is intended to enable any person skilled in the art to make and use the disclosure and is provided in the context of a particular application and its requirements. Further, the foregoing descriptions of embodiments of the present disclosure have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present disclosure to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Additionally, the discussion of the preceding embodiments is not intended to limit the present disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.