Systems and Methods for Automatic URL Identification From Data
20230161831 · 2023-05-25
Assignee
Inventors
- Swapnil Singh (Bellevue, WA, US)
- María Dolores Guerrero (Seattle, WA, US)
- Vasudev Daruvuri (Bothell, WA, US)
CPC classification
G06F7/08
PHYSICS
G06F16/9566
PHYSICS
International classification
G06F16/955
PHYSICS
G06F7/08
PHYSICS
G06F16/2458
PHYSICS
Abstract
Systems and methods for automatic URL identification from data are provided. The system receives and processes one or more sources of data, such as merchant data, and processes the input data to identify one or more URLs present in the data. The identified URLs are automatically validated by the system using one or more fuzzy and/or exact matching algorithms. The validation could be performed by matching one or more non-URL data items, such as a business name, address, e-mail, country, zip code, or any other suitable non-URL data item, to ensure that only valid URLs are identified. Once the URLs are validated, a report is generated by the system.
Claims
1. A method for automatic identification of a Uniform Resource Locator (URL) from data, comprising the steps of: receiving a data item at a processor; processing the data item to identify at least one URL present in the data item; processing the at least one URL using at least one of a fuzzy matching algorithm or an exact matching algorithm to validate the at least one URL; and if the at least one URL is validated by the fuzzy matching algorithm or the exact matching algorithm, generating and transmitting an output file that includes the at least one URL.
2. The method of claim 1, further comprising matching one or more non-URL data items to ensure that only valid URLs are identified.
3. The method of claim 1, further comprising pre-processing the data item to perform at least one of sorting the data item according to a client-level sorting or standardizing columns in the data item.
4. The method of claim 1, further comprising identifying merchant metadata associated with the data item and validating the merchant metadata against one or more IP addresses.
5. The method of claim 4, further comprising cross-checking the at least one URL based on one or more of a business name or an e-mail address.
6. The method of claim 1, further comprising validating the at least one URL utilizing at least one matching criterion including one or more of a business name, a merchant name, a doing business as (DBA) name, a transacting business as (T/A) name, a street address, a city, a postal code, a state, a province, a telephone number, an e-mail address, a name, a business description, or a county.
7. The method of claim 1, further comprising scoring the at least one URL to determine relevancy of the at least one URL to metadata associated with a merchant.
8. A system for automatic identification of a Uniform Resource Locator (URL) from data, comprising: a database storing at least one data item; and a processor in communication with the database, the processor programmed to perform the steps of: receiving the data item; processing the data item to identify at least one URL present in the data item; processing the at least one URL using at least one of a fuzzy matching algorithm or an exact matching algorithm to validate the at least one URL; and if the at least one URL is validated by the fuzzy matching algorithm or the exact matching algorithm, generating and transmitting an output file that includes the at least one URL.
9. The system of claim 8, wherein the processor is programmed to perform the step of matching one or more non-URL data items to ensure that only valid URLs are identified.
10. The system of claim 8, wherein the processor is programmed to perform the step of pre-processing the data item to perform at least one of sorting the data item according to a client-level sorting or standardizing columns in the data item.
11. The system of claim 8, wherein the processor is programmed to perform the step of identifying merchant metadata associated with the data item and validating the merchant metadata against one or more IP addresses.
12. The system of claim 11, wherein the processor is programmed to perform the step of cross-checking the at least one URL based on one or more of a business name or an e-mail address.
13. The system of claim 8, wherein the processor is programmed to perform the step of validating the at least one URL utilizing at least one matching criterion including one or more of a business name, a merchant name, a doing business as (DBA) name, a transacting business as (T/A) name, a street address, a city, a postal code, a state, a province, a telephone number, an e-mail address, a name, a business description, or a county.
14. The system of claim 8, wherein the processor is programmed to perform the step of scoring the at least one URL to determine relevancy of the at least one URL to metadata associated with a merchant.
15. A non-transitory, computer-readable medium having computer-readable instructions stored thereon which, when executed by a processor, cause the processor to perform the steps of: receiving a data item at the processor; processing the data item to identify at least one URL present in the data item; processing the at least one URL using at least one of a fuzzy matching algorithm or an exact matching algorithm to validate the at least one URL; and if the at least one URL is validated by the fuzzy matching algorithm or the exact matching algorithm, generating and transmitting an output file that includes the at least one URL.
16. The computer-readable medium of claim 15, further comprising instructions for causing the processor to perform the step of matching one or more non-URL data items to ensure that only valid URLs are identified.
17. The computer-readable medium of claim 15, further comprising instructions for causing the processor to perform the step of pre-processing the data item to perform at least one of sorting the data item according to a client-level sorting or standardizing columns in the data item.
18. The computer-readable medium of claim 15, further comprising instructions for causing the processor to perform the step of identifying merchant metadata associated with the data item and validating the merchant metadata against one or more IP addresses.
19. The computer-readable medium of claim 18, further comprising instructions for causing the processor to perform the step of cross-checking the at least one URL based on one or more of a business name or an e-mail address.
20. The computer-readable medium of claim 15, further comprising instructions for causing the processor to perform the step of validating the at least one URL utilizing at least one matching criterion including one or more of a business name, a merchant name, a doing business as (DBA) name, a transacting business as (T/A) name, a street address, a city, a postal code, a state, a province, a telephone number, an e-mail address, a name, a business description, or a county.
21. The computer-readable medium of claim 15, further comprising instructions for causing the processor to perform the step of scoring the at least one URL to determine relevancy of the at least one URL to metadata associated with a merchant.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The foregoing features of the invention will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
DETAILED DESCRIPTION
[0014] The present disclosure relates to systems and methods for automatic URL identification from data, as discussed in detail below in connection with
[0015]
[0016]
[0017] In step 40, the system “scrapes” the content of each URL in the list, one by one, including the URL's main page content and its “Contact Us” page content. In step 48, the system validates the incoming metadata (which could be standardized) against the URL content using one or more matching algorithms, which could apply fuzzy (approximate) or exact matching processes to the URLs. The matching process follows the logic of
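The validation of step 48 could be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the function names, the 0.85 similarity threshold, the word-window comparison, and the rule that any single matching metadata field validates a URL are all hypothetical, and the page text is assumed to have been scraped already. The approximate comparison uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever fuzzy matching algorithm an embodiment uses.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def exact_match(value: str, page_text: str) -> bool:
    """True if the normalized, non-empty value appears verbatim in the page."""
    target = normalize(value)
    return bool(target) and target in normalize(page_text)


def fuzzy_match(value: str, page_text: str, threshold: float = 0.85) -> bool:
    """True if some run of page words is approximately equal to the value.

    The threshold is an assumed tuning parameter, not a value from the text.
    """
    target = normalize(value)
    size = len(target.split())
    if size == 0:
        return False
    words = normalize(page_text).split()
    # Slide a window with as many words as the value across the page text.
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            return True
    return False


def validate_url(metadata: dict[str, str], page_text: str) -> bool:
    """Hypothetical rule: the URL is valid if any metadata field matches."""
    return any(
        exact_match(value, page_text) or fuzzy_match(value, page_text)
        for value in metadata.values()
    )


# Example: a misspelled business name fails the exact match but passes fuzzily.
page = "Welcome to Acme Bakery LLC. Contact Us: 123 Main Street, Seattle"
print(validate_url({"business_name": "Acme Bakry"}, page))
```

A stricter embodiment might require several fields (for example, business name and postal code) to match before a URL is treated as validated.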
[0018] If the URLs are successfully validated, step 54 occurs, wherein the system appends the URLs to the input data obtained in step 32. This could be performed by appending the URLs to columns of data in the input data. Finally, in step 56, the system generates and transmits an output file which includes the URLs and the input data.
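Steps 54 and 56 could be illustrated with the short sketch below. It assumes, purely for illustration, that the input data is a list of row dictionaries, that rows are keyed by a hypothetical `merchant_id` column, that validated URLs arrive as a mapping from that key to a URL, and that the output file is CSV; none of these details are specified in the disclosure.

```python
import csv
import io


def append_urls(rows, validated, key="merchant_id", column="url"):
    """Return copies of the input rows with a URL column appended.

    Rows without a validated URL receive an empty string in the new column.
    """
    return [{**row, column: validated.get(row[key], "")} for row in rows]


def write_output(rows, handle):
    """Write rows as CSV, keeping the input column order with the URL last."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Example input data (hypothetical) and one validated URL.
input_rows = [
    {"merchant_id": "1", "business_name": "Acme Bakery"},
    {"merchant_id": "2", "business_name": "Zenith Plumbing"},
]
validated_urls = {"1": "https://acme-bakery.example"}

output_rows = append_urls(input_rows, validated_urls)
buffer = io.StringIO()  # stands in for the transmitted output file
write_output(output_rows, buffer)
print(buffer.getvalue())
```

Copying each row rather than mutating it keeps the original input data intact, which could be useful if the same input is later passed through a review cycle.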
[0019]
[0020] The review cycle 78 includes a quality assurance (QA) review process 80, wherein the system reviews and confirms the accuracy of the URLs returned by the system. In step 82, the system also optionally allows for a manual review process, wherein one or more users of the system can manually review and confirm the accuracy of URLs returned by the system. In step 84, once the QA review process 80 is complete, the system generates and delivers a report that includes and summarizes all of the URLs returned by the system. Optionally, in step 80, the system can determine whether an e-commerce platform is in communication with the system of the present disclosure.
[0021] As shown in the workflow 72 of
[0022]
[0023]
[0024]
[0025] Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make any variations and modifications without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected by Letters Patent is set forth in the following claims.