diff --git a/language/classify_text/README.rst b/language/classify_text/README.rst new file mode 100644 index 00000000000..0a61591bc22 --- /dev/null +++ b/language/classify_text/README.rst @@ -0,0 +1,126 @@ +.. This file is automatically generated. Do not edit this file directly. + +Google Cloud Natural Language API Python Samples +=============================================================================== + +This directory contains samples for Google Cloud Natural Language API. The `Google Cloud Natural Language API`_ provides natural language understanding technologies to developers. + +This tutorial demonstrates how to use the ``classify_text`` method to classify the content categories of text files, and how to use the results to compare texts by their similarity to each other. See the `tutorial page`_ for details about this sample. + +.. _tutorial page: https://cloud.google.com/natural-language/docs/classify-text-tutorial + + + + +.. _Google Cloud Natural Language API: https://cloud.google.com/natural-language/docs/ + +Setup +------------------------------------------------------------------------------- + + +Authentication +++++++++++++++ + +Authentication is typically done through `Application Default Credentials`_, +which means you do not have to change the code to authenticate as long as +your environment has credentials. You have a few options for setting up +authentication: + +#. When running locally, use the `Google Cloud SDK`_: + + .. code-block:: bash + + gcloud auth application-default login + + +#. When running on App Engine or Compute Engine, credentials are already + set up. However, you may need to configure your Compute Engine instance + with `additional scopes`_. + +#. You can create a `Service Account key file`_. This file can be used to + authenticate to Google Cloud Platform services from any environment. To use + the file, set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable to + the path to the key file, for example: + + ..
code-block:: bash + + export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json + +.. _Application Default Credentials: https://cloud.google.com/docs/authentication#getting_credentials_for_server-centric_flow +.. _additional scopes: https://cloud.google.com/compute/docs/authentication#using +.. _Service Account key file: https://developers.google.com/identity/protocols/OAuth2ServiceAccount#creatinganaccount + +Install Dependencies +++++++++++++++++++++ + +#. Install `pip`_ and `virtualenv`_ if you do not already have them. + +#. Create a virtualenv. Samples are compatible with Python 2.7 and 3.4+. + + .. code-block:: bash + + $ virtualenv env + $ source env/bin/activate + +#. Install the dependencies needed to run the samples. + + .. code-block:: bash + + $ pip install -r requirements.txt + +.. _pip: https://pip.pypa.io/ +.. _virtualenv: https://virtualenv.pypa.io/ + +Samples +------------------------------------------------------------------------------- + +Classify Text Tutorial ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + + + +To run this sample: + +.. code-block:: bash + + $ python classify_text_tutorial.py + + usage: classify_text_tutorial.py [-h] + {classify,index,query,query-category} ... + + Using the classify_text method to cluster texts. + + positional arguments: + {classify,index,query,query-category} + classify Classify the input text into categories. + index Classify each text file in a directory and write the + results to the index_file. + query Find the indexed files that are the most similar to + the query text. + query-category Find the indexed files that are the most similar to + the query label. 
The list of all available labels: + https://cloud.google.com/natural- + language/docs/categories + + optional arguments: + -h, --help show this help message and exit + + + + +The client library +------------------------------------------------------------------------------- + +This sample uses the `Google Cloud Client Library for Python`_. +You can read the documentation for more details on API usage and use GitHub +to `browse the source`_ and `report issues`_. + +.. _Google Cloud Client Library for Python: + https://googlecloudplatform.github.io/google-cloud-python/ +.. _browse the source: + https://github.com/GoogleCloudPlatform/google-cloud-python +.. _report issues: + https://github.com/GoogleCloudPlatform/google-cloud-python/issues + + +.. _Google Cloud SDK: https://cloud.google.com/sdk/ \ No newline at end of file diff --git a/language/classify_text/README.rst.in b/language/classify_text/README.rst.in new file mode 100644 index 00000000000..42e8f061a5d --- /dev/null +++ b/language/classify_text/README.rst.in @@ -0,0 +1,26 @@ +# This file is used to generate README.rst + +product: + name: Google Cloud Natural Language API + short_name: Cloud Natural Language API + url: https://cloud.google.com/natural-language/docs/ + description: > + The `Google Cloud Natural Language API`_ provides natural language + understanding technologies to developers. + + + This tutorial demonstrates how to use the ``classify_text`` method to classify the content categories of text files, and how to use the results to compare texts by their similarity to each other. See the `tutorial page`_ for details about this sample. + + + ..
_tutorial page: https://cloud.google.com/natural-language/docs/classify-text-tutorial + +setup: +- auth +- install_deps + +samples: +- name: Classify Text Tutorial + file: classify_text_tutorial.py + show_help: true + +cloud_client_library: true diff --git a/language/classify_text/classify_text_tutorial.py b/language/classify_text/classify_text_tutorial.py new file mode 100644 index 00000000000..08a03e98212 --- /dev/null +++ b/language/classify_text/classify_text_tutorial.py @@ -0,0 +1,261 @@ +#!/usr/bin/env python + +# Copyright 2017, Google, Inc. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# [START classify_text_tutorial] +"""Use the classify_text method to find content categories of text files, +then use the content category labels to compare text similarity. + +For more information, see the tutorial page at +https://cloud.google.com/natural-language/docs/classify-text-tutorial. +""" + +# [START classify_text_tutorial_import] +import argparse +import io +import json +import os + +from google.cloud import language_v1beta2 +from google.cloud.language_v1beta2 import enums +from google.cloud.language_v1beta2 import types + +import numpy +import six +# [END classify_text_tutorial_import] + + +# [START def_classify] +def classify(text, verbose=True): + """Classify the input text into categories. 
""" + + language_client = language_v1beta2.LanguageServiceClient() + + document = types.Document( + content=text, + type=enums.Document.Type.PLAIN_TEXT) + response = language_client.classify_text(document) + categories = response.categories + + result = {} + + for category in categories: + # Turn the categories into a dictionary of the form: + # {category.name: category.confidence}, so that they can + # be treated as a sparse vector. + result[category.name] = category.confidence + + if verbose: + print(text) + for category in categories: + print(u'=' * 20) + print(u'{:<16}: {}'.format('category', category.name)) + print(u'{:<16}: {}'.format('confidence', category.confidence)) + + return result +# [END def_classify] + + +# [START def_index] +def index(path, index_file): + """Classify each text file in a directory and write + the results to the index_file. + """ + + result = {} + for filename in os.listdir(path): + file_path = os.path.join(path, filename) + + if not os.path.isfile(file_path): + continue + + try: + with io.open(file_path, 'r') as f: + text = f.read() + categories = classify(text, verbose=False) + + result[filename] = categories + except Exception: + print('Failed to process {}'.format(file_path)) + + with io.open(index_file, 'w') as f: + f.write(six.text_type(json.dumps(result))) + + print('Texts indexed in file: {}'.format(index_file)) + return result +# [END def_index] + + +# [START def_split_labels] +def split_labels(categories): + """The category labels are of the form "/a/b/c" up to three levels, + for example "/Computers & Electronics/Software", and these labels + are used as keys in the categories dictionary, whose values are + confidence scores. + + The split_labels function splits the keys into individual levels + while duplicating the confidence score, which allows a natural + boost in how we calculate similarity when more levels are in common. 
+ + Example: + If we have + + x = {"/a/b/c": 0.5} + y = {"/a/b": 0.5} + z = {"/a": 0.5} + + Then x and y are considered more similar than y and z. + """ + _categories = {} + for name, confidence in six.iteritems(categories): + labels = [label for label in name.split('/') if label] + for label in labels: + _categories[label] = confidence + + return _categories +# [END def_split_labels] + + +# [START def_similarity] +def similarity(categories1, categories2): + """Cosine similarity of the categories treated as sparse vectors.""" + categories1 = split_labels(categories1) + categories2 = split_labels(categories2) + + norm1 = numpy.linalg.norm(list(categories1.values())) + norm2 = numpy.linalg.norm(list(categories2.values())) + + # Return the smallest possible similarity if either set of categories is empty. + if norm1 == 0 or norm2 == 0: + return 0.0 + + # Compute the cosine similarity. + dot = 0.0 + for label, confidence in six.iteritems(categories1): + dot += confidence * categories2.get(label, 0.0) + + return dot / (norm1 * norm2) +# [END def_similarity] + + +# [START def_query] +def query(index_file, text, n_top=3): + """Find the indexed files that are the most similar to + the query text. + """ + + with io.open(index_file, 'r') as f: + index = json.load(f) + + # Get the categories of the query text. 
+ query_categories = classify(text, verbose=False) + + similarities = [] + for filename, categories in six.iteritems(index): + similarities.append( + (filename, similarity(query_categories, categories))) + + similarities = sorted(similarities, key=lambda p: p[1], reverse=True) + + print('=' * 20) + print('Query: {}\n'.format(text)) + for category, confidence in six.iteritems(query_categories): + print('\tCategory: {}, confidence: {}'.format(category, confidence)) + print('\nMost similar {} indexed texts:'.format(n_top)) + for filename, sim in similarities[:n_top]: + print('\tFilename: {}'.format(filename)) + print('\tSimilarity: {}'.format(sim)) + print('\n') + + return similarities +# [END def_query] + + +# [START def_query_category] +def query_category(index_file, category_string, n_top=3): + """Find the indexed files that are the most similar to + the query label. + + The list of all available labels: + https://cloud.google.com/natural-language/docs/categories + """ + + with io.open(index_file, 'r') as f: + index = json.load(f) + + # Make the category_string into a dictionary so that it is + # of the same format as what we get by calling classify. 
+ query_categories = {category_string: 1.0} + + similarities = [] + for filename, categories in six.iteritems(index): + similarities.append( + (filename, similarity(query_categories, categories))) + + similarities = sorted(similarities, key=lambda p: p[1], reverse=True) + + print('=' * 20) + print('Query: {}\n'.format(category_string)) + print('\nMost similar {} indexed texts:'.format(n_top)) + for filename, sim in similarities[:n_top]: + print('\tFilename: {}'.format(filename)) + print('\tSimilarity: {}'.format(sim)) + print('\n') + + return similarities +# [END def_query_category] + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + subparsers = parser.add_subparsers(dest='command') + classify_parser = subparsers.add_parser( + 'classify', help=classify.__doc__) + classify_parser.add_argument( + 'text', help='The text to be classified. ' + 'The text needs to have at least 20 tokens.') + index_parser = subparsers.add_parser( + 'index', help=index.__doc__) + index_parser.add_argument( + 'path', help='The directory that contains ' + 'text files to be indexed.') + index_parser.add_argument( + '--index_file', help='Filename for the output JSON.', + default='index.json') + query_parser = subparsers.add_parser( + 'query', help=query.__doc__) + query_parser.add_argument( + 'index_file', help='Path to the index JSON file.') + query_parser.add_argument( + 'text', help='Query text.') + query_category_parser = subparsers.add_parser( + 'query-category', help=query_category.__doc__) + query_category_parser.add_argument( + 'index_file', help='Path to the index JSON file.') + query_category_parser.add_argument( + 'category', help='Query category.') + + args = parser.parse_args() + + if args.command == 'classify': + classify(args.text) + if args.command == 'index': + index(args.path, args.index_file) + if args.command == 'query': + query(args.index_file, args.text) + if args.command == 
'query-category': + query_category(args.index_file, args.category) +# [END classify_text_tutorial] diff --git a/language/classify_text/classify_text_tutorial_test.py b/language/classify_text/classify_text_tutorial_test.py new file mode 100644 index 00000000000..305cf53fede --- /dev/null +++ b/language/classify_text/classify_text_tutorial_test.py @@ -0,0 +1,90 @@ +# Copyright 2016, Google, Inc. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import classify_text_tutorial +import pytest + + +OUTPUT = 'index.json' +RESOURCES = os.path.join(os.path.dirname(__file__), 'resources') +QUERY_TEXT = """Google Home enables users to speak voice commands to interact +with services through the Home\'s intelligent personal assistant called +Google Assistant. 
A large number of services, both in-house and third-party, +are integrated, allowing users to listen to music, look at videos or photos, +or receive news updates entirely by voice.""" +QUERY_CATEGORY = '/Computers & Electronics/Software' + + +@pytest.fixture(scope='session') +def index_file(tmpdir_factory): + temp_file = tmpdir_factory.mktemp('tmp').join(OUTPUT) + temp_out = temp_file.strpath + classify_text_tutorial.index(os.path.join(RESOURCES, 'texts'), temp_out) + return temp_file + + +def test_classify(capsys): + with open(os.path.join(RESOURCES, 'query_text1.txt'), 'r') as f: + text = f.read() + classify_text_tutorial.classify(text) + out, err = capsys.readouterr() + assert 'category' in out + + +def test_index(capsys, tmpdir): + temp_dir = tmpdir.mkdir('tmp') + temp_out = temp_dir.join(OUTPUT).strpath + + classify_text_tutorial.index(os.path.join(RESOURCES, 'texts'), temp_out) + out, err = capsys.readouterr() + + assert OUTPUT in out + assert len(temp_dir.listdir()) == 1 + + +def test_query_text(capsys, index_file): + temp_out = index_file.strpath + + classify_text_tutorial.query(temp_out, QUERY_TEXT) + out, err = capsys.readouterr() + + assert 'Filename: cloud_computing.txt' in out + + +def test_query_category(capsys, index_file): + temp_out = index_file.strpath + + classify_text_tutorial.query_category(temp_out, QUERY_CATEGORY) + out, err = capsys.readouterr() + + assert 'Filename: cloud_computing.txt' in out + + +def test_split_labels(): + categories = {'/a/b/c': 1.0} + split_categories = {'a': 1.0, 'b': 1.0, 'c': 1.0} + assert classify_text_tutorial.split_labels(categories) == split_categories + + +def test_similarity(): + empty_categories = {} + categories1 = {'/a/b/c': 1.0, '/d/e': 1.0} + categories2 = {'/a/b': 1.0} + + assert classify_text_tutorial.similarity( + empty_categories, categories1) == 0.0 + assert classify_text_tutorial.similarity(categories1, categories1) > 0.99 + assert classify_text_tutorial.similarity(categories1, categories2) > 0 + 
assert classify_text_tutorial.similarity(categories1, categories2) < 1 diff --git a/language/classify_text/requirements.txt b/language/classify_text/requirements.txt new file mode 100644 index 00000000000..10069f1801e --- /dev/null +++ b/language/classify_text/requirements.txt @@ -0,0 +1,2 @@ +google-cloud-language==0.29.0 +numpy==1.13.1 diff --git a/language/classify_text/resources/query_text1.txt b/language/classify_text/resources/query_text1.txt new file mode 100644 index 00000000000..304727304d1 --- /dev/null +++ b/language/classify_text/resources/query_text1.txt @@ -0,0 +1 @@ +Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice. diff --git a/language/classify_text/resources/query_text2.txt b/language/classify_text/resources/query_text2.txt new file mode 100644 index 00000000000..eef573c6007 --- /dev/null +++ b/language/classify_text/resources/query_text2.txt @@ -0,0 +1 @@ +The Hitchhiker's Guide to the Galaxy is the first of five books in the Hitchhiker's Guide to the Galaxy comedy science fiction "trilogy" by Douglas Adams (with the sixth written by Eoin Colfer). \ No newline at end of file diff --git a/language/classify_text/resources/query_text3.txt b/language/classify_text/resources/query_text3.txt new file mode 100644 index 00000000000..1337d3c6477 --- /dev/null +++ b/language/classify_text/resources/query_text3.txt @@ -0,0 +1 @@ +Goodnight Moon is an American children's picture book written by Margaret Wise Brown and illustrated by Clement Hurd. It was published on September 3, 1947, and is a highly acclaimed example of a bedtime story. 
\ No newline at end of file diff --git a/language/classify_text/resources/texts/android.txt b/language/classify_text/resources/texts/android.txt new file mode 100644 index 00000000000..29dc1449c55 --- /dev/null +++ b/language/classify_text/resources/texts/android.txt @@ -0,0 +1 @@ +Android is a mobile operating system developed by Google, based on the Linux kernel and designed primarily for touchscreen mobile devices such as smartphones and tablets. diff --git a/language/classify_text/resources/texts/cat_in_the_hat.txt b/language/classify_text/resources/texts/cat_in_the_hat.txt new file mode 100644 index 00000000000..bb5a853c694 --- /dev/null +++ b/language/classify_text/resources/texts/cat_in_the_hat.txt @@ -0,0 +1 @@ +The Cat in the Hat is a children's book written and illustrated by Theodor Geisel under the pen name Dr. Seuss and first published in 1957. The story centers on a tall anthropomorphic cat, who wears a red and white-striped hat and a red bow tie. \ No newline at end of file diff --git a/language/classify_text/resources/texts/cloud_computing.txt b/language/classify_text/resources/texts/cloud_computing.txt new file mode 100644 index 00000000000..88172adf1f4 --- /dev/null +++ b/language/classify_text/resources/texts/cloud_computing.txt @@ -0,0 +1 @@ +Cloud computing is a computing-infrastructure and software model for enabling ubiquitous access to shared pools of configurable resources (such as computer networks, servers, storage, applications and services), which can be rapidly provisioned with minimal management effort, often over the Internet. 
\ No newline at end of file diff --git a/language/classify_text/resources/texts/eclipse.txt b/language/classify_text/resources/texts/eclipse.txt new file mode 100644 index 00000000000..5d16217e520 --- /dev/null +++ b/language/classify_text/resources/texts/eclipse.txt @@ -0,0 +1 @@ +A solar eclipse (as seen from the planet Earth) is a type of eclipse that occurs when the Moon passes between the Sun and Earth, and when the Moon fully or partially blocks (occults) the Sun. diff --git a/language/classify_text/resources/texts/eclipse_of_the_sun.txt b/language/classify_text/resources/texts/eclipse_of_the_sun.txt new file mode 100644 index 00000000000..7236fc9d806 --- /dev/null +++ b/language/classify_text/resources/texts/eclipse_of_the_sun.txt @@ -0,0 +1 @@ +Eclipse of the Sun is the debut novel by English author Phil Whitaker. It won the 1997 John Llewellyn Rhys Prize a Betty Trask Award in 1998, and was shortlisted for the 1997 Whitbread First Novel Award. diff --git a/language/classify_text/resources/texts/email.txt b/language/classify_text/resources/texts/email.txt new file mode 100644 index 00000000000..3d430527b75 --- /dev/null +++ b/language/classify_text/resources/texts/email.txt @@ -0,0 +1 @@ +Electronic mail (email or e-mail) is a method of exchanging messages between people using electronics. Email first entered substantial use in the 1960s and by the mid-1970s had taken the form now recognized as email. \ No newline at end of file diff --git a/language/classify_text/resources/texts/gcp.txt b/language/classify_text/resources/texts/gcp.txt new file mode 100644 index 00000000000..1ed09b2c758 --- /dev/null +++ b/language/classify_text/resources/texts/gcp.txt @@ -0,0 +1 @@ +Google Cloud Platform, offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube. 
Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning. diff --git a/language/classify_text/resources/texts/gmail.txt b/language/classify_text/resources/texts/gmail.txt new file mode 100644 index 00000000000..89c9704b117 --- /dev/null +++ b/language/classify_text/resources/texts/gmail.txt @@ -0,0 +1 @@ +Gmail is a free, advertising-supported email service developed by Google. Users can access Gmail on the web and through mobile apps for Android and iOS, as well as through third-party programs that synchronize email content through POP or IMAP protocols. \ No newline at end of file diff --git a/language/classify_text/resources/texts/google.txt b/language/classify_text/resources/texts/google.txt new file mode 100644 index 00000000000..06828635931 --- /dev/null +++ b/language/classify_text/resources/texts/google.txt @@ -0,0 +1 @@ +Google is an American multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, search, cloud computing, software, and hardware. diff --git a/language/classify_text/resources/texts/harry_potter.txt b/language/classify_text/resources/texts/harry_potter.txt new file mode 100644 index 00000000000..339c10af05a --- /dev/null +++ b/language/classify_text/resources/texts/harry_potter.txt @@ -0,0 +1 @@ +Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. 
\ No newline at end of file diff --git a/language/classify_text/resources/texts/matilda.txt b/language/classify_text/resources/texts/matilda.txt new file mode 100644 index 00000000000..e1539d7ee88 --- /dev/null +++ b/language/classify_text/resources/texts/matilda.txt @@ -0,0 +1 @@ +Matilda is a book by British writer Roald Dahl. Matilda won the Children's Book Award in 1999. It was published in 1988 by Jonathan Cape in London, with 232 pages and illustrations by Quentin Blake. \ No newline at end of file diff --git a/language/classify_text/resources/texts/mobile_phone.txt b/language/classify_text/resources/texts/mobile_phone.txt new file mode 100644 index 00000000000..725e22ef3a9 --- /dev/null +++ b/language/classify_text/resources/texts/mobile_phone.txt @@ -0,0 +1 @@ +A mobile phone is a portable device that can make and receive calls over a radio frequency link while the user is moving within a telephone service area. The radio frequency link establishes a connection to the switching systems of a mobile phone operator, which provides access to the public switched telephone network (PSTN). \ No newline at end of file diff --git a/language/classify_text/resources/texts/mr_fox.txt b/language/classify_text/resources/texts/mr_fox.txt new file mode 100644 index 00000000000..354feced2af --- /dev/null +++ b/language/classify_text/resources/texts/mr_fox.txt @@ -0,0 +1 @@ +Fantastic Mr Fox is a children's novel written by British author Roald Dahl. It was published in 1970, by George Allen & Unwin in the UK and Alfred A. Knopf in the U.S., with illustrations by Donald Chaffin. 
\ No newline at end of file diff --git a/language/classify_text/resources/texts/wireless.txt b/language/classify_text/resources/texts/wireless.txt new file mode 100644 index 00000000000..d742331c464 --- /dev/null +++ b/language/classify_text/resources/texts/wireless.txt @@ -0,0 +1 @@ +Wireless communication, or sometimes simply wireless, is the transfer of information or power between two or more points that are not connected by an electrical conductor. The most common wireless technologies use radio waves. \ No newline at end of file
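
Note on the similarity measure: the label-splitting and cosine-similarity steps in classify_text_tutorial.py can be tried without API credentials. The block below is a minimal sketch, not the sample's own code: it assumes Python 3 and uses only the standard library (math instead of numpy, dict.items() instead of six), and it reuses the x, y, z dictionaries from the split_labels docstring as input.

```python
import math


def split_labels(categories):
    """Split "/a/b/c" keys into single levels, copying the confidence."""
    result = {}
    for name, confidence in categories.items():
        for label in name.split('/'):
            if label:  # skip the empty string before the leading slash
                result[label] = confidence
    return result


def similarity(categories1, categories2):
    """Cosine similarity of two category dicts treated as sparse vectors."""
    v1 = split_labels(categories1)
    v2 = split_labels(categories2)

    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))

    # An empty category dict has zero norm; report no similarity.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    dot = sum(c * v2.get(label, 0.0) for label, c in v1.items())
    return dot / (norm1 * norm2)


# Inputs taken from the split_labels docstring.
x = {"/a/b/c": 0.5}
y = {"/a/b": 0.5}
z = {"/a": 0.5}

print(similarity(x, y))
print(similarity(y, z))
```

With these inputs the first score comes out higher than the second, which matches the docstring's claim that x and y are more similar than y and z: sharing more label levels adds more terms to the dot product.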