Class WebcrawlerConnector.DocumentURLFilter

  • Enclosing class:
    WebcrawlerConnector

    protected class WebcrawlerConnector.DocumentURLFilter
    extends java.lang.Object
    This class describes the URL filtering information (for crawling and indexing) obtained from a digested DocumentSpecification.
    • Constructor Summary

      Constructors 
      Constructor Description
      DocumentURLFilter​(org.apache.manifoldcf.core.interfaces.Specification spec)
      Process a document specification to produce a filter.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected java.lang.String findSpecifiedContent​(java.lang.String currentURI, java.util.List<java.util.regex.Pattern> patterns)  
      WebcrawlerConnector.CanonicalizationPolicies getCanonicalizationPolicies()
      Get canonicalization policies
      java.lang.String getVersionString()
      Get whatever contribution to the version string should come from this data.
      boolean isDocumentAndHostLegal​(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
      Check if both a document and host are legal.
      boolean isDocumentContentIndexable​(java.lang.String documentIdentifier)  
      java.lang.String isDocumentIndexable​(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
      Check if the document identifier is indexable, and return the indexing URL if found.
      boolean isDocumentLegal​(java.lang.String url, org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
      Check if the document identifier is legal.
      boolean isHostLegal​(java.lang.String host)
      Check if a host is legal.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • versionString

        protected java.lang.String versionString
        The version string
      • includePatterns

        protected final java.util.List<java.util.regex.Pattern> includePatterns
        The arraylist of include patterns
      • excludePatterns

        protected final java.util.List<java.util.regex.Pattern> excludePatterns
        The arraylist of exclude patterns
      • includeIndexPatterns

        protected final java.util.List<java.util.regex.Pattern> includeIndexPatterns
        The arraylist of index include patterns
      • excludeIndexPatterns

        protected final java.util.List<java.util.regex.Pattern> excludeIndexPatterns
        The arraylist of index exclude patterns
      • seedHosts

        protected java.util.Set<java.lang.String> seedHosts
        The set of seed hosts used to limit URLs, if non-null
      • excludeContentIndexPatterns

        protected final java.util.List<java.util.regex.Pattern> excludeContentIndexPatterns
        The list of content exclusion patterns
    • Constructor Detail

      • DocumentURLFilter

        public DocumentURLFilter​(org.apache.manifoldcf.core.interfaces.Specification spec)
                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Process a document specification to produce a filter. The regular expressions in the document specification are expected to be well formed; this should be checked at save time to prevent errors. Any include or exclude regexp that fails to compile here is therefore skipped.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
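The skip-on-syntax-error behavior described for the constructor can be illustrated with a minimal standalone sketch. The class PatternCompileSketch and its compileAll method are hypothetical names introduced here for illustration; they are not part of ManifoldCF, and the connector's actual parsing of the Specification is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration only; not the connector's actual code.
class PatternCompileSketch {
    // Compile each regexp in order, silently skipping any that is malformed.
    static List<Pattern> compileAll(List<String> regexps) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regexp : regexps) {
            try {
                compiled.add(Pattern.compile(regexp));
            } catch (PatternSyntaxException e) {
                // Malformed regexp: skip it rather than fail the whole filter.
            }
        }
        return compiled;
    }

    public static void main(String[] args) {
        // The second entry has an unclosed character class and is skipped.
        List<Pattern> patterns =
            compileAll(Arrays.asList("^https?://example\\.com/", "[unclosed"));
        System.out.println(patterns.size());
    }
}
```

Skipping a bad pattern keeps the remaining patterns usable, which is why validation at save time matters: a skipped include or exclude pattern changes what gets crawled without raising an error.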
    • Method Detail

      • getVersionString

        public java.lang.String getVersionString()
        Get whatever contribution to the version string should come from this data.
      • isDocumentAndHostLegal

        public boolean isDocumentAndHostLegal​(java.lang.String url,
                                              org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Check if both a document and host are legal.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • isHostLegal

        public boolean isHostLegal​(java.lang.String host)
        Check if a host is legal.
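As a rough sketch of how a seed-host restriction like the seedHosts field might be applied, assume that a null set means "no restriction" and that hosts are compared exactly as given. Both points are assumptions made for this example; SeedHostSketch is a hypothetical class, and any case normalization the connector performs is not covered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the connector's actual code.
class SeedHostSketch {
    // Null means hosts are not restricted (assumption for this sketch).
    private final Set<String> seedHosts;

    SeedHostSketch(Set<String> seedHosts) {
        this.seedHosts = seedHosts;
    }

    // A host is legal when there is no restriction or it is one of the seeds.
    boolean isHostLegal(String host) {
        if (seedHosts == null) {
            return true;
        }
        return seedHosts.contains(host);
    }

    public static void main(String[] args) {
        SeedHostSketch limited =
            new SeedHostSketch(new HashSet<>(Arrays.asList("example.com")));
        System.out.println(limited.isHostLegal("example.com"));
        System.out.println(limited.isHostLegal("other.org"));
    }
}
```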
      • isDocumentLegal

        public boolean isDocumentLegal​(java.lang.String url,
                                       org.apache.manifoldcf.crawler.interfaces.IHistoryActivity activities)
                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Check if the document identifier is legal.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
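One plausible reading of the include and exclude pattern lists is that a URL is accepted when at least one include pattern matches and no exclude pattern matches. The sketch below illustrates that rule under two assumptions: matching uses Matcher.find() (a match anywhere in the URL), and no scheme or host checks are applied. IncludeExcludeSketch is a hypothetical class, not the connector's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the connector's actual code.
class IncludeExcludeSketch {
    // True if any pattern is found anywhere in the URL.
    static boolean matchesAny(String url, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Legal means: some include pattern matches and no exclude pattern does.
    static boolean isLegal(String url, List<Pattern> include, List<Pattern> exclude) {
        return matchesAny(url, include) && !matchesAny(url, exclude);
    }

    public static void main(String[] args) {
        List<Pattern> include = Arrays.asList(Pattern.compile("^https://example\\.com/"));
        List<Pattern> exclude = Arrays.asList(Pattern.compile("\\.pdf$"));
        System.out.println(isLegal("https://example.com/docs/page.html", include, exclude));
        System.out.println(isLegal("https://example.com/docs/file.pdf", include, exclude));
    }
}
```

With an empty include list this sketch rejects every URL, so at least one include pattern is needed for anything to pass.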
      • isDocumentIndexable

        public java.lang.String isDocumentIndexable​(java.lang.String url,
                                                    org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
                                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Check if the document identifier is indexable, and return the indexing URL if found.
        Returns:
        null if the URL doesn't match or should not be ingested; otherwise the indexing URL.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • isDocumentContentIndexable

        public boolean isDocumentContentIndexable​(java.lang.String documentIdentifier)
                                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • findSpecifiedContent

        protected java.lang.String findSpecifiedContent​(java.lang.String currentURI,
                                                        java.util.List<java.util.regex.Pattern> patterns)
                                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException